mapbox / dynamodb-replicator

module for dynamodb multi-region replication
ISC License

dyno failing to restore incremental backup? #85

Open keen99 opened 7 years ago

keen99 commented 7 years ago

simple test case with a small table:

ENV=dev
TABLE=dsrtest2

. config.env.$ENV

bin/incremental-backfill.js $AWS_REGION/$TABLE s3://$BackupBucket/$BackupPrefix

bin/incremental-snapshot.js s3://$BackupBucket/$BackupPrefix/$TABLE  s3://$BackupBucket/${TABLE}-snapshot

s3print s3://$BackupBucket/${TABLE}-snapshot | dyno put $AWS_REGION/dsr-test-restore-$TABLE
%% sh test-backup.sh
12 - 11.89/s
[Fri, 09 Dec 2016 23:54:59 GMT] [info] [incremental-snapshot] Starting snapshot from s3://dsr-ddb-rep-testing/testprefix/dsrtest2 to s3://dsr-ddb-rep-testing/dsrtest2-snapshot
[Fri, 09 Dec 2016 23:55:01 GMT] [info] [incremental-snapshot] Starting upload of part #0, 0 bytes uploaded, 12 items uploaded @ 6.26 items/s
[Fri, 09 Dec 2016 23:55:01 GMT] [info] [incremental-snapshot] Uploaded snapshot to s3://dsr-ddb-rep-testing/dsrtest2-snapshot
[Fri, 09 Dec 2016 23:55:01 GMT] [info] [incremental-snapshot] Wrote 12 items and 148 bytes to snapshot
undefined:1
�
^

SyntaxError: Unexpected token  in JSON at position 0
    at Object.parse (native)
    at Function.module.exports.deserialize (/Users/draistrick/git/github/dynamodb-replicator/node_modules/dyno/lib/serialization.js:49:18)
    at Transform.Parser.parser._transform (/Users/draistrick/git/github/dynamodb-replicator/node_modules/dyno/bin/cli.js:94:25)
    at Transform._read (_stream_transform.js:167:10)
    at Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:307:12)
    at writeOrBuffer (_stream_writable.js:293:5)
    at Transform.Writable.write (_stream_writable.js:220:11)
    at Stream.ondata (stream.js:31:26)
    at emitOne (events.js:96:13)

Next step would be to diff the two tables - but the pipe to dyno fails. I've tried 1.0.0 and 1.3.0 with the same result.

What data format is dyno expecting? The file on s3 (I tried multiple tables, including real data tables) appears to be a binary blob:

cheese:~%% aws --region=us-west-2 s3 cp s3://dsr-ddb-rep-testing/dsrtest-snapshot -
m�1�
��ߠl�EG�EB�uL0\�Tuq�ݵ#������$L�6�/8�%Z�r�[d�p
���5h)��X�ֻ�j�ƪ�
 ۘ��&�WJ'❑��`�T�������􁒷
cheese:~%%

So maybe this is a problem with backfill? Or am I missing something? :)

2016-12-09 18:54:35        149 dsrtest-snapshot
2016-12-09 18:55:01        148 dsrtest2-snapshot
2016-12-09 18:37:20       1428 receipt_log_dev-01-snapshot
2016-12-09 18:53:15   13457328 showdownlive_dev-01-snapshot

keen99 commented 7 years ago

oh.

cheese:~%% aws --region=us-west-2 s3 cp s3://dsr-ddb-rep-testing/dsrtest-snapshot -|gzcat
{"a":{"S":"b"},"what":{"S":"new10"}}
{"b":{"S":"ccd"},"what":{"S":"a"}}
{"aa":{"S":"bb"},"what":{"S":"asdf"}}
{"a":{"S":"11"},"what":{"S":"new2"}}
{"a":{"S":"asdf"},"what":{"S":"sdfg"}}
{"what":{"S":"test2"}}
{"a":{"S":"fish faster 8"},"what":{"S":"new"}}
{"a":{"S":"bb"},"what":{"S":"bb"}}
{"a":{"S":"b"},"what":{"S":"new1"}}
{"a":{"S":"aa"},"b":{"S":"cc"},"what":{"S":"b"}}
{"a":{"S":"test1"},"what":{"S":"test"}}
{"what":{"S":"test4"}}
keen99 commented 7 years ago

that still doesn't work - s3print | gzcat fails; apparently s3print is outputting something extra...

%% s3print s3://$BackupBucket/${TABLE}-snapshot | gzcat
{"what":{"S":"new10"},"a":{"S":"b"}}
{"what":{"S":"test4"}}
{"what":{"S":"new2"},"a":{"S":"11"}}
{"b":{"S":"ccd"},"what":{"S":"a"}}
{"what":{"S":"new1"},"a":{"S":"b"}}
{"aa":{"S":"bb"},"what":{"S":"asdf"}}
{"b":{"S":"cc"},"what":{"S":"b"},"a":{"S":"aa"}}
{"what":{"S":"new"},"a":{"S":"fish faster 8"}}
{"what":{"S":"sdfg"},"a":{"S":"asdf"}}
{"what":{"S":"test2"},"a":{"S":"asdf"}}
{"what":{"S":"test"},"a":{"S":"test1"}}
{"what":{"S":"bb"},"a":{"S":"bb"}}
gzcat: (stdin): trailing garbage ignored

but

aws s3 cp s3://$BackupBucket/${TABLE}-snapshot - | gzcat | dyno put $AWS_REGION/dsr-test-restore-$TABLE

almost works - except dyno requires the table to already exist.

I guess we're not storing table-creation details with the backups, so we can't directly restore to a new table - we have to discover the old table's setup and recreate it first. :/
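For reference, standing the destination table up by hand first should get the pipe working end to end. This is only a sketch - it assumes the test table uses a single string hash key named what, so the key schema and throughput would need to match whatever the source table actually uses:

# create the destination table first, wait for it, then load the snapshot
aws --region $AWS_REGION dynamodb create-table \
  --table-name dsr-test-restore-$TABLE \
  --attribute-definitions AttributeName=what,AttributeType=S \
  --key-schema AttributeName=what,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

aws --region $AWS_REGION dynamodb wait table-exists --table-name dsr-test-restore-$TABLE

aws s3 cp s3://$BackupBucket/${TABLE}-snapshot - | gzcat | dyno put $AWS_REGION/dsr-test-restore-$TABLE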

rclark commented 7 years ago

dyno's export CLI writes a similar snapshot with a table description as the first line, and then dyno's import CLI can read that table description and create a new table. This doesn't exactly help for the incremental snapshot case (which doesn't even have knowledge of the table schema), but perhaps there's some code over there that can help you with a pipeline that utilizes dyno import?
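(For context, the dyno export/import round trip described above would look roughly like this - a sketch that assumes export writes to stdout and import reads from stdin with the same region/table argument style as the dyno put command earlier in the thread; check dyno's own docs for the exact invocation:)

# export emits the table description as the first line; import reads it and creates the table
dyno export $AWS_REGION/$TABLE > $TABLE.export
dyno import $AWS_REGION/dsr-test-restore-$TABLE < $TABLE.export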

keen99 commented 7 years ago

thanks. Seems like the logical process for the incrementals and the snapshots built from them would be to have the stream-triggered function maintain a describeTable object in s3 that's updated every time the trigger runs; snapshot creation would then include that, and we could just use dyno import to load it into a new table. Rolling back through time via the versioned bucket would always give the corresponding table schema (not that the schema can really change, I suppose, but maybe there are important parts of the config that matter and can change).
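In shell terms the idea is roughly this (a sketch only - the real work would happen in the Node trigger, and the key name just follows the bucket/prefix/tablename.description layout mentioned below):

# keep a current table description alongside the incremental backup
aws --region $AWS_REGION dynamodb describe-table --table-name $TABLE \
  | aws s3 cp - s3://$BackupBucket/$BackupPrefix/$TABLE.description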

keen99 commented 7 years ago

ok, I've got working logic to extract table descriptions at the time of the lambda events and store them. The only thing that's really likely to change is the scaling limits, but if I'm recreating a point in time, I'd rather have the whole point in time. :)

Would you be interested in this as a PR? It currently stores the description at bucket/prefix/tablename.description and would be non-impacting for existing workflows (except for the IAM policy updates to read it, I guess).

My next step is to update the s3-snapshot.js code to include the description in a form that dyno import can handle.

If you want this as a PR - would you prefer it to NOT attach the description if one doesn't exist? My options seem to be: no description, the description from s3, or the current live description if there isn't one on s3.

Not sure what the best approach would be for existing consumers - so I figured I'd ask. Maybe another CLI arg for the different path? Dunno.

I'll start working on that tomorrow for my needs - but happy to make it portable if you've got a direction you'd prefer.
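Roughly what I'm aiming for on the restore side (a sketch - it assumes dyno import reads a table description from the first line of stdin the way the export format is described above, and the stored describe-table output may need reshaping to match what dyno export actually emits):

# prepend the stored description, then the snapshot records, and let dyno import create + load the table
( aws s3 cp s3://$BackupBucket/$BackupPrefix/$TABLE.description - ; \
  aws s3 cp s3://$BackupBucket/${TABLE}-snapshot - | gzcat ) \
  | dyno import $AWS_REGION/dsr-test-restore-$TABLE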

rclark commented 7 years ago

I'm glad you're finding something that can work for you here.

There are a couple of issues that I could imagine coming up with the approach you're pursuing:

  1. Are you writing the table description every time lambda is invoked? This could cause throttling on the DynamoDB DescribeTable API and/or S3 PutObject API for tables with very high write load.

  2. On tables with very large record counts, we've found that we have to perform the snapshot spread across several processes. Each process ends up being responsible for scraping the S3 objects whose names start with 0, 1, ... up to f. At the end, a "reduce" process rolls up the results of the individual processes into a single snapshot file. I am imagining that having the tablename.description record in the midst of the incremental records might mess with that final rollup step.

Generally speaking, we've used the snapshots for point-in-time restores of individual records rather than wholesale restores of the entire database. I do feel like there's some appeal in keeping the incremental backup step (the one that happens on Lambda) responsible for nothing more than that one thing. What if the DescribeTable request were made alongside the s3-snapshot.js call, perhaps optionally?
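i.e. something along these lines, with the description captured at snapshot time instead of inside the Lambda (a sketch - the .description suffix next to the snapshot object is just illustrative):

# snapshot as usual, and optionally capture the table description next to it
bin/incremental-snapshot.js s3://$BackupBucket/$BackupPrefix/$TABLE s3://$BackupBucket/${TABLE}-snapshot
aws --region $AWS_REGION dynamodb describe-table --table-name $TABLE \
  | aws s3 cp - s3://$BackupBucket/${TABLE}-snapshot.description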