tbarbugli / cassandra_snapshotter

A tool to back up Cassandra nodes to S3 using snapshots and incremental backups

Strange boto error on `upload_node_backups` #54

Closed: winks closed this issue 8 years ago

winks commented 8 years ago

Hi, I just started using cassandra_snapshotter and I got a similar error twice when running it on my cluster (4 nodes, each with roughly 18 GB of data):

[172.31.10.42] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK

Full log of the run is here: https://gist.github.com/winks/989ff567c11879646d3e

Here are the outputs of s3cmd du (converted to GB) on the path/CLUSTERNAME/IP folders in S3. As you can see, the first pair is missing some data, while on my first try (second pair) it's not so obvious.

node 1: 17.74581958632916212081 (run 1)  12.03844990301877260208 (run 2)
node 2: 17.30710690561681985855 (run 1)  17.32890863530337810516 (run 2)
node 3: 18.47752564307302236557 (run 1)  18.50679215136915445327 (run 2)
node 4: 17.72296549659222364425 (run 1)  17.74818217754364013671 (run 2)

TL;DR: I think either there's something wrong with S3 at the moment, or cassandra_snapshotter should really retry uploading the snapshot if there's a boto error...

Not sure if this issue is related.

tbarbugli commented 8 years ago

Did you retry? If yes, did it work?

winks commented 8 years ago

As I said, I did 2 runs and both failed (but only on one host each, and they were different hosts). I wanted to wait until tomorrow to run another one; I'll come back with results then.

tbarbugli commented 8 years ago

I doubt S3 is having an outage; there is probably something else wrong here. Do you specify the S3 region name? What region are you using?

winks commented 8 years ago

My instances are all in eu-west-1a and I've been using --s3-bucket-region=eu-west-1.

The full command used is:

cassandra-snapshotter --aws-access-key-id=REDACTED --aws-secret-access-key=SECRET \
--s3-bucket-region=eu-west-1 --s3-bucket-name=COMPANY-cassandra-backup \
--s3-base-path=prod/COMPANY_cluster/$(date +%u) backup \
--hosts=172.31.5.185,172.31.6.149,172.31.5.20,172.31.10.42 --user=cassandra \
--cassandra-conf-path=/etc/cassandra --backup-schema

tbarbugli commented 8 years ago

Make sure your S3 bucket is in the region you specified in the command (you can see that from the web console).
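
If it's easier than the console, a quick check with boto should also work. This is just a sketch: the bucket name is taken from the command above, credentials come from the environment or the boto config, and an empty result means US Standard (us-east-1).

import boto

# Print the bucket's location constraint, e.g. 'eu-west-1'.
conn = boto.connect_s3()
bucket = conn.get_bucket('COMPANY-cassandra-backup', validate=False)
print(bucket.get_location())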

winks commented 8 years ago

Yes, it says Ireland, so eu-west-1 afaik.

winks commented 8 years ago

Hmm, according to the S3 documentation on Complete Multipart Upload:

Processing of a Complete Multipart Upload request could take several minutes to complete. After Amazon S3 begins processing the request, it sends an HTTP response header that specifies a 200 OK response. While processing is in progress, Amazon S3 periodically sends whitespace characters to keep the connection from timing out. Because a request could fail after the initial 200 OK response has been sent, it is important that you check the response body to determine whether the request succeeded.

In agent.py#L101 nothing is checked, and this seems to be exactly the loop where I get the error:

for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
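
For reference, a rough sketch of the kind of catch/retry I have in mind. None of this is the project's actual code: the decorator name and back-off values are made up, and upload_file stands in for the existing per-file helper in agent.py.

import time
from functools import wraps

from boto.exception import S3ResponseError

def retry_on_s3_error(max_retries=3, base_delay=2):
    # Retry the wrapped call when boto raises S3ResponseError, e.g. the
    # "200 OK" error raised while completing a multipart upload.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except S3ResponseError:
                    if attempt == max_retries:
                        raise
                    # simple exponential back-off before retrying this file
                    time.sleep(base_delay ** attempt)
        return wrapper
    return decorator

# Hypothetical usage: wrap the per-file upload helper so the pool.imap(...)
# loop above retries transient failures instead of aborting the whole run.
# upload_file = retry_on_s3_error()(upload_file)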

winks commented 8 years ago

Ran it again now and everything worked fine (4 x ~18 GB). I'm going to keep an eye on this and scan the log files for a few days; if it pops up again I'll add a catch/retry in the code above.

winks commented 8 years ago

OK, so now I've had 3 successful daily backups in a row. I think this can be closed; it looks like a fluke in S3, because I did not change anything: same bucket, same region, same hosts.

Maybe it's still worth double-checking that line of code I linked.

winks commented 8 years ago

Without my changing anything, and after the 3 successful runs I mentioned, it stopped working completely, this time with a different error:

[172.31.5.20] out: Traceback (most recent call last):
[172.31.5.20] out:   File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
[172.31.5.20] out:     load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[172.31.5.20] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[172.31.5.20] out:     args.incremental_backups
[172.31.5.20] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[172.31.5.20] out:     for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[172.31.5.20] out:   File "/usr/lib/python2.7/multiprocessing/pool.py", line 626, in next
[172.31.5.20] out:     raise value
[172.31.5.20] out: boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
[172.31.5.20] out: <Error><Code>MalformedXML</Code><Message>The XML you provided was not well-formed or did not validate against our published schema</Message><RequestId>81B0A43D8CBD2F9F</RequestId><HostId>9oFB3k6cP5sntV5RMTQ+8jFFgXdW+kPMWv0C8OEeoWgUszagyijfDz5F/yBjU16tcQMJ5yFpPl4=</HostId></Error>
[172.31.5.20] out:
winks commented 8 years ago

Also, #56 sounds exactly like my original problem.

nilstoedtmann commented 8 years ago

We have a "boto.exception.S3ResponseError: S3ResponseError: 200 OK" problem too. I am not sure whether this ticket is about the same issue, so I reported the details in #56, which matches our problem more closely.

tbarbugli commented 8 years ago

Should be fixed on master now. Please re-open if this still happens.

winks commented 8 years ago

OK, just managed to test this (1.0.0 release). Successfully snapshotted and uploaded 650 GB over 4 nodes. Thanks a lot!