winks closed this issue 8 years ago
Did you retry? If yes, did it work?
On Fri, Sep 25, 2015 at 9:06 AM Florian Anderiasch notifications@github.com wrote:
Hi, I just started using cassandra_snapshotter and I got a similar error twice when running it on my cluster (4 nodes, each seems to have ~18GB of data)
[172.31.10.42] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Full log of the run is here https://gist.github.com/winks/989ff567c11879646d3e
Here are the outputs of s3cmd du (converted to GB) on the path/CLUSTERNAME/IP folders in S3. As you can see, the first pair is missing some data, while on my first try (second pair) it's not so obvious.
node 1: 17.74581958632916212081 (run 1)   12.03844990301877260208 (run 2)
node 2: 17.30710690561681985855 (run 1)   17.32890863530337810516 (run 2)
node 3: 18.47752564307302236557 (run 1)   18.50679215136915445327 (run 2)
node 4: 17.72296549659222364425 (run 1)   17.74818217754364013671 (run 2)
TL;DR: I think either there's something wrong with S3 at the moment, or cassandra_snapshotter should really retry the snapshot upload when there's a boto error...
— Reply to this email directly or view it on GitHub https://github.com/tbarbugli/cassandra_snapshotter/issues/54.
As I said, I did 2 runs and both failed (but only on one host each, and they were different ones). I wanted to wait until tomorrow to run another one; I'll come back with results then.
I doubt S3 is having an outage; there is probably something else wrong here. Do you specify the S3 region name? What region are you using?
My instances are all in eu-west-1a, and I've been using --s3-bucket-region=eu-west-1.
Full command used is
cassandra-snapshotter --aws-access-key-id=REDACTED --aws-secret-access-key=SECRET \
--s3-bucket-region=eu-west-1 --s3-bucket-name=COMPANY-cassandra-backup \
--s3-base-path=prod/COMPANY_cluster/$(date +%u) backup \
--hosts=172.31.5.185,172.31.6.149,172.31.5.20,172.31.10.42 --user=cassandra \
--cassandra-conf-path=/etc/cassandra --backup-schema
Make sure your S3 bucket is in the region you specified in the command (you can see that from the web console).
Yes, it says Ireland, so eu-west-1 afaik.
Hmm, according to the S3 docs:
Processing of a Complete Multipart Upload request could take several minutes to complete. After Amazon S3 begins processing the request, it sends an HTTP response header that specifies a 200 OK response. While processing is in progress, Amazon S3 periodically sends whitespace characters to keep the connection from timing out. Because a request could fail after the initial 200 OK response has been sent, it is important that you check the response body to determine whether the request succeeded.
In agent.py#L101 nothing is checked, and this seems to be exactly the loop where I get the error:
for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
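The docs quoted above boil down to: even on a 200 response, the completion body has to be parsed, and an embedded Error element treated as a failure. A minimal sketch of that check (the function name and exception type are made up for illustration, not part of cassandra_snapshotter):

```python
import xml.etree.ElementTree as ET

class S3UploadError(Exception):
    """Raised when a 200 OK response actually carries an S3 error body."""

def check_complete_response(body):
    """Inspect the XML body returned after a CompleteMultipartUpload call.

    S3 can return 200 OK and still report a failure in the body, so the
    status code alone proves nothing.
    """
    # S3 pads the response with whitespace while it processes the
    # request; strip it before parsing.
    root = ET.fromstring(body.strip())
    # Element names may arrive with or without an XML namespace prefix.
    tag = root.tag.rsplit('}', 1)[-1]
    if tag == 'Error':
        code = root.findtext('Code')
        message = root.findtext('Message')
        raise S3UploadError('%s: %s' % (code, message))
    return root
```

A caller would run this on the completion response and treat the raised exception like any other upload failure (i.e. something worth retrying).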
Ran it again now and everything worked fine (4x18GB). I'm going to keep an eye on this and scan the logfiles for a few days; if it pops up again, I'll add a catch/retry in the code above.
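For reference, the kind of catch/retry mentioned above could look roughly like this; this is a generic sketch, not the fix that later landed on master, and the helper name and parameters are invented:

```python
import time

def with_retries(fn, args, retries=3, base_delay=2, retriable=(Exception,)):
    """Call fn(*args), retrying with exponential backoff on failure.

    A transient boto S3ResponseError (like the 200-OK-with-error-body
    case) would be listed in `retriable`; the last failure is re-raised
    so genuine errors still surface.
    """
    for attempt in range(retries):
        try:
            return fn(*args)
        except retriable:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

In put_from_manifest, the worker handed to pool.imap could then call with_retries(upload_file, task_args) instead of upload_file directly.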
OK, so now I've had 3 successful daily backups in a row. I think this can be closed; it looks like a fluke in S3, because I did not change anything: same bucket, same region, same hosts.
Maybe it's still worth double-checking that line of code I linked.
After I said it had run 3 times, it stopped working completely without my changing anything, but with a different error:
[172.31.5.20] out: Traceback (most recent call last):
[172.31.5.20] out: File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
[172.31.5.20] out: load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[172.31.5.20] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[172.31.5.20] out: args.incremental_backups
[172.31.5.20] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[172.31.5.20] out: for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[172.31.5.20] out: File "/usr/lib/python2.7/multiprocessing/pool.py", line 626, in next
[172.31.5.20] out: raise value
[172.31.5.20] out: boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
[172.31.5.20] out: <Error><Code>MalformedXML</Code><Message>The XML you provided was not well-formed or did not validate against our published schema</Message><RequestId>81B0A43D8CBD2F9F</RequestId><HostId>9oFB3k6cP5sntV5RMTQ+8jFFgXdW+kPMWv0C8OEeoWgUszagyijfDz5F/yBjU16tcQMJ5yFpPl4=</HostId></Error>
[172.31.5.20] out:
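For context: MalformedXML on a multipart upload usually means the CompleteMultipartUpload request body didn't match the schema S3 publishes (one common trigger is completing an upload with an empty part list). A sketch of what a well-formed body contains, built in Python (the helper name is made up; boto builds this internally):

```python
import xml.etree.ElementTree as ET

def complete_multipart_body(parts):
    """Build the XML body S3 expects for CompleteMultipartUpload.

    `parts` is a list of (part_number, etag) tuples in part order; an
    empty list would produce a body S3 rejects with MalformedXML.
    """
    root = ET.Element('CompleteMultipartUpload')
    for number, etag in parts:
        part = ET.SubElement(root, 'Part')
        ET.SubElement(part, 'PartNumber').text = str(number)
        ET.SubElement(part, 'ETag').text = etag
    return ET.tostring(root)
```

So a part that failed to upload (or whose ETag was lost) on the client side can surface later as this server-side schema error.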
Also, #56 sounds exactly like my original problem.
We have a "boto.exception.S3ResponseError: S3ResponseError: 200 OK" problem too. I am not sure whether this ticket is about the same issue, hence I reported the details at #56, which matches our problem more closely.
Should be fixed on master now. Please re-open if this still happens.
OK, just managed to test this (1.0.0 release). Successfully snapshotted and uploaded 650 GB over 4 nodes. Thanks a lot!
Not sure if this issue is related.