Closed: sardonic2 closed this issue 8 years ago.
We are suffering from the very same issue. We have 5 nodes, and on many (but not all) of the nightly runs of cassandra_snapshotter we get failures like this one:
Traceback (most recent call last):
File "/opt/demand_logic/virtual_envs/cassandra_backup/bin/cassandra-snapshotter-agent", line 9, in <module>
load_entry_point('cassandra-snapshotter==0.3.1-dl12', 'console_scripts', 'cassandra-snapshotter-agent')()
File "/opt/demand_logic/virtual_envs/cassandra_backup/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 221, in main
args.incremental_backups
File "/opt/demand_logic/virtual_envs/cassandra_backup/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 129, in put_from_manifest
for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc) for f in files)):
File "/usr/lib/python2.7/multiprocessing/pool.py", line 626, in next
raise value
boto.exception.S3ResponseError: S3ResponseError: 200 OK
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>***************</RequestId><HostId>************</HostId></Error>
We get anywhere from 0 to 9 failures per night, on average 2 per night. Here are the failures from the past 7 days:
Sep 29 08:11:19 [cassandra01] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Sep 30 20:16:02 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Sep 30 20:27:38 [cassandra05] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 05:02:03 [cassandra04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 05:12:05 [cassandra05] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 05:29:04 [cassandra01] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 06:02:35 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 20:10:43 [cassandra03] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 20:30:45 [cassandra05] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 20:31:53 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 20:46:31 [cassandra04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 1 22:20:17 [cassandra01] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 5 13:21:57 [cassandra04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 5 13:31:12 [cassandra03] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct 5 13:39:14 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Python 2.7.6, boto 2.35.1, cassandra-snapshotter 0.3.1
The command we are using is:
cassandra-snapshotter --aws-access-key-id=XXXX --aws-secret-access-key=XXXX --s3-bucket-region=eu-west-1 --s3-bucket-name=SOME_BUCKET --s3-base-path=DLCluster backup --agent-virtualenv=/opt/demand_logic/virtual_envs/cassandra_backup --backup-schema --hosts=cassandra05,cassandra04,cassandra03,cassandra02,cassandra01 --user=cassandra --sshkey=/home/admin/.ssh/cassandra_backup.rsa --no-sudo --keyspaces=SOME_KEYSPACE
(We added the options "--agent-virtualenv", "--sshkey" and "--no-sudo" in our fork.)
I am trying to reproduce this issue. Could you share some information about the size of the files uploaded to S3?
Same issue
boto 2.38.0, Python 2.7.9, cassandra_snapshotter 0.4.0
Running on a node outside of the Cassandra cluster with the command:
sudo /usr/local/bin/cassandra-snapshotter --aws-access-key-id=xxxxxxxx --aws-secret-access-key=yyyyyyyy --s3-bucket-name=zzzzzzz --s3-bucket-region=eu-west-1 --s3-base-path=backups/cassandra backup --hosts=cassandra01,cassandra02,cassandra03 --user=user --use-sudo=yes --new-snapshot
The backup size is close to 80 GB per Cassandra host, across 2 keyspaces with 1 table each. The exception is the same as mentioned above.
Additional details: no ring or manifest files are ever written. Most often the data files in the per-host folders on S3 are incomplete after cassandra_snapshotter finishes. Sometimes some keyspaces are missing for some hosts (on each run it is a different host that lacks a keyspace folder). Sometimes the keyspace and column family folder is there, but it is smaller than expected, so apparently not all data files were copied into it.
When the failure happens on one of the Cassandra hosts, the exception is the following:
user: [cassandra02] out: lzop 1.03
user: [cassandra02] out: LZO library 2.08
user: [cassandra02] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
user: [cassandra02] out: Traceback (most recent call last):
user: [cassandra02] out: File "/usr/bin/cassandra-snapshotter-agent", line 9, in <module>
user: [cassandra02] out: [rest of the traceback was collapsed by the issue formatting; it ends with the same S3 InternalError as above]
We are also having this exact same issue.
45-node cluster (backing up 15 nodes), about 1 TB per node, SSTable size around 160 MB.
[cassa-e1d-07] out: Traceback (most recent call last):
[cassa-e1d-07] out: File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
[cassa-e1d-07] out: [rest of the traceback was collapsed by the issue formatting; it ends with the same S3 InternalError]
On my end the file sizes were around 160 MB. Density-wise it was 1.3 TB per node over 6 nodes.
can anyone try to downgrade to 0.3.1 and share results? I am trying to reproduce this on 0.4.0 without luck so far
@sardonic2 @zachaller @igchub can any of you add this at the top of agent.py to enable boto debug logging on stdout? Also, can you add the S3 region for your backup bucket here?
import boto
boto.set_stream_logger('boto')
I am using 0.3.1 and get this error (see https://github.com/tbarbugli/cassandra_snapshotter/issues/56#issuecomment-145831983 ).
We have 650 GB of data total over 5 nodes; the SSTables add up to 120-130 GB per node. The largest individual SSTable per node varies from 25 to 62 GB.
We take a new snapshot once a week, and differential backups all other nights.
Region = eu-west-1
if any of you guys can add the boto logging snippet, it would be great. I am having issues reproducing this in an isolated environment.
I have been trying to get the debug logging to run. In doing so I am not sure whether it is not happening, or whether enabling it just makes any errors a lot harder to see because of the amount of output. I am also having RAM issues when debugging is turned on; I run out of RAM even after going up to an r3.xl, and I do not have high hopes for that to work either.
Some other info
S3: us-east-1, from a VPC using an S3 endpoint. Cassandra 2.1.9, cassandra_snapshotter 0.4.0, also tried with 0.3.1. (It used to run great, but we had to turn backups off for a while; when we recently tried to turn them on again we started getting this issue.)
on what end do you get high memory usage? perhaps you can try setting the log level a bit higher:
import boto
import logging
boto.set_stream_logger('boto', level=logging.WARNING)
Ok, will try that. The RAM issue was on the box that started the backup, and I edited the code on all the Cassandra nodes.
alright, my guess is that the default boto log level (DEBUG) outputs too much and the main backup process gets overwhelmed. perhaps it makes sense to tee the output to a file for later inspection
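For reference, a minimal sketch of sending boto's log output to a file instead of stdout, using only the standard logging module; the /tmp/boto-debug.log path is just an example, not anything cassandra_snapshotter uses:

import logging

# boto writes to the standard 'boto' logger, so attaching a file handler to
# that logger captures the debug output without flooding the backup process.
boto_logger = logging.getLogger('boto')
boto_logger.setLevel(logging.DEBUG)
handler = logging.FileHandler('/tmp/boto-debug.log')  # example path
handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
boto_logger.addHandler(handler)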
It was running out of memory even when redirecting the output at the end of the command; with logging set to WARNING, however, it is running ok:
> cassabackup.log 2>&1
With logging set to WARNING I did not seem to get any more info, unless I set it up wrong. In /usr/local/bin/cassandra-snapshotter-agent:
import boto
import logging
#boto.set_stream_logger('boto')
boto.set_stream_logger('boto', level=logging.WARNING)
Fatal error: sudo() received nonzero return code 1 while executing!
Requested: cassandra-snapshotter-agent put --aws-access-key-id=************ --aws-secret-access-key=************** --s3-bucket-name=gf-backup-cassandra --s3-bucket-region=us-east-1 --s3-base-path=ec2_east_daily/20151014163112/cassa-e1d-04 --manifest=/tmp/backupmanifest --bufsize=64 --concurrency=4
Executed: sudo -S -p 'sudo password:' /bin/bash -l -c "cassandra-snapshotter-agent put --aws-access-key-id=**************** --aws-secret-access-key=****************** --s3-bucket-name=gf-backup-cassandra --s3-bucket-region=us-east-1 --s3-base-path=ec2_east_daily/20151014163112/cassa-e1d-04 --manifest=/tmp/backupmanifest --bufsize=64 --concurrency=4"
Aborting.
[cassa-e1d-04] sudo: cassandra-snapshotter-agent create-upload-manifest --manifest_path=/tmp/backupmanifest --snapshot_name=20151014163112 --snapshot_keyspaces= --snapshot_table= --conf_path=/etc/cassandra/
[cassa-e1d-04] sudo: cassandra-snapshotter-agent put --aws-access-key-id=**************** --aws-secret-access-key=********************* --s3-bucket-name=gf-backup-cassandra --s3-bucket-region=us-east-1 --s3-base-path=ec2_east_daily/20151014163112/cassa-e1d-04 --manifest=/tmp/backupmanifest --bufsize=64 --concurrency=4
[cassa-e1d-04] out: lzop 1.03
[cassa-e1d-04] out: LZO library 2.06
[cassa-e1d-04] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-04] out: Traceback (most recent call last):
[cassa-e1d-04] out: File "/usr/local/bin/cassandra-snapshotter-agent", line 13, in <module>
[cassa-e1d-04] out: load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[cassa-e1d-04] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[cassa-e1d-04] out: args.incremental_backups
[cassa-e1d-04] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[cassa-e1d-04] out: for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[cassa-e1d-04] out: File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[cassa-e1d-04] out: raise value
[cassa-e1d-04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[cassa-e1d-04] out: <?xml version="1.0" encoding="UTF-8"?>
[cassa-e1d-04] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>363F4C7F3159</RequestId><HostId>xEz/zOVxDP9Qunc/IpFU/vqiFkBrUn0h8AhmY4awt2tjbkBIBRQY</HostId></Error>
[cassa-e1d-04] out: lzop: Broken pipe: <stdout>
[cassa-e1d-04] out:
have you tried changing the concurrency settings at agent level? can you try to set the size of the pool in agent.py to 1 and see if this still happens?
I am trying with a concurrency of 1 right now, but if I drop the pool size I run into the other issue I had (#50, https://github.com/tbarbugli/cassandra_snapshotter/pull/50) where only one node gets backed up, and it's the last node on the list.
sorry, I was not really specific. The concurrency setting I refer to is related to S3 file upload: https://github.com/tbarbugli/cassandra_snapshotter/blob/master/cassandra_snapshotter/agent.py#L144
After following the code a bit, I set this line to 1:
https://github.com/tbarbugli/cassandra_snapshotter/blob/master/cassandra_snapshotter/agent.py#L29
Setting it there should amount to the same thing since I was using the default anyway. It is still running and will be for a while, but it has run the longest so far; it's probably too early to tell yet. The upload rate of the Cassandra boxes is obviously quite a bit lower too, due to the single thread. I will report back after it completes or if it fails.
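For anyone following along, this is roughly the shape of the change being tested, with hypothetical names; the real pool and upload worker live in agent.py:

from multiprocessing import Pool

def upload_file(args):
    # stand-in for agent.py's real upload worker
    pass

concurrency = 1  # previously left at the default (4 in the commands above)
pool = Pool(concurrency)
files = []  # file paths taken from the backup manifest
# with a single worker the S3 uploads on this node run one at a time
for _ in pool.imap(upload_file, ((f,) for f in files)):
    pass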
ahhh spoke too soon
[cassa-e1d-01] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-07] out: Traceback (most recent call last):
[cassa-e1d-07] out: File "/usr/local/bin/cassandra-snapshotter-agent", line 13, in <module>
[cassa-e1d-07] out: load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[cassa-e1d-07] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[cassa-e1d-07] out: args.incremental_backups
[cassa-e1d-07] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[cassa-e1d-07] out: for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[cassa-e1d-07] out: File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[cassa-e1d-07] out: raise value
[cassa-e1d-07] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[cassa-e1d-07] out: <?xml version="1.0" encoding="UTF-8"?>
[cassa-e1d-07] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>E7A0A31A2290</RequestId><HostId>tSz5aWJpIOWcc8KzlQFBL9KEfWowVbC9e00c6iIPOe0A+Jroy2h4jSIK4</HostId></Error>
[cassa-e1d-07] out: lzop: Broken pipe: <stdout>
[cassa-e1d-07] out: lzop: Broken pipe: <stdout>
TBH I am glad this is not a concurrency problem :) I am going to push some changes to master for agent.py in half an hour; they might either fix the issue or give us a better idea about what's going on with S3.
agent.py on master now has better logging capabilities; I should also have fixed a bunch of corner cases with retries and multipart upload error handling.
bad news, same issue:
[cassa-e1d-12] out: lzop 1.03
[cassa-e1d-12] out: LZO library 2.06
[cassa-e1d-12] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-08] out: Traceback (most recent call last):
[cassa-e1d-08] out: File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
[cassa-e1d-08] out: load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[cassa-e1d-08] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 281, in main
[cassa-e1d-08] out: args.incremental_backups
[cassa-e1d-08] out: File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 158, in put_from_manifest
[cassa-e1d-08] out: for ret in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[cassa-e1d-08] out: File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[cassa-e1d-08] out: raise value
[cassa-e1d-08] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[cassa-e1d-08] out: <?xml version="1.0" encoding="UTF-8"?>
[cassa-e1d-08] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>7B8614903A92886C</RequestId><HostId>gTrs3Q40C6Ia4SJkT+mBHRB+FemhVumY9pQfDy6kzlXWpUZGFeqZsK5E80ljw+q66n/nh3Ncwho=</HostId></Error>
[cassa-e1d-08] out: lzop: Broken pipe: <stdout>
[cassa-e1d-08] out: lzop: Broken pipe: <stdout>
[cassa-e1d-08] out: lzop: Broken pipe: <stdout>
[cassa-e1d-08] out:
Fatal error: sudo() received nonzero return code 1 while executing!
alright, added even more logging to some parts of code that might be causing this. thank you for your patience and help :)
[cassa-e1d-05] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR MSG: Error completing multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015020934/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db to ec2_east_daily/20151015020934/cassa-e1d-05//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015020934/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db.lzo
[cassa-e1d-05] out: STRUCTURED: time=2015-10-15T02:47:01.910852-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR MSG: We encountered an internal error. Please try again.
[cassa-e1d-05] out: STRUCTURED: time=2015-10-15T02:47:01.911333-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR MSG: Error while cancelling multipart upload
[cassa-e1d-05] out: STRUCTURED: time=2015-10-15T02:47:05.951088-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR MSG: Error while cancelling multipart upload
[cassa-e1d-05] out: STRUCTURED: time=2015-10-15T02:47:09.987280-00 pid=25043
With this portion repeating until I killed it:
[cassa-e1d-05] out: STRUCTURED: time=2015-10-15T02:47:05.951088-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR MSG: Error while cancelling multipart upload
From my quick look at your code and the logs, it would seem I should see some retry logging output, but I do not see any. It's late now so I will look at it more in the morning, but my first thought is that it does not seem to be retrying.
what's the size of the file that fails?
I added more logging on master, including a dump of the XML for the complete-upload request.
Here is the size. I will try the update and start another run right now:
cassa-e1d-05 /e/cassandra# ls -lah /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db
-rw-r--r-- 2 cassandra cassandra 15M Sep 30 07:15 /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db
The file size does not look really interesting; I can't find anything clearly wrong with it when I look at the S3 specs.
It looks like it continues now. I thought it was running longer than before, so I did some searching through the log file and found this:
[cassa-e1d-03] out: cassandra_snapshotter.agent INFO MSG: Initialized multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-2054131-Summary.db to ec2_east_daily/20151015160953/
cassa-e1d-03//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-2054131-Summary.db.lzo
[cassa-e1d-03] out: STRUCTURED: time=2015-10-15T16:17:57.302001-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent INFO MSG: Initialized multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1878941-Digest.sha1 to ec2_east_daily/20151015160953
/cassa-e1d-03//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1878941-Digest.sha1.lzo
[cassa-e1d-03] out: STRUCTURED: time=2015-10-15T16:17:57.460207-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent ERROR MSG: Error completing multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1892112-TOC.txt to ec2_east_daily/2015101516095
3/cassa-e1d-03//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1892112-TOC.txt.lzo
[cassa-e1d-03] out: STRUCTURED: time=2015-10-15T16:17:57.468279-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent ERROR MSG: We encountered an internal error. Please try again.
[cassa-e1d-03] out: STRUCTURED: time=2015-10-15T16:17:57.469028-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent ERROR MSG: <CompleteMultipartUpload>
[cassa-e1d-03] out: <Part>
[cassa-e1d-03] out: <PartNumber>1</PartNumber>
[cassa-e1d-03] out: <ETag>"baf143bd0580b4ee95ee4dab647ccf95"</ETag>
[cassa-e1d-03] out: </Part>
[cassa-e1d-03] out: </CompleteMultipartUpload>
[cassa-e1d-03] out: STRUCTURED: time=2015-10-15T16:17:57.490009-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent INF_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-374913-Statistics.db.lzo
[cassa-e1d-10] out: STRUCTURED: time=2015-10-15T16:17:57.501288-00 pid=1278
[cassa-e1d-10] out: cassandra_snapshotter.agent INFO MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-jb-241027-Index.db to ec2_east_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-jb-241027-Index.db.lzo
[cassa-e1d-10] out: STRUCTURED: time=2015-10-15T16:17:57.620265-00 pid=1278
[cassa-e1d-10] out: cassandra_snapshotter.agent INFO MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-434330-Digest.sha1 to ec2_east_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-434330-Digest.sha1.lzo
[cassa-e1d-10] out: STRUCTURED: time=2015-10-15T16:17:57.723364-00 pid=1278
Would you expect this line to be called to attempt a retry, or not?
Here is one more log snippet, this one with multiple parts:
[cassa-e1d-12] out: cassandra_snapshotter.agent INFO MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-417837-Index.db to ec2_east_daily/20151015160953/cassa-e1d-12//eph
emeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-417837-Index.db.lzo
[cassa-e1d-12] out: STRUCTURED: time=2015-10-15T16:27:20.474632-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent ERROR MSG: Error completing multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-451954-Data.db to ec2_east_daily/20151015160953/cassa-e1d-12/
/ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-451954-Data.db.lzo
[cassa-e1d-12] out: STRUCTURED: time=2015-10-15T16:27:20.485370-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent ERROR MSG: We encountered an internal error. Please try again.
[cassa-e1d-12] out: STRUCTURED: time=2015-10-15T16:27:20.485718-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent ERROR MSG: <CompleteMultipartUpload>
[cassa-e1d-12] out: <Part>
[cassa-e1d-12] out: <PartNumber>1</PartNumber>
[cassa-e1d-12] out: <ETag>"31569bf77c40d1e7b629012556573b2e"</ETag>
[cassa-e1d-12] out: </Part>
[cassa-e1d-12] out: <Part>
[cassa-e1d-12] out: <PartNumber>2</PartNumber>
[cassa-e1d-12] out: <ETag>"a087ab7309593942c8515d701e2442c5"</ETag>
[cassa-e1d-12] out: </Part>
[cassa-e1d-12] out: <Part>
[cassa-e1d-12] out: <PartNumber>3</PartNumber>
[cassa-e1d-12] out: <ETag>"971a54b40fd7d11e8fd171d742e4a4ec"</ETag>
[cassa-e1d-12] out: </Part>
[cassa-e1d-12] out: </CompleteMultipartUpload>
[cassa-e1d-12] out: STRUCTURED: time=2015-10-15T16:27:20.504481-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent INFO MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-381739-Statistics.db to ec2_east_daily/20151015160953/cassa-e1d-12//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-381739-Statistics.db.lzo
[cassa-e1d-12] out: STRUCTURED: time=2015-10-15T16:27:20.686339-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent INFO MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-354924-Digest.sha1 to ec2_east_daily/20151015160953/cassa-e1d-12//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-354924-Digest.sha1.lzo
[cassa-e1d-12] out:-Data.db to ec2_east_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-402207-Data.db.lzo
[cassa-e1d-10] out: STRUCTURED: time=2015-10-15T16:27:04.102644-00 pid=1278
it should only retry uploading parts or cancelling the upload. can you still tell me the file size of the file that failed to upload?
Looks like the file in the log above was
-rw-r--r-- 2 cassandra cassandra 161M Oct 14 07:45 /ephemeral/cassandra/data/soil_model_v2_data/grid/soil_model_v2_data-grid-ka-451954-Data.db
I also had some other interesting errors in the file, but I am not sure whether I had cleaned the file out or not, so I am re-running and will report back once I confirm or see the errors again.
alright, let me know when you have some more information about this. BTW, did you try using a version earlier than 0.4? I think this is something we introduced in the last release; it would be great to have confirmation of that.
Version 0.3.1 had the same issue
out: lzop 1.03
out: LZO library 2.08
out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
out: Traceback (most recent call last):
out: File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
out: [rest of the traceback was collapsed by the issue formatting; it ends with the same S3 InternalError]
We have the same problem happening, but only with incremental backups. I noticed that sometimes the local backupmanifest files contain a lot of compaction-related SSTables. We get this error when the backupmanifest file looks wrong. I reran it and now the manifest looks ok. It seems very random. Please let me know if you need more info.
This is happening with 0.3.1 and also 0.4.0.
The same issue: 4 nodes, 100 GB, eu-east-1 region. According to the AWS forum this is an S3-side problem. The best way to handle it is to retry the operation again after some timeout. Related discussion with links to the AWS forum topics: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1590
going to add retries for every step of the multipart upload in the next release
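For context, a rough sketch of what retrying every step of a boto multipart upload could look like; this is only an illustration under assumptions (an already-opened boto bucket object, a local file path), not the actual change on master:

import time
from io import BytesIO
from boto.exception import S3ResponseError

def retry(call, attempts=5, delay=2):
    # retry a single S3 call with exponential backoff
    for i in range(attempts):
        try:
            return call()
        except S3ResponseError:
            if i == attempts - 1:
                raise
            time.sleep(delay * (2 ** i))

def multipart_upload(bucket, key_name, path, chunk_size=64 * 1024 * 1024):
    # every step (initiate, each part, complete) gets its own retry loop;
    # the upload is cancelled if it ultimately cannot be completed
    mp = retry(lambda: bucket.initiate_multipart_upload(key_name))
    try:
        with open(path, 'rb') as fp:
            part_num = 1
            while True:
                chunk = fp.read(chunk_size)
                if not chunk:
                    break
                retry(lambda: mp.upload_part_from_file(BytesIO(chunk), part_num))
                part_num += 1
        retry(lambda: mp.complete_upload())
    except Exception:
        mp.cancel_upload()
        raise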
@tbarbugli I'm running 0.3.1, and am experiencing this. We have seen this in S3 before when uploading vast numbers of files, and immediate retries often don't work. We found that retrying the failing file later fixes it, so maybe pop that file off the list, put it back onto the end of the list, and retry it later, retrying up to 15 times with backoff perhaps.
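A minimal sketch of that retry-later idea, assuming a hypothetical upload_one(path) callable that raises on failure (it is not part of cassandra_snapshotter):

import time
from collections import deque

def upload_with_deferred_retries(files, upload_one, max_attempts=15, base_delay=2):
    # failed files go to the back of the queue and are retried later,
    # with an exponential backoff capped at 5 minutes per attempt
    queue = deque((f, 0) for f in files)
    while queue:
        path, attempts = queue.popleft()
        try:
            upload_one(path)
        except Exception:
            if attempts + 1 >= max_attempts:
                raise
            time.sleep(min(base_delay * (2 ** attempts), 300))
            queue.append((path, attempts + 1))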
Experiencing this intermittently with 0.4.0. 6 nodes, over 100G on each. Tends to only occur on one node and the rest work fine.
We're having this issue as well, 4 nodes with ~200GB each. As others have said, tends to only occur on one node. Has there been any progress resolving it?
I have a patch that fixes this for me (attached, patch.diff). I didn't push it because it goes on top of a previous patch I pushed which hasn't been pulled into the master yet. The attached patch all.diff combines the two and should apply correctly to the master.
Ciao, Duncan.
@CunningBaldrick Might be missing something but I can't see the mentioned files - could you re-send?
the problem seems to be related to not retrying every step of the multi-part upload. some work is on master to mitigate the problem but I did not have the time to properly write and test a patch for this. @DuncanSands can you create a PR from your patch files?
Hi Tommaso, the patch touches some of the same places as pull request https://github.com/tbarbugli/cassandra_snapshotter/pull/69, so can you please pull that one, and I will then send a pull request that goes on top of it. I could also do things the other way round: tweak my patch for this issue so it applies to master, and fix up the above pull request (because otherwise it won't apply any more).
@DuncanSands #69 is merged, can you try to create the PR again?
I made a pull request, https://github.com/tbarbugli/cassandra_snapshotter/pull/75
The continuous integration build failed, but it doesn't seem to have anything to do with this patch. The failure is:
File "build/bdist.linux-x86_64/egg/cassandra_snapshotter/logging_helper.py", line 80
    except EnvironmentError, e:
                           ^
SyntaxError: invalid syntax
Ciao, Duncan.
@DuncanSands that's a Python 3 syntax issue coming from a different base. I will fix that after your PR is merged. Did you see my inline comments on your PR's code?
Hosts periodically throw the attached error, causing backups to fail.
boto 2.38, Python 2.7.6, cassandra_snapshotter 0.4.0