tbarbugli / cassandra_snapshotter

A tool to back up Cassandra nodes using snapshots and incremental backups on S3

Error: backups fail on exception #56

Closed sardonic2 closed 8 years ago

sardonic2 commented 9 years ago

Hosts periodically throw the attached error, causing backups to fail.

boto-2.38 python-2.7.6 cassandra_snapshotter 0.4.0

[db4a.] out:   File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
[db4a.] out:     load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[db4a.] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[db4a.] out:     args.incremental_backups
[db4a.] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[db4a.] out:     for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[db4a.] out:   File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[db4a.] out:     raise value
[db4a.] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[db4a.] out: <?xml version="1.0" encoding="UTF-8"?>
[db4a.] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>RIDHERE</RequestId><HostId>IDHERE</HostId></Error>
[db4a.] out: 

Fatal error: run() received nonzero return code 1 while executing!

Requested: cassandra-snapshotter-agent put (REST OF COMMAND)

Aborting.
nilstoedtmann commented 9 years ago

We are suffering from the very same issue. We have 5 nodes, and on many (but not all) of the nightly runs of cassandra_snapshotter we get failures like this one:

    Traceback (most recent call last):
  File "/opt/demand_logic/virtual_envs/cassandra_backup/bin/cassandra-snapshotter-agent", line 9, in <module>
    load_entry_point('cassandra-snapshotter==0.3.1-dl12', 'console_scripts', 'cassandra-snapshotter-agent')()
  File "/opt/demand_logic/virtual_envs/cassandra_backup/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 221, in main
    args.incremental_backups
  File "/opt/demand_logic/virtual_envs/cassandra_backup/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 129, in put_from_manifest
    for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc) for f in files)):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 626, in next
    raise value
boto.exception.S3ResponseError: S3ResponseError: 200 OK
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>***************</RequestId><HostId>************</HostId></Error>

We get anywhere from 0 to 9 failures per night, 2 per night on average. Here are the failures from the past 7 days:

Sep 29 08:11:19 [cassandra01] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Sep 30 20:16:02 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Sep 30 20:27:38 [cassandra05] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 05:02:03 [cassandra04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 05:12:05 [cassandra05] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 05:29:04 [cassandra01] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 06:02:35 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 20:10:43 [cassandra03] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 20:30:45 [cassandra05] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 20:31:53 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 20:46:31 [cassandra04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  1 22:20:17 [cassandra01] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  5 13:21:57 [cassandra04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  5 13:31:12 [cassandra03] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
Oct  5 13:39:14 [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK

python 2.7.6 boto 2.35.1 cassandra-snapshotter 0.3.1

The command we are using is

cassandra-snapshotter     --aws-access-key-id=XXXX     --aws-secret-access-XXXX     --s3-bucket-region=eu-west-1     --s3-bucket-name=SOME_BUCKET     --s3-base-path=DLCluster backup   --agent-virtualenv=/opt/demand_logic/virtual_envs/cassandra_backup     --backup-schema --hosts=cassandra05,cassandra04,cassandra03,cassandra02,cassandra01     --user=cassandra  --sshkey=/home/admin/.ssh/cassandra_backup.rsa  --no-sudo    --keyspaces=SOME_KEYSPACE

(We added the options "--agent-virtualenv", "--sshkey" and "--no-sudo" in our fork.)

tbarbugli commented 9 years ago

I am trying to reproduce this issue. Could you share some information about the size of the files uploaded to S3?

igchub commented 9 years ago

Same issue

boto-2.38.0 python-2.7.9 cassandra_snapshotter 0.4.0

Running on a node outside of the Cassandra cluster with the command:

sudo /usr/local/bin/cassandra-snapshotter --aws-access-key-id=xxxxxxxx --aws-secret-access-key=yyyyyyyy --s3-bucket-name=zzzzzzz --s3-bucket-region=eu-west-1 --s3-base-path=backups/cassandra backup --hosts=cassandra01,cassandra02,cassandra03 --user=user --use-sudo=yes --new-snapshot

The backup size is nearly 80 GB per Cassandra host, spread over 2 keyspaces with 1 table each. The exception is as mentioned above.

Additional details: no ring or manifest files are ever written. Most often, the data files in the host folders on S3 are incomplete after cassandra_snapshotter finishes. Sometimes some keyspaces are missing for some hosts (on each run, a different host is missing a keyspace folder). Sometimes the keyspace and column family folder is there, but it is smaller than expected, so apparently not all data files were copied into it.

When the failure happens for some of the Cassandra hosts, the exception is the following:

user: [cassandra02] out: lzop 1.03
user: [cassandra02] out: LZO library 2.08
user: [cassandra02] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
user: [cassandra02] out: Traceback (most recent call last):
user: [cassandra02] out:   File "/usr/bin/cassandra-snapshotter-agent", line 9, in <module>
user: [cassandra02] out:     load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
user: [cassandra02] out:   File "/usr/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 266, in main
user: [cassandra02] out:     args.incremental_backups
user: [cassandra02] out:   File "/usr/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
user: [cassandra02] out:     for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
user: [cassandra02] out:   File "/usr/lib64/python2.7/multiprocessing/pool.py", line 659, in next
user: [cassandra02] out:     raise value
user: [cassandra02] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
user: [cassandra02] out: <?xml version="1.0" encoding="UTF-8"?>
user: [cassandra02] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>...</RequestId><HostId>...</HostId></Error>
user: [cassandra02] out: lzop: Broken pipe: <stdout>
user: [cassandra02] out: lzop: Broken pipe: <stdout>

zachaller commented 9 years ago

We are also having this exact same issue.

45-node cluster (backing up 15 nodes), about 1 TB per node, SSTable size 160 MB.

[cassa-e1d-07] out: Traceback (most recent call last):
[cassa-e1d-07] out:   File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
[cassa-e1d-07] out:     load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[cassa-e1d-07] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[cassa-e1d-07] out:     args.incremental_backups
[cassa-e1d-07] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[cassa-e1d-07] out:     for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[cassa-e1d-07] out:   File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[cassa-e1d-07] out:     raise value
[cassa-e1d-07] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[cassa-e1d-07] out: <?xml version="1.0" encoding="UTF-8"?>
[cassa-e1d-07] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>F98D35...6E289</RequestId><HostId>Uq9kplpDm9w4tV5G1gpH9e...mS4cseW2ugWYJve6+ww6QLf3QNV7sBeiw=</HostId></Error>
[cassa-e1d-07] out: lzop: Broken pipe: <stdout>
[cassa-e1d-07] out: lzop: Broken pipe: <stdout>
[cassa-e1d-07] out:

sardonic2 commented 9 years ago

On my end the file sizes were around 160 MB. Density-wise it was 1.3 TB per node over 6 nodes.

tbarbugli commented 9 years ago

Can anyone try downgrading to 0.3.1 and share the results? I am trying to reproduce this on 0.4.0, without luck so far.

tbarbugli commented 9 years ago

@sardonic2 @zachaller @igchub can any of you add this at the top of agent.py to enable boto debug logging on stdout? Also, can you add the S3 region for your backup bucket here?

import boto

boto.set_stream_logger('boto')
nilstoedtmann commented 9 years ago

I am using 0.3.1 and get this error (see https://github.com/tbarbugli/cassandra_snapshotter/issues/56#issuecomment-145831983 ).

We have 650 GB of data in total over 5 nodes; the SSTable files are 120-130 GB per node. The largest individual SSTable file per node varies from 25 to 62 GB.

We take a new snapshot once a week, and differential backups on all other nights.

Region = eu-west-1

tbarbugli commented 9 years ago

If any of you guys can add the boto logging snippet, it would be great. I am having trouble reproducing this in an isolated environment.

zachaller commented 9 years ago

I have been trying to get the debug logging to run; I am not sure whether it is not happening or whether enabling it just makes any errors a lot harder to see due to the amount of output. I am also having issues with RAM when debugging is turned on and I run out of memory; I am up to an r3.xl currently, but I do not have high hopes for this to work either.

Some other info

S3: us-east-1, using a VPC with an S3 endpoint. Cassandra 2.1.9, cassandra_snapshotter 0.4.0; also tried with 0.3.1 (it used to run great, but we had to turn backups off for a while; we recently tried turning them on again and started getting this issue).

tbarbugli commented 9 years ago

On which end do you get high memory usage? Perhaps you can try setting the log level a bit higher:

import boto
import logging

boto.set_stream_logger('boto', level=logging.WARNING)


zachaller commented 9 years ago

OK, will try that. The RAM issue was on the box that started the backup, and I edited the code on all the Cassandra nodes.

tbarbugli commented 9 years ago

Alright, my guess is that the default boto log level (DEBUG) outputs too much and the main backup process gets overwhelmed. Perhaps it makes sense to tee the output to a file for later inspection.


zachaller commented 9 years ago

It was running out of memory even when redirecting the output at the end of the command; setting logging to WARNING is running OK, however.

> cassabackup.log 2>&1

zachaller commented 9 years ago

With logging set to WARNING I did not seem to get any more info, unless I set it wrong:

/usr/local/bin/cassandra-snapshotter-agent

import boto
import logging
#boto.set_stream_logger('boto')
boto.set_stream_logger('boto', level=logging.WARNING)
Fatal error: sudo() received nonzero return code 1 while executing!

Requested: cassandra-snapshotter-agent              put             --aws-access-key-id=************             --aws-secret-access-key=**************             --s3-bucket-name=gf-backup-cassandra             --s3-bucket-region=us-east-1              --s3-base-path=ec2_east_daily/20151014163112/cassa-e1d-04             --manifest=/tmp/backupmanifest             --bufsize=64             --concurrency=4
Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "cassandra-snapshotter-agent              put             --aws-access-key-id=****************             --aws-secret-access-key=******************             --s3-bucket-name=gf-backup-cassandra             --s3-bucket-region=us-east-1              --s3-base-path=ec2_east_daily/20151014163112/cassa-e1d-04             --manifest=/tmp/backupmanifest             --bufsize=64             --concurrency=4"

Aborting.
[cassa-e1d-04] sudo: cassandra-snapshotter-agent              create-upload-manifest             --manifest_path=/tmp/backupmanifest             --snapshot_name=20151014163112             --snapshot_keyspaces=             --snapshot_table=             --conf_path=/etc/cassandra/
[cassa-e1d-04] sudo: cassandra-snapshotter-agent              put             --aws-access-key-id=****************             --aws-secret-access-key=*********************             --s3-bucket-name=gf-backup-cassandra             --s3-bucket-region=us-east-1              --s3-base-path=ec2_east_daily/20151014163112/cassa-e1d-04             --manifest=/tmp/backupmanifest             --bufsize=64             --concurrency=4
[cassa-e1d-04] out: lzop 1.03
[cassa-e1d-04] out: LZO library 2.06
[cassa-e1d-04] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-04] out: Traceback (most recent call last):
[cassa-e1d-04] out:   File "/usr/local/bin/cassandra-snapshotter-agent", line 13, in <module>
[cassa-e1d-04] out:     load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[cassa-e1d-04] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[cassa-e1d-04] out:     args.incremental_backups
[cassa-e1d-04] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[cassa-e1d-04] out:     for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[cassa-e1d-04] out:   File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[cassa-e1d-04] out:     raise value
[cassa-e1d-04] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[cassa-e1d-04] out: <?xml version="1.0" encoding="UTF-8"?>
[cassa-e1d-04] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>363F4C7F3159</RequestId><HostId>xEz/zOVxDP9Qunc/IpFU/vqiFkBrUn0h8AhmY4awt2tjbkBIBRQY</HostId></Error>
[cassa-e1d-04] out: lzop: Broken pipe: <stdout>
[cassa-e1d-04] out:
tbarbugli commented 9 years ago

Have you tried changing the concurrency settings at the agent level? Can you try setting the size of the pool in agent.py to 1 and see if this still happens?
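
For reference, a minimal sketch of what forcing the agent's upload pool down to a single worker could look like (illustrative only; the constant name and helper below are made up, not the actual agent.py code):

import multiprocessing

# Hypothetical illustration: a single worker uploads files to S3 one at a
# time, which rules out upload concurrency as the cause of the failures.
UPLOAD_CONCURRENCY = 1  # agent.py defines its own default constant

def build_upload_pool():
    return multiprocessing.Pool(UPLOAD_CONCURRENCY)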


zachaller commented 9 years ago

I am trying with a concurrency of 1 right now, but if I drop the pool size I run into the other issue I had where only one node gets backed up, and it's the last node on the list:

https://github.com/tbarbugli/cassandra_snapshotter/pull/50

tbarbugli commented 9 years ago

Sorry, I was not really specific. The concurrency setting I'm referring to is the one for S3 file uploads: https://github.com/tbarbugli/cassandra_snapshotter/blob/master/cassandra_snapshotter/agent.py#L144


zachaller commented 9 years ago

After following the code a bit, I set this line to 1:

https://github.com/tbarbugli/cassandra_snapshotter/blob/master/cassandra_snapshotter/agent.py#L29

It seems to be the same thing, since I am using the default. It is still running and will be for a while; it has run the longest so far, but it's probably too early to tell yet. The upload rate of the Cassandra boxes is obviously quite a bit lower too, due to the single thread. Will report back after it completes or if it fails.

zachaller commented 9 years ago

ahhh spoke too soon

[cassa-e1d-01] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-07] out: Traceback (most recent call last):
[cassa-e1d-07] out:   File "/usr/local/bin/cassandra-snapshotter-agent", line 13, in <module>
[cassa-e1d-07] out:     load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[cassa-e1d-07] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 266, in main
[cassa-e1d-07] out:     args.incremental_backups
[cassa-e1d-07] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 145, in put_from_manifest
[cassa-e1d-07] out:     for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[cassa-e1d-07] out:   File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[cassa-e1d-07] out:     raise value
[cassa-e1d-07] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[cassa-e1d-07] out: <?xml version="1.0" encoding="UTF-8"?>
[cassa-e1d-07] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>E7A0A31A2290</RequestId><HostId>tSz5aWJpIOWcc8KzlQFBL9KEfWowVbC9e00c6iIPOe0A+Jroy2h4jSIK4</HostId></Error>
[cassa-e1d-07] out: lzop: Broken pipe: <stdout>
[cassa-e1d-07] out: lzop: Broken pipe: <stdout>
tbarbugli commented 9 years ago

TBH I am glad this is not a concurrency problem :) I am going to push some changes to master for agent.py in half an hour; they might either fix the issue or give us a better idea about what's going on with S3.


tbarbugli commented 9 years ago

agent.py on master now has better logging capabilities; I should also have fixed a bunch of corner cases with retries and multipart upload error handling.

zachaller commented 9 years ago

Bad news, same issue:

[cassa-e1d-12] out: lzop 1.03
[cassa-e1d-12] out: LZO library 2.06
[cassa-e1d-12] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-08] out: Traceback (most recent call last):
[cassa-e1d-08] out:   File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
[cassa-e1d-08] out:     load_entry_point('cassandra-snapshotter==0.4.0', 'console_scripts', 'cassandra-snapshotter-agent')()
[cassa-e1d-08] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 281, in main
[cassa-e1d-08] out:     args.incremental_backups
[cassa-e1d-08] out:   File "/usr/local/lib/python2.7/dist-packages/cassandra_snapshotter/agent.py", line 158, in put_from_manifest
[cassa-e1d-08] out:     for ret in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc, buffer_size) for f in files)):
[cassa-e1d-08] out:   File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
[cassa-e1d-08] out:     raise value
[cassa-e1d-08] out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
[cassa-e1d-08] out: <?xml version="1.0" encoding="UTF-8"?>
[cassa-e1d-08] out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>7B8614903A92886C</RequestId><HostId>gTrs3Q40C6Ia4SJkT+mBHRB+FemhVumY9pQfDy6kzlXWpUZGFeqZsK5E80ljw+q66n/nh3Ncwho=</HostId></Error>
[cassa-e1d-08] out: lzop: Broken pipe: <stdout>
[cassa-e1d-08] out: lzop: Broken pipe: <stdout>
[cassa-e1d-08] out: lzop: Broken pipe: <stdout>
[cassa-e1d-08] out:

Fatal error: sudo() received nonzero return code 1 while executing!
tbarbugli commented 9 years ago

Alright, added even more logging to some parts of the code that might be causing this. Thank you for your patience and help :)


zachaller commented 9 years ago
[cassa-e1d-05] out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR    MSG: Error completing multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015020934/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db to ec2_east_daily/20151015020934/cassa-e1d-05//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015020934/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db.lzo
[cassa-e1d-05] out:         STRUCTURED: time=2015-10-15T02:47:01.910852-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR    MSG: We encountered an internal error. Please try again.
[cassa-e1d-05] out:         STRUCTURED: time=2015-10-15T02:47:01.911333-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR    MSG: Error while cancelling multipart upload
[cassa-e1d-05] out:         STRUCTURED: time=2015-10-15T02:47:05.951088-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR    MSG: Error while cancelling multipart upload
[cassa-e1d-05] out:         STRUCTURED: time=2015-10-15T02:47:09.987280-00 pid=25043

With this portion repeating until I killed it:

[cassa-e1d-05] out:         STRUCTURED: time=2015-10-15T02:47:05.951088-00 pid=25043
[cassa-e1d-05] out: cassandra_snapshotter.agent ERROR    MSG: Error while cancelling multipart upload
zachaller commented 9 years ago

From a quick look at your code and the logs, it would seem I should see some retry logging output, but I do not see any. It's late now, so I will look at it more in the morning, but my first thought is that it does not seem to be retrying.

tbarbugli commented 9 years ago

what's the size of the file that fails?

tbarbugli commented 9 years ago

I added more logging on master, including a dump of the XML for the complete-upload request.

zachaller commented 9 years ago

Here is the size. I will try to update and start another run right now.

cassa-e1d-05 /e/cassandra# ls -lah /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db
-rw-r--r-- 2 cassandra cassandra 15M Sep 30 07:15 /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/wxgrid_long_lived_v2_data-grid-ka-1248573-Index.db
tbarbugli commented 9 years ago

The file size does not look really interesting; I can't find anything clearly wrong with it when I look at the S3 specs.

zachaller commented 9 years ago

It looks like it continues now. I thought it was running longer than before; I did some searching through the log file and found this:

[cassa-e1d-03] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-2054131-Summary.db to ec2_east_daily/20151015160953/
cassa-e1d-03//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-2054131-Summary.db.lzo
[cassa-e1d-03] out:         STRUCTURED: time=2015-10-15T16:17:57.302001-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1878941-Digest.sha1 to ec2_east_daily/20151015160953
/cassa-e1d-03//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1878941-Digest.sha1.lzo
[cassa-e1d-03] out:         STRUCTURED: time=2015-10-15T16:17:57.460207-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent ERROR    MSG: Error completing multipart upload for file /ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1892112-TOC.txt to ec2_east_daily/2015101516095
3/cassa-e1d-03//ephemeral/cassandra/data/wxgrid_long_lived_v2_data/grid/snapshots/20151015160953/wxgrid_long_lived_v2_data-grid-ka-1892112-TOC.txt.lzo
[cassa-e1d-03] out:         STRUCTURED: time=2015-10-15T16:17:57.468279-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent ERROR    MSG: We encountered an internal error. Please try again.
[cassa-e1d-03] out:         STRUCTURED: time=2015-10-15T16:17:57.469028-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent ERROR    MSG: <CompleteMultipartUpload>
[cassa-e1d-03] out:           <Part>
[cassa-e1d-03] out:             <PartNumber>1</PartNumber>
[cassa-e1d-03] out:             <ETag>"baf143bd0580b4ee95ee4dab647ccf95"</ETag>
[cassa-e1d-03] out:           </Part>
[cassa-e1d-03] out:         </CompleteMultipartUpload>
[cassa-e1d-03] out:         STRUCTURED: time=2015-10-15T16:17:57.490009-00 pid=16713
[cassa-e1d-03] out: cassandra_snapshotter.agent INF_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-374913-Statistics.db.lzo
[cassa-e1d-10] out:         STRUCTURED: time=2015-10-15T16:17:57.501288-00 pid=1278
[cassa-e1d-10] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-jb-241027-Index.db to ec2_east_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-jb-241027-Index.db.lzo
[cassa-e1d-10] out:         STRUCTURED: time=2015-10-15T16:17:57.620265-00 pid=1278
[cassa-e1d-10] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-434330-Digest.sha1 to ec2_east_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-434330-Digest.sha1.lzo
[cassa-e1d-10] out:         STRUCTURED: time=2015-10-15T16:17:57.723364-00 pid=1278
zachaller commented 9 years ago

Would you expect this line to be called to attempt a retry, or not?

https://github.com/tbarbugli/cassandra_snapshotter/blob/9548451b80921acd31910c94a04d9092c0fc0293/cassandra_snapshotter/agent.py#L109

Here is one more snippet of log, this time with multiple parts:

[cassa-e1d-12] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-417837-Index.db to ec2_east_daily/20151015160953/cassa-e1d-12//eph
emeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-417837-Index.db.lzo
[cassa-e1d-12] out:         STRUCTURED: time=2015-10-15T16:27:20.474632-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent ERROR    MSG: Error completing multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-451954-Data.db to ec2_east_daily/20151015160953/cassa-e1d-12/
/ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-451954-Data.db.lzo
[cassa-e1d-12] out:         STRUCTURED: time=2015-10-15T16:27:20.485370-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent ERROR    MSG: We encountered an internal error. Please try again.
[cassa-e1d-12] out:         STRUCTURED: time=2015-10-15T16:27:20.485718-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent ERROR    MSG: <CompleteMultipartUpload>
[cassa-e1d-12] out:           <Part>
[cassa-e1d-12] out:             <PartNumber>1</PartNumber>
[cassa-e1d-12] out:             <ETag>"31569bf77c40d1e7b629012556573b2e"</ETag>
[cassa-e1d-12] out:           </Part>
[cassa-e1d-12] out:           <Part>
[cassa-e1d-12] out:             <PartNumber>2</PartNumber>
[cassa-e1d-12] out:             <ETag>"a087ab7309593942c8515d701e2442c5"</ETag>
[cassa-e1d-12] out:           </Part>
[cassa-e1d-12] out:           <Part>
[cassa-e1d-12] out:             <PartNumber>3</PartNumber>
[cassa-e1d-12] out:             <ETag>"971a54b40fd7d11e8fd171d742e4a4ec"</ETag>
[cassa-e1d-12] out:           </Part>
[cassa-e1d-12] out:         </CompleteMultipartUpload>
[cassa-e1d-12] out:         STRUCTURED: time=2015-10-15T16:27:20.504481-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-381739-Statistics.db to ec2_east_daily/20151015160953/cassa-e1d-12//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-381739-Statistics.db.lzo
[cassa-e1d-12] out:         STRUCTURED: time=2015-10-15T16:27:20.686339-00 pid=19955
[cassa-e1d-12] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-354924-Digest.sha1 to ec2_east_daily/20151015160953/cassa-e1d-12//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-354924-Digest.sha1.lzo
[cassa-e1d-12] out:-Data.db to ec2_east_daily/20151015160953/cassa-e1d-10//ephemeral/cassandra/data/soil_model_v2_data/grid/snapshots/20151015160953/soil_model_v2_data-grid-ka-402207-Data.db.lzo
[cassa-e1d-10] out:         STRUCTURED: time=2015-10-15T16:27:04.102644-00 pid=1278
tbarbugli commented 9 years ago

It should only retry uploading parts or cancelling the upload. Can you tell me the file size of the file that failed to upload?

zachaller commented 9 years ago

Looks like the file in the log above was

-rw-r--r-- 2 cassandra cassandra 161M Oct 14 07:45 /ephemeral/cassandra/data/soil_model_v2_data/grid/soil_model_v2_data-grid-ka-451954-Data.db

I also had some other interesting errors in the file, but I'm not sure whether I cleaned the file out or not, so I'm re-running it and will report back once I confirm or see the errors again.

tbarbugli commented 9 years ago

Alright, let me know when you have some more information about this. BTW, did you try using a version earlier than 0.4? I think this is something we introduced in the last release; it would be great to have confirmation of that.


igchub commented 9 years ago

Version 0.3.1 had the same issue

out: lzop 1.03
out: LZO library 2.08
out: Copyright (C) 1996-2010 Markus Franz Xaver Johannes Oberhumer
out: Traceback (most recent call last):
out:   File "/usr/local/bin/cassandra-snapshotter-agent", line 9, in <module>
out:     load_entry_point('cassandra-snapshotter==0.3.1', 'console_scripts', 'cassandra-snapshotter-agent')()
out:   File "/usr/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 222, in main
out:     args.incremental_backups
out:   File "/usr/local/lib/python2.7/site-packages/cassandra_snapshotter/agent.py", line 130, in put_from_manifest
out:     for _ in pool.imap(upload_file, ((bucket, f, destination_path(s3_base_path, f), s3_ssenc) for f in files)):
out:   File "/usr/lib64/python2.7/multiprocessing/pool.py", line 659, in next
out:     raise value
out: boto.exception.S3ResponseError: S3ResponseError: 200 OK
out: <?xml version="1.0" encoding="UTF-8"?>
out: <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message>...</Error>

Dchamard commented 9 years ago

We have the same problem happening, but only with incremental backups. I noticed that sometimes the local backupmanifest files contain a lot of compaction-related SSTables. We get this error when the backupmanifest file looks wrong. I reran it and now the manifest looks OK. It seems very random. Please let me know if you need more info.

This is happening with 0.3.1 and also 0.4.0.

vimvim commented 9 years ago

The same issue: 4 nodes, 100 GB, eu-east-1 region. According to the AWS forums, this is a problem on the S3 side. The best way to handle it is to retry the operation after some timeout. Related discussion with links to the AWS forum topics: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1590

tbarbugli commented 9 years ago

going to add retries for every step of the multipart upload in the next release

willejs commented 9 years ago

@tbarbugli I'm running 0.3.1 and am experiencing this. We have seen this with S3 before when uploading vast numbers of files, and immediate retries often don't work. We found that retrying the failing file later fixes this, so maybe pop that file off the list, put it back onto the end of the list, and retry it later, backing off over up to 15 attempts perhaps.
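
A rough sketch of that idea (a hypothetical helper, not the snapshotter's actual code): keep a queue of files, and when an upload raises, push the file back onto the end of the queue and try it again later, giving up after a fixed number of attempts per file.

import time
from collections import deque

def upload_with_deferred_retries(files, upload_file, max_attempts=15):
    # upload_file is assumed to raise (e.g. boto's S3ResponseError) on failure.
    queue = deque((f, 0) for f in files)
    while queue:
        path, attempts = queue.popleft()
        try:
            upload_file(path)
        except Exception:
            if attempts + 1 >= max_attempts:
                raise
            # crude backoff; a real implementation would avoid blocking
            # the other uploads while waiting
            time.sleep(min(2 ** attempts, 300))
            queue.append((path, attempts + 1))  # retry this file later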

dizzythinks commented 9 years ago

Experiencing this intermittently with 0.4.0. 6 nodes, over 100 GB on each. It tends to occur on only one node; the rest work fine.

jsgriffin commented 8 years ago

We're having this issue as well, 4 nodes with ~200GB each. As others have said, tends to only occur on one node. Has there been any progress resolving it?

CunningBaldrick commented 8 years ago


I have a patch that fixes this for me (attached, patch.diff). I didn't push it because it goes on top of a previous patch I pushed which hasn't been pulled into the master yet. The attached patch all.diff combines the two and should apply correctly to the master.

Ciao, Duncan.

jsgriffin commented 8 years ago

@CunningBaldrick I might be missing something, but I can't see the mentioned files; could you re-send?

DuncanSands commented 8 years ago

Here you go (I had to change the file types to TXT in order for GitHub to allow them): patch.txt all.txt


tbarbugli commented 8 years ago

The problem seems to be related to not retrying every step of the multipart upload. Some work is on master to mitigate the problem, but I did not have the time to properly write and test a patch for this. @DuncanSands can you create a PR from your patch files?
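
For illustration, retrying each multipart step individually with boto 2.x might look roughly like the sketch below (the retry helper and upload_multipart function are made up for this example; they are not the code on master or the patch under discussion):

import time
from boto.exception import S3ResponseError

def retry(fn, attempts=5, delay=2):
    # Retry a single multipart step; the S3 InternalError responses seen
    # in this thread are transient and usually succeed on a later attempt.
    for i in range(attempts):
        try:
            return fn()
        except S3ResponseError:
            if i == attempts - 1:
                raise
            time.sleep(delay * (i + 1))

def upload_multipart(bucket, key_name, parts):
    # parts is an iterable of file-like objects, one per upload part
    mp = retry(lambda: bucket.initiate_multipart_upload(key_name))
    try:
        for num, part in enumerate(parts, start=1):
            retry(lambda: mp.upload_part_from_file(part, num))
        retry(mp.complete_upload)
    except Exception:
        retry(mp.cancel_upload)
        raise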

DuncanSands commented 8 years ago

Hi Tommaso, the patch touches some of the same places as this pull request: https://github.com/tbarbugli/cassandra_snapshotter/pull/69. Can you please pull that one? I will then send a pull request that goes on top of it. I could also do things the other way round: tweak my patch for this issue so it applies to master, and then fix up the above pull request (because otherwise it won't apply any more).

tbarbugli commented 8 years ago

@DuncanSands #69 is merged, can you try to create the PR again?

CunningBaldrick commented 8 years ago


I made a pull request, https://github.com/tbarbugli/cassandra_snapshotter/pull/75

The continuous integration build failed, but it doesn't seem to have anything to do with this patch. The failure is:

File "build/bdist.linux-x86_64/egg/cassandra_snapshotter/logging_helper.py", line 80 except EnvironmentError, e: ^ SyntaxError: invalid syntax

Ciao, Duncan.

tbarbugli commented 8 years ago

@DuncanSands that's a Python 3 syntax issue coming from a different base. I will fix that after your PR is merged. Did you see my comments inline on your PR's code?