tbarbugli / cassandra_snapshotter

A tool to backup cassandra nodes using snapshots and incremental backups on S3

snapshotter aborting while taking snapshot from a 5 TB 3 node cluster #120

Open ramo-karahasan-riechardt opened 6 years ago

ramo-karahasan-riechardt commented 6 years ago

Hi,

I'm running into trouble when using snapshotter on a 3-node DSE cluster with a data volume of 5 TB.

[10.0.106.42] out: cassandra_snapshotter.agent INFO     MSG: Initialized multipart upload for file /srv/cassandra/data/archive/mail_events_by_retailer_id/snapshots/20180207214412/archive-mail_events_by_retailer_id-jb-74285-Index.db to dc1-clusterv2/20180207214412/10.0.106.42//srv/cassandra/data/archive/mail_events_by_retailer_id/snapshots/20180207214412/archive-mail_events_by_retailer_id-jb-74285-Index.db.lzo
[10.0.106.42] out:         STRUCTURED: time=2018-02-08T07:27:06.131284-00 pid=2810
[10.0.106.43] out:

[10.0.106.42] out:

[10.0.106.41] Executing task 'clear_node_snapshot'
[10.0.106.42] Executing task 'clear_node_snapshot'
[10.0.106.43] Executing task 'clear_node_snapshot'
[10.0.106.43] run: /usr/bin/nodetool clearsnapshot -t "20180207214412"
[10.0.106.42] run: /usr/bin/nodetool clearsnapshot -t "20180207214412"
[10.0.106.41] run: /usr/bin/nodetool clearsnapshot -t "20180207214412"

Fatal error: Needed to prompt for a connection or sudo password (host: 10.0.106.41), but input would be ambiguous in parallel mode

Aborting.

Fatal error: Needed to prompt for a connection or sudo password (host: 10.0.106.42), but input would be ambiguous in parallel mode

Aborting.

Fatal error: Needed to prompt for a connection or sudo password (host: 10.0.106.43), but input would be ambiguous in parallel mode

Aborting.

Fatal error: One or more hosts failed while executing task 'clear_node_snapshot'

Aborting.

I'm running DSE 5.2.4 in the cluster, which ships C* 2.1, and Python 2.7.6.

I've tried to run snapshotter in two configurations:

with user ramo which has the following configuration in visudo

ramo ALL=(ALL:ALL) NOPASSWD:ALL

and performing this command

cassandra-snapshotter --s3-bucket-name=dc-cassandra-snapshots --s3-bucket-region=eu-west-1 --s3-base-path=dc1-clusterv2 --aws-access-key-id=<> --aws-secret-access-key=<> --s3-ssenc backup --hosts=10.0.106.41,10.0.106.42,10.0.106.43 --user=ramo

and with user root

cassandra-snapshotter --s3-bucket-name=dc-cassandra-snapshots --s3-bucket-region=eu-west-1 --s3-base-path=dc1-clusterv2 --aws-access-key-id=<> --aws-secret-access-key=<> --s3-ssenc backup --hosts=10.0.106.41,10.0.106.42,10.0.106.43 --user=root

User ramo has sudo access with NOPASSWD:ALL and can also log in on each node of the cluster. Furthermore, user ramo has access to the /tmp folder and owns the backupmanifest file.
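A minimal way to simulate what the tool needs is a fresh, non-interactive SSH connection plus passwordless sudo from the control host (a sketch, reusing one of the node IPs above):

# fail instead of prompting, which is effectively what Fabric's parallel mode requires
ssh -o BatchMode=yes ramo@10.0.106.41 'sudo -n true' && echo OK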

Any ideas what's going wrong here?

ramo-karahasan-riechardt commented 6 years ago

I've just run the tool on a 2-node C* cluster with around 25 GB of data, and it worked fine.

I was running it with user root.

ramo-karahasan-riechardt commented 6 years ago

I found the issue:

Even with all sudoers options set to NOPASSWD, my issue wasn't solved, because I'm using SSH agent forwarding and was running cassandra-snapshotter inside a screen session, which by default doesn't know about the underlying SSH session. So if you use SSH with agent key forwarding inside screen, you'll be prompted for a password by default.

Following this gist https://gist.github.com/martijnvermaat/8070533 fixed my issue; I can now connect to the other hosts without a password.
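The trick, roughly, is to keep a stable path pointing at the forwarded agent socket, so shells inside screen always find a live socket (a sketch of the gist's approach; the symlink location is arbitrary):

# outside screen, right after logging in: repoint a stable path at the current agent socket
ln -sf "$SSH_AUTH_SOCK" ~/.ssh/ssh_auth_sock
# inside screen: use the stable path instead of the stale one inherited at screen startup
export SSH_AUTH_SOCK=~/.ssh/ssh_auth_sock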

ramo-karahasan-riechardt commented 6 years ago

I thought I'd fixed it, but it aborted again at the very end with this message:

[10.0.106.41] out:

[10.0.106.42] out:

[10.0.106.43] out:

[10.0.106.41] Executing task 'clear_node_snapshot'
[10.0.106.42] Executing task 'clear_node_snapshot'
[10.0.106.43] Executing task 'clear_node_snapshot'
[10.0.106.42] run: /usr/bin/nodetool clearsnapshot -t "20180208103618"
[10.0.106.43] run: /usr/bin/nodetool clearsnapshot -t "20180208103618"
[10.0.106.41] run: /usr/bin/nodetool clearsnapshot -t "20180208103618"

Fatal error: Needed to prompt for a connection or sudo password (host: 10.0.106.42), but input would be ambiguous in parallel mode

Aborting.

Fatal error: Needed to prompt for a connection or sudo password (host: 10.0.106.43), but input would be ambiguous in parallel mode

Aborting.

Fatal error: Needed to prompt for a connection or sudo password (host: 10.0.106.41), but input would be ambiguous in parallel mode

Aborting.

Fatal error: One or more hosts failed while executing task 'clear_node_snapshot'

Aborting.

I don't get it: it dumps 4.8 TB, and at the very end it aborts with the above message. I don't see any ring or manifest.json files backed up.

The command I was using:

cassandra-snapshotter --s3-bucket-name=dc-cassandra-snapshots --s3-bucket-region=eu-west-1 --s3-base-path=dc1-clusterv3 --aws-access-key-id=<> --aws-secret-access-key=<> --s3-ssenc backup --hosts=10.0.106.41,10.0.106.42,10.0.106.43 --user=ramo --use-sudo=yes

Fabric and Paramiko are at the following versions:

Fabric 1.14.0
Paramiko 2.4.0

Any idea what's going wrong? I'm inside the screen session and can SSH to every host in the cluster as user ramo without a password. I can also edit and save files, e.g. `sudo vi /etc/hosts`, on every host without entering a password.
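One thing worth checking after such a long run is whether the forwarded agent socket is still alive inside screen when the final task starts, since the backup streams for hours before clear_node_snapshot opens fresh connections (a diagnostic sketch):

# confirm the forwarded agent is still reachable from inside screen
echo "$SSH_AUTH_SOCK" && ssh-add -l
# a fresh connection must still work without any prompt, like the final task's connections
ssh -o BatchMode=yes ramo@10.0.106.41 'sudo -n true' && echo OK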

Any ideas?

markediez commented 6 years ago

I was able to fix the "Fatal error: Needed to prompt for a connection or sudo password" error by making sure that the host running the cassandra-snapshotter script (HOST) could SSH to the listed nodes (C_NODES).

This is done by copying the id_rsa.pub of your HOST to the authorized_keys file of each node in C_NODES.

Note that my HOST machine is an instance within the same VPC as the C_NODES.
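For example (a sketch, reusing the user and node IPs from this thread):

# push HOST's public key to each node's authorized_keys
ssh-copy-id ramo@10.0.106.41
ssh-copy-id ramo@10.0.106.42
ssh-copy-id ramo@10.0.106.43
# verify each login works without a prompt
ssh -o BatchMode=yes ramo@10.0.106.41 true && echo OK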

ramo-karahasan-riechardt commented 6 years ago

Thanks @markediez. I am able to access all nodes from the host, and from inside a screen session, via SSH without being prompted for a password. I tried it once with root and also created a separate user listed in the sudoers file with NOPASSWD on all hosts.

Again, manually it worked, but while running snapshotter it didn't. The strange thing is that it throws the error after streaming 4.8 TB out of 5 TB, so at the very end.

I'm assuming that snapshotter also tries to log in to all hosts at the very beginning, before it streams, so that part seems to work fine.

I've now upgraded my DSE version and will try to run an update with OpsCenter.

I'm not sure whether this ticket should be closed, since no solution was found for my case.