nuvo / cain

Backup and restore tool for Cassandra on Kubernetes
Apache License 2.0

Cain stuck file copy #12

Open alexbarta opened 5 years ago

alexbarta commented 5 years ago

Hi Maor,

I'm trying to restore, but almost every time some file gets stuck during the copy and cain 0.5.1 hangs.

I tried tuning the buffer size and parallelism, without success: cain randomly gets stuck on some file copy. After a while in this state, the TCP connection to minio disappears from the netstat output, but the cain process stays alive.

Any idea how to increase the verbosity of the copy process?

Regards

maorfr commented 5 years ago

Could it be that these are large files? Can you check in k8s whether the file size changes during the copy?

alexbarta commented 5 years ago

Well, the schema has only a few records since I'm trying this on a minimal test installation. If I do a du on the minio folder, the total is less than 13M:

du -skh minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/*
3.4M    minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0
3.4M    minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-1
3.4M    minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-2

By the way, I'm running Kubernetes v1.11.5 with a flannel host-gw setup.

alexbarta commented 5 years ago

on cassandra-0 the total data amounts to:

sudo du -skh /mnt/disks/cassandra/data/thingsboard
9.4M    /mnt/disks/cassandra/data/thingsboard

same on the other nodes

maorfr commented 5 years ago

Let's try to narrow this down. Can you try to do a copy from minio to k8s using skbn?

alexbarta commented 5 years ago

Sure, give me a minute.

alexbarta commented 5 years ago

ok skbn seems to be working properly

created a file on minio

sudo dd if=/dev/zero of=/mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.50469 s, 429 MB/s

Run skbn to copy that file into k8s container

 kubectl run cassandra-restore  --rm  --serviceaccount='cassandra-backup' -i --tty --restart=Never --image-pull-policy=IfNotPresent --image nuvo/skbn --env 'AWS_ACCESS_KEY_ID=admin' --env 'AWS_SECRET_ACCESS_KEY=*****' --env 'AWS_S3_NO_SSL=true' --env 'AWS_S3_FORCE_PATH_STYLE=true' --env 'AWS_S3_ENDPOINT=http://minio-svc.cfs.svc.cluster.local:9000' --command -- sh
If you don't see a command prompt, try pressing enter.
~ $
~ $ skbn cp --src s3://db-backup/abigfile --dst k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:02:54 [1/1] copy: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:03:00 [1/1] done: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data

Check the file copied

 md5sum /mnt/disks/cassandra/stdin
cd573cfaace07e7949bc0c46028904ff  /mnt/disks/cassandra/stdin

md5sum /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile
cd573cfaace07e7949bc0c46028904ff  /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile

alexbarta commented 5 years ago

Maor, looking at your skbn PerformCopy code

https://github.com/nuvo/skbn/blob/42781bdb9d5cd81fcda5a6ac44a17e0480fb0e94/pkg/skbn/skbn.go#L139

I see you are using nio buffers. Maybe the hang is due to a race condition between the pipew and piper goroutines. Converting the piper goroutine to a standard function could be a good test to see whether that is the cause.

When cain gets stuck I only see the "copy:" log output; the matching "done:" line never appears.

What do you think?

maorfr commented 5 years ago

These routines are running concurrently, allowing copy to be done using a pipe. This has to be 2 goroutines...

maorfr commented 5 years ago

See nuvo/skbn#3 for details

alexbarta commented 5 years ago

Then the hang must be in either the Download or the Upload function.

maorfr commented 5 years ago

Probably in download. Can you try the same again, but with a file that gets stuck?

alexbarta commented 5 years ago

Unfortunately it is not a particular file. When running cain, it randomly stops on a different (very small) file every time. It finished the job only a couple of times.

Funny thing is that backup runs 2x faster and never gets stuck.

alexbarta commented 5 years ago

This is a short gif of the hang: [image: cainstuck]

maorfr commented 5 years ago

If minio is a pod in the cluster, you can try treating it as k8s://... Give it a shot, as a workaround :)

alexbarta commented 5 years ago

Cool idea! I will try, thanks.

alexbarta commented 5 years ago

No luck, it got stuck here this time :(

cain restore --src 'k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster' -n cfs -k thingsboard -t 20190112212203 --cassandra-data-dir /cassandra_data/data --buffer-size 1 -l app=cassandra
...
2019/01/13 17:37:18 [0372/1674] copy: k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0/event_by_id/manifest.json -> k8s://cfs/cassandra-0/cassandra/cassandra_data/data/thingsboard/event_by_id-42d57b20174511e986ce69f7ad260f0d/manifest.json

maorfr commented 5 years ago

I want to assume this is an issue with minio, but can't verify at this time...

alexbarta commented 5 years ago

Well, using k8s:// gives the same result, so I guess it is something that happens during PerformCopy.

Bfoster-melrok commented 3 years ago

Is this project still active? I seem to be having this same issue writing from cassandra cluster on eks to s3. Tried multiple times and it gets stuck at random parts each time.