alexbarta opened this issue 5 years ago
Hi Maor,
I'm trying to restore, but almost every time some file gets stuck during the copy and cain 0.5.1 hangs.
I tried some tuning of buffer size/parallelism but with no success; cain randomly gets stuck on a file copy. In this state, after a while the TCP connection towards minio disappears from the netstat output, but cain stays alive.
Any ideas on how to increase the verbosity of the copy process?
Regards
Could it be that these are large files? Can you check in k8s whether the file size changes during the copy?
Well, the schema has only a few records since I'm testing on a minimal installation; if I do a du on the minio folder, the total is less than 13M:
du -skh minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/*
3.4M minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0
3.4M minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-1
3.4M minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-2
BTW, I'm running Kubernetes v1.11.5 with a flannel host-gw setup.
On cassandra-0 the total data amounts to:
sudo du -skh /mnt/disks/cassandra/data/thingsboard
9.4M /mnt/disks/cassandra/data/thingsboard
same on the other nodes
Let's try to narrow this down. Can you try to do a copy from minio to k8s using skbn?
Sure, give me a minute.
OK, skbn seems to be working properly.
I created a file on minio:
sudo dd if=/dev/zero of=/mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.50469 s, 429 MB/s
Then I ran skbn to copy that file into a k8s container:
kubectl run cassandra-restore --rm --serviceaccount='cassandra-backup' -i --tty --restart=Never --image-pull-policy=IfNotPresent --image nuvo/skbn --env 'AWS_ACCESS_KEY_ID=admin' --env 'AWS_SECRET_ACCESS_KEY=*****' --env 'AWS_S3_NO_SSL=true' --env 'AWS_S3_FORCE_PATH_STYLE=true' --env 'AWS_S3_ENDPOINT=http://minio-svc.cfs.svc.cluster.local:9000' --command -- sh
If you don't see a command prompt, try pressing enter.
~ $
~ $ skbn cp --src s3://db-backup/abigfile --dst k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:02:54 [1/1] copy: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data
2019/01/13 16:03:00 [1/1] done: s3://db-backup/abigfile -> k8s://cfs/cassandra-0/cassandra/cassandra_data
And checked that the file copied correctly:
md5sum /mnt/disks/cassandra/stdin
cd573cfaace07e7949bc0c46028904ff /mnt/disks/cassandra/stdin
md5sum /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile
cd573cfaace07e7949bc0c46028904ff /mnt/NAS/repo/mounts/minio/storage/db-backup/abigfile
Maor, looking at your skbn PerformCopy code:
https://github.com/nuvo/skbn/blob/42781bdb9d5cd81fcda5a6ac44a17e0480fb0e94/pkg/skbn/skbn.go#L139
I see you are using io pipe buffers; maybe the hang is due to some race condition provoked by the pipew and piper goroutines. Converting the piper goroutine to a standard function could be a good test to see if that is the cause.
When cain gets stuck I can only see the "copy:" log output; the "done:" line never appears.
What do you think?
These routines run concurrently, allowing the copy to be done through a pipe. It has to be 2 goroutines...
See nuvo/skbn#3 for details
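For context, here is a minimal sketch of the pipe pattern being discussed (download/upload are placeholder names, not skbn's actual functions). io.Pipe is unbuffered, so the writer side must live in its own goroutine, and a stalled reader leaves the writer blocked forever, which would match the hang:

package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	// Stand-in for the source stream ("download" side, e.g. an S3 GetObject body).
	src := strings.NewReader("file contents streamed from the source")

	pr, pw := io.Pipe()

	// The writer MUST run in its own goroutine: io.Pipe has no internal
	// buffer, so every Write blocks until the reader consumes it. If the
	// "upload" side below stalls, this goroutine blocks forever and the
	// copy never logs "done:" -- consistent with the symptom in this issue.
	go func() {
		_, err := io.Copy(pw, src)
		pw.CloseWithError(err) // propagate the error (or EOF) to the reader
	}()

	// Stand-in for the destination ("upload" side, e.g. a k8s exec stream).
	var dst strings.Builder
	if _, err := io.Copy(&dst, pr); err != nil {
		fmt.Println("copy failed:", err)
		return
	}
	fmt.Println("done:", dst.String())
}

If the two io.Copy calls ran sequentially in one goroutine instead, the very first Write would deadlock, which is why the two-goroutine structure can't simply be flattened into a standard function.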
Then the hang is either in the Download/Upload functions...
Probably in download. Can you try the same again, but with a file that gets stuck?
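One way to confirm which side stalls (a sketch with hypothetical names; skbn doesn't expose this, so it would mean patching the copy path locally) is to wrap the stream in a counting reader and log whether the byte count keeps moving:

package main

import (
	"io"
	"log"
	"os"
	"sync/atomic"
	"time"
)

// countingReader wraps any io.Reader and counts bytes read.
// Hypothetical debugging helper, not part of skbn.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	atomic.AddInt64(&c.n, int64(n))
	return n, err
}

func main() {
	// Stand-in for the download stream; in a patched skbn this would wrap
	// the reader that feeds the pipe.
	cr := &countingReader{r: os.Stdin}

	// Watchdog: if the counter stops moving, the source side is stuck.
	go func() {
		var last int64
		for range time.Tick(2 * time.Second) {
			cur := atomic.LoadInt64(&cr.n)
			if cur == last {
				log.Printf("no progress, stalled at %d bytes", cur)
			} else {
				log.Printf("progress: %d bytes", cur)
			}
			last = cur
		}
	}()

	if _, err := io.Copy(io.Discard, cr); err != nil {
		log.Fatal(err)
	}
	log.Printf("done: %d bytes total", atomic.LoadInt64(&cr.n))
}

A stalled counter on the download side would point at the source (minio) read; a moving counter with no "done:" would point at the upload side instead.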
Unfortunately it's not a particular file; when running cain it randomly stops every time on different (very small) files. Only a couple of times did it finish the job.
Funny thing is that backup runs 2x faster and never gets stuck.
Here is a short GIF of the hang.
If minio is a pod in the cluster, you can try treating it as k8s://... Give it a shot as a workaround :)
Cool idea! I will try, thanks.
No luck, it got stuck here this time :(
cain restore --src 'k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster' -n cfs -k thingsboard -t 20190112212203 --cassandra-data-dir /cassandra_data/data --buffer-size 1 -l app=cassandra
...
2019/01/13 17:37:18 [0372/1674] copy: k8s://cfs/minio-deployment-6655ffc669-ph868/minio/storage/db-backup/cassandra/cfs/thingsboard-cluster/thingsboard/e9bce4/20190112212203/cassandra-0/event_by_id/manifest.json -> k8s://cfs/cassandra-0/cassandra/cassandra_data/data/thingsboard/event_by_id-42d57b20174511e986ce69f7ad260f0d/manifest.json
I want to assume this is an issue with minio, but can't verify at this time...
Well, using k8s:// gives the same result, so I guess it's something that happens during the PerformCopy stuff.
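Since the hang happens inside PerformCopy regardless of backend, one mitigation sketch (an assumed wrapper, not a cain or skbn feature) is to run each per-file copy under a timeout so a stuck transfer fails fast instead of hanging the whole restore:

package main

import (
	"fmt"
	"time"
)

// copyWithTimeout runs one per-file copy and gives up after d.
// Hypothetical wrapper, not part of cain or skbn. Note: the copy goroutine
// leaks on timeout, which is acceptable for diagnostics, not production.
func copyWithTimeout(d time.Duration, copyFn func() error) error {
	done := make(chan error, 1)
	go func() { done <- copyFn() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return fmt.Errorf("copy timed out after %s", d)
	}
}

func main() {
	// Simulate a transfer that hangs, like the stuck manifest.json copy above.
	err := copyWithTimeout(2*time.Second, func() error {
		select {} // blocks forever
	})
	fmt.Println(err) // "copy timed out after 2s" -- fail fast instead of hanging
}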
Is this project still active? I seem to be having the same issue writing from a Cassandra cluster on EKS to S3. Tried multiple times and it gets stuck at random parts each time.