Open · c3-clement opened 6 months ago
Note that after the issue happens, I can delete the K8ssandraCluster, re-create it, and re-create a Medusa restore job. The restoration then succeeds and the 3 Cassandra pods come up.
This makes me think that something is going wrong in the restoration rather than in the backup.
Hello @c3-clement! I've spent some time trying to reproduce this on an AWS EKS cluster, but with MinIO instead of S3. I'm sorry to conclude that I did not manage to reproduce the issue.
I only saw two things that were a bit odd. First, after the restore, the new nodes attempted to gossip with one or two of the old ones. This can be explained by the operator restoring the system.peers
table.
Second, I did see two extra conditions on my datacenter (the oldest ones), but I can't imagine how those would impact the restore.
What happened?
On AWS (EKS and an S3 bucket), the Cassandra StatefulSet sometimes fails to come up after a restoration. On a K8ssandra cluster with 1 datacenter of 3 nodes, 2 of the pods never pass their readiness probe.
On the failed pods, the following error message shows up in the logs:
I have observed this issue many times in our AWS environments and have never seen it in our GCP and Azure environments. It seems to be AWS-specific.
Did you expect to see something different?
I expected the Cassandra StatefulSet to be up and running after the restoration.
How to reproduce it (as minimally and precisely as possible):
- Create a MedusaBackupJob
- Create a MedusaRestoreJob
The issue does not happen consistently; I would say it occurs on roughly 1 out of 5 attempts.
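For reference, a restore is triggered by creating a MedusaBackupJob and then a MedusaRestoreJob that points at it. The sketch below is illustrative only; the names (backup1, restore-backup1, dc1, k8ssandra-operator) are placeholders, not the manifests actually used in this report:

```yaml
# Illustrative sketch only; all names are placeholders.
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupJob
metadata:
  name: backup1
  namespace: k8ssandra-operator
spec:
  cassandraDatacenter: dc1   # datacenter to back up
---
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaRestoreJob
metadata:
  name: restore-backup1
  namespace: k8ssandra-operator
spec:
  cassandraDatacenter: dc1   # datacenter to restore into
  backup: backup1            # name of the backup to restore
```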
Environment
k8ssandra.yaml:
medusarestorejob.yaml:
medusabackup.yaml:
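The attached manifests are not reproduced inline. As a rough illustration only (and not the reporter's actual k8ssandra.yaml), a K8ssandraCluster with Medusa pointed at an S3 bucket typically has the following shape; every value below is a placeholder:

```yaml
# Illustrative shape only; cluster name, versions, bucket, region,
# secret name, and sizes are placeholders, not the attached manifest.
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
  namespace: k8ssandra-operator
spec:
  cassandra:
    serverVersion: "4.0.7"
    datacenters:
      - metadata:
          name: dc1
        size: 3                      # 1 dc of 3 nodes, as in the report
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: gp2
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 10Gi
  medusa:
    storageProperties:
      storageProvider: s3            # AWS S3 backend
      bucketName: my-backup-bucket
      region: us-east-1
      storageSecretRef:
        name: medusa-bucket-key      # secret holding the S3 credentials
```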
Anything else we need to know?:
Logs from pod k8scass-cs-001-k8scass-001-default-sts-1:
- medusa-restore logs: medusa-restore-logs.txt
- server-system-logger logs: server-system-logger.txt