k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

number of racks in source != racks in destination #1063

Open · Mokto opened this issue 11 months ago

Mokto commented 11 months ago

What happened? After updating to 1.8.1, and running a restore from a backup (I tried full & differential), k8ssandra-operator complains the number of rack is different between the source & the destination.

Did you expect to see something different?

The restore should work. It used to run properly on 1.6.1.

How to reproduce it (as minimally and precisely as possible):

kubectl get cassandradatacenter -n databases-production-cassandra-green -o=jsonpath='{.items[0].spec.racks}'

Running this against both clusters returns the same value.
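
For reference, the restore job I apply looks roughly like this (a sketch from memory; the spec fields follow the medusa.k8ssandra.io/v1alpha1 MedusaRestoreJob CRD as I understand it, and the datacenter name here is just a placeholder):

apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaRestoreJob
metadata:
  name: restore-backup2
  namespace: databases-production-cassandra-green
spec:
  # Name of the MedusaBackup to restore from (the backup job shown below)
  backup: medusa-backup230923-full
  # CassandraDatacenter in the destination namespace (placeholder name)
  cassandraDatacenter: dc1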

Environment

My backup job that I'm trying to restore

apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupJob
metadata:
  creationTimestamp: "2023-09-23T19:05:54Z"
  generation: 1
  labels:
    argocd.argoproj.io/instance: databases-production-cassandra
  name: medusa-backup230923-full
  namespace: databases-production-cassandra
  ownerReferences:
  - apiVersion: cassandra.datastax.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: CassandraDatacenter
    name: dc1
    uid: 93b04818-eab1-42bf-a6eb-c5a0dbd855b7
  resourceVersion: "1777665157"
spec:
  backupType: full
  cassandraDatacenter: dc1
status:
  finishTime: "2023-09-23T19:07:54Z"
  finished:
  - production-dc1-default-sts-5
  - production-dc1-default-sts-4
  - production-dc1-default-sts-1
  - production-dc1-default-sts-6
  - production-dc1-default-sts-3
  - production-dc1-default-sts-2
  - production-dc1-default-sts-0
  startTime: "2023-09-23T19:06:09Z"

Operator log:

2023-09-24T06:19:46.893Z    ERROR   Failed to prepare restore   {"controller": "medusarestorejob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaRestoreJob", "MedusaRestoreJob": {"name":"restore-backup2","namespace":"databases-production-cassandra-green"}, "namespace": "databases-production-cassandra-green", "name": "restore-backup2", "reconcileID": "22847498-1698-49a9-8c75-8c200e840e7b", "medusarestorejob": "databases-production-cassandra-green/restore-backup2", "error": "number of racks in source != racks in destination"}
github.com/k8ssandra/k8ssandra-operator/controllers/medusa.(*MedusaRestoreJobReconciler).Reconcile
    /workspace/controllers/medusa/medusarestorejob_controller.go:111
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235

Anything else we need to know?: Just ask.

Thanks!


burmanm commented 11 months ago

Can you output the MedusaBackup object you have?

adejanovski commented 11 months ago

I'm also wondering whether the tokenmap files in the backups contain the dc and rack placements. These were added sometime last year and are now used to perform the mappings. If the backups were taken before we added this, they won't have the info that's needed to perform the restore. Do you have access to the backup in the object storage, and could you check the contents of the meta/tokenmap.json file for this backup? Does it contain datacenter and rack entries for each node? Do you know which version of Medusa was used to take this backup?
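
For reference, an entry in meta/tokenmap.json produced by a recent enough Medusa should look roughly like this, with one entry per node (the key names and token values here are from memory, so treat them as an approximation and check them against your actual file):

{
  "production-dc1-default-sts-0": {
    "tokens": [-1234567890123456789],
    "is_up": true,
    "dc": "dc1",
    "rack": "default"
  }
}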

Mokto commented 11 months ago

I used 1.8.1 to take the backup; I'm not sure which Medusa version was associated with it.

Actually I think I figured it out. The dc name was different even though the configuration was exactly the same.

When I used the same DC name in another cluster, it started working.

Could it be something like that?

My MedusaBackup:

apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackup
metadata:
  creationTimestamp: "2023-09-23T19:07:54Z"
  generation: 1
  name: medusa-backup230923-full
  namespace: databases-production-cassandra
  resourceVersion: "1777665156"
  uid: 9360861b-929f-4040-856e-91d8928674ee
spec:
  backupType: full
  cassandraDatacenter: dc1
status:
  finishTime: "2023-09-23T19:07:54Z"
  startTime: "2023-09-23T19:06:09Z"

adejanovski commented 11 months ago

Yes, the error message is probably not the right one. You cannot restore a backup from a DC named differently than the one you're restoring to, because the schema has replication settings that contain the DC name. Since we're restoring the schema sstables as-is, we can't modify the replication settings prior to the restore. The error message should explain the problem better, though; we'll get that addressed. @rzvoncek, feel like picking this one up?
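
In the meantime, keeping the destination DC name identical to the source is the way to go, e.g. something along these lines (the cluster name, server version and size here are just illustrative):

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: production
  namespace: databases-production-cassandra-green
spec:
  cassandra:
    serverVersion: "4.0.7"
    datacenters:
      - metadata:
          name: dc1   # must match the DC name the backup was taken from
        size: 7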