Hi @chandapukiran! Sorry to hear you're having troubles. I'd need to know which C* version the cluster you backed up was running, and which version the cluster you're restoring to is running. Could you please share those?
Hi @rzvoncek, I have tested it on Cassandra version 4.1, so the cluster is basically the same; I just take a backup, prune the data, and try to restore.
Great, thanks. So let's try to narrow this issue down. Let's start by checking if the corrupt file is actually okay in the backup storage. Here's what I'd like you to do:
1. Download the backed up files for the corrupt SSTable from the backup storage (all components sharing the prefix, not just the `-Data.db`).
2. Then, use the following docker-compose file to spin up a Cassandra node in Docker:

```yaml
version: '3'
services:
  cassandra:
    image: cassandra:4.1.2
    volumes:
      - ./volumes/cassandra:/var/lib/cassandra
volumes:
  cassandra_data:
```

3. Run `docker-compose up` and let the node start once; this creates `./volumes/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377` (the ID will be different for you) in the folder next to the docker-compose file.
4. Stop the node with `docker-compose down` (or ctrl+c the up command).
5. Copy the downloaded files into that directory and bring the node back up with `docker-compose up`, then check whether it starts without the corruption error.
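For the download itself, something along these lines should work with the GCS CLI (a sketch; the exact object paths depend on your bucket, prefix and node name, so list the bucket first):

```sh
# Locate the system.local SSTables for the failing node in the backup bucket
# (<bucket> and <prefix> are placeholders for your own values)
gsutil ls -r "gs://<bucket>/<prefix>" | grep "system/local"

# Copy every component of the suspicious generation (nb-6 in your error), not just the -Data.db,
# using the path found in the listing above
gsutil cp "gs://<bucket>/<path-from-the-listing-above>/nb-6-big-*" .
```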
These are the files I have downloaded from GCP storage and copied to the above location:
nb-1-big-Data.db nb-1-big-Index.db nb-1-big-TOC.txt nb-1-big-Digest.crc32 nb-1-big-CompressionInfo.db nb-1-big-Summary.db nb-1-big-Filter.db nb-1-big-Statistics.db
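As an extra sanity check, each downloaded Data component can be compared against its recorded digest (a sketch, assuming `Digest.crc32` holds a plain-text CRC-32 of the `Data.db` file, which I believe it does for this SSTable format):

```sh
# CRC-32 of the downloaded Data component, printed as an unsigned decimal
python3 -c "import sys, zlib; print(zlib.crc32(open(sys.argv[1], 'rb').read()))" nb-1-big-Data.db

# Checksum Cassandra recorded at write time; the two numbers should match
cat nb-1-big-Digest.crc32
```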
Okay, thanks. So it seems like we can exclude the option that the file is corrupt in the storage. That said, "one pod" is still a bit vague. To be thorough, you'd need to download the backed up data of the failing pod specifically. I'm not sure that's what you did, and I don't want to waste more of your time.
Perhaps we can then move on to the second step. The second step builds on the fact that it's unlikely you or your app are writing anything into system.local, so the corruption is probably introduced somewhere in the stock backup/restore path itself. You even mention the problem is reproducible.
Could you please share the steps to reproduce the issue? It'll make it easier for us to debug the thing.
```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
spec:
  cassandra:
    serverVersion: "4.0.1"
    datacenters:
      - metadata:
          name: dc1
        size: 3
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: k8ssandra-poc
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
        config:
          jvmOptions:
            heapSize: 512M
  medusa:
    storageProperties:
      # Can be either of local, google_storage, azure_blobs, s3, s3_compatible, s3_rgw or ibm_storage
      storageProvider: google_storage
      # Name of the secret containing the credentials file to access the backup storage backend
      storageSecretRef:
        name: medusa-bucket-key
      # Name of the storage bucket
      bucketName: medusatest # changed the name
      # Prefix for this cluster in the storage bucket directory structure, used for multitenancy
      prefix: test
      region: us-east4
```
```sh
kubectl exec -it demo-dc1-default-sts-0 -c cassandra -- cqlsh -u $CASS_USERNAME -p $CASS_PASSWORD -e "CREATE TABLE test.users (email text primary key, name text, state text);"
kubectl exec -it demo-dc1-default-sts-0 -c cassandra -- cqlsh -u $CASS_USERNAME -p $CASS_PASSWORD -e "insert into test.users (email, name, state) values ('john@gamil.com', 'John Smith', 'NC');"
kubectl exec -it demo-dc1-default-sts-0 -c cassandra -- cqlsh -u $CASS_USERNAME -p $CASS_PASSWORD -e "select * from test.users;"
kubectl exec -it demo-dc1-default-sts-0 -c cassandra -- cqlsh -u $CASS_USERNAME -p $CASS_PASSWORD -e "TRUNCATE TABLE test.users;"
```
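The backup and restore themselves were triggered with the Medusa custom resources, roughly like the sketch below (resource names are illustrative and fields may differ slightly between operator versions), applied with `kubectl apply -f`:

```yaml
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupJob
metadata:
  name: backup-test
spec:
  # Datacenter to back up, matching the K8ssandraCluster above
  cassandraDatacenter: dc1
---
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaRestoreJob
metadata:
  name: restore-test
spec:
  cassandraDatacenter: dc1
  # Name of the backup to restore from
  backup: backup-test
```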
Please do let me know if you'd like me to test anything else or need any other information.
Hi @chandapukiran, I'm sorry but I cannot reproduce the issue. I've tried k8ssandra-operator versions 1.5.2, 1.7.0 and 1.8.0, with Cassandra versions 4.0.1 and 4.1.1.
The only lead I can imagine is your storage class. You're using `k8ssandra-poc`, which I'm not familiar with. In my case, I used a GKE cluster and a standard storage class.
Could you please (re)try with a standard class? Otherwise, the only thing I can suggest is looking into the storage class itself.
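To see what the class actually maps to, plain kubectl is enough, for example:

```sh
# List the storage classes and their provisioners
kubectl get storageclass

# Inspect the one the cluster uses (name taken from your K8ssandraCluster spec)
kubectl get storageclass k8ssandra-poc -o yaml
```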
Hi @rzvoncek, thank you. `k8ssandra-poc` is just the name of the storage class; it uses gce-pd on GCP, and on AWS I use gp2. I doubt the storage class is the cause, because at least on AWS Medusa backup/restore worked for me before and I'm only seeing the issue now. I will check it again with my storage class and let you know.
@rzvoncek It is working for me now; I have tested k8ssandra-operator 1.5.2 with Medusa 0.13.4. I was using a newer Medusa version with k8ssandra-operator 1.5.2, which was causing one of the issues.
Also, I was making the mistake of not deleting the old backups when creating a new test k8ssandra cluster, and that is why I was seeing the data corruption issue.
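For reference, pinning the Medusa image in the K8ssandraCluster spec is one way to keep the two versions aligned (a sketch, assuming the `containerImage` override in the `medusa` section is available; adjust to your operator version):

```yaml
medusa:
  containerImage:
    repository: k8ssandra
    name: medusa
    # Pin to the Medusa version known to work with k8ssandra-operator 1.5.2
    tag: "0.13.4"
```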
Thanks for your help
Hi, I am facing some strange issues with Medusa restore: after restoring a backup, the StatefulSet restarts, the pods get stuck at 2/3, and the server-system-logger container reports corruption-detected errors.
Backup and restore were working perfectly fine a couple of weeks ago when I tested Medusa, but they are not working now.
The issue is reproducible on both GCP GKE and AWS kOps clusters, and it occurs with both new and old backups. I have tested both the official Medusa image and an internally built image based on our own base image.
k8ssandra-operator versions tried - 1.5.2, 1.7.0 and 1.8.0
Below is the error message
```
NAME                                               READY   STATUS    RESTARTS   AGE
pod/demo-dc1-default-sts-0                         2/3     Running   0          5h15m
pod/demo-dc1-default-sts-1                         2/3     Running   0          5h15m
pod/demo-dc1-default-sts-2                         2/3     Running   0          5h15m
pod/medusa-poc-cass-operator-588778c6cf-tnfwm      1/1     Running   0          5h33m
pod/medusa-poc-k8ssandra-operator-5bf6fdfddb-pl2v5 1/1     Running   0          5h33
```
```
INFO  [main] 2023-07-24 05:23:41,256 ColumnFamilyStore.java:385 - Initializing system.IndexInfo
INFO  [main] 2023-07-24 05:23:42,355 ColumnFamilyStore.java:385 - Initializing system.batches
INFO  [main] 2023-07-24 05:23:42,362 ColumnFamilyStore.java:385 - Initializing system.paxos
INFO  [main] 2023-07-24 05:23:42,374 ColumnFamilyStore.java:385 - Initializing system.local
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,437 BufferPools.java:49 - Global buffer pool limit is 122.000MiB for chunk-cache and 30.000MiB for networking
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,460 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/nb-6-big (0.622KiB)
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,487 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/nb-7-big (0.049KiB)
INFO  [main] 2023-07-24 05:23:42,492 CacheService.java:100 - Initializing key cache with capacity of 24 MBs.
INFO  [main] 2023-07-24 05:23:42,507 CacheService.java:122 - Initializing row cache with capacity of 0 MBs
INFO  [main] 2023-07-24 05:23:42,509 CacheService.java:151 - Initializing counter cache with capacity of 12 MBs
INFO  [main] 2023-07-24 05:23:42,511 CacheService.java:162 - Scheduling counter cache save to every 7200 seconds (going to save all keys).
INFO  [main] 2023-07-24 05:23:42,601 ColumnFamilyStore.java:385 - Initializing system.peers_v2
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,604 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/peers_v2-c4325fbb8e5e3bafbd070f9250ed818e/nb-1-big (0.896KiB)
INFO  [main] 2023-07-24 05:23:42,611 ColumnFamilyStore.java:385 - Initializing system.peers
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,613 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/peers-37f71aca7dc2383ba70672528af04d4f/nb-1-big (0.877KiB)
INFO  [main] 2023-07-24 05:23:42,619 ColumnFamilyStore.java:385 - Initializing system.peer_events_v2
INFO  [main] 2023-07-24 05:23:42,623 ColumnFamilyStore.java:385 - Initializing system.peer_events
INFO  [main] 2023-07-24 05:23:42,628 ColumnFamilyStore.java:385 - Initializing system.compaction_history
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,631 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/nb-1-big (0.098KiB)
INFO  [main] 2023-07-24 05:23:42,637 ColumnFamilyStore.java:385 - Initializing system.sstable_activity
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,640 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/nb-1-big (0.143KiB)
INFO  [main] 2023-07-24 05:23:42,646 ColumnFamilyStore.java:385 - Initializing system.size_estimates
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,649 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/size_estimates-618f817b005f3678b8a453f3930b8e86/nb-1-big (2.657KiB)
INFO  [main] 2023-07-24 05:23:42,654 ColumnFamilyStore.java:385 - Initializing system.table_estimates
INFO  [SSTableBatchOpen:1] 2023-07-24 05:23:42,657 SSTableReaderBuilder.java:351 - Opening /var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-1-big (5.877KiB)
INFO  [main] 2023-07-24 05:23:42,663 ColumnFamilyStore.java:385 - Initializing system.available_ranges_v2
INFO  [main] 2023-07-24 05:23:42,667 ColumnFamilyStore.java:385 - Initializing system.available_ranges
INFO  [main] 2023-07-24 05:23:42,671 ColumnFamilyStore.java:385 - Initializing system.transferred_ranges_v2
INFO  [main] 2023-07-24 05:23:42,675 ColumnFamilyStore.java:385 - Initializing system.transferred_ranges
INFO  [main] 2023-07-24 05:23:42,680 ColumnFamilyStore.java:385 - Initializing system.view_builds_in_progress
INFO  [main] 2023-07-24 05:23:42,683 ColumnFamilyStore.java:385 - Initializing system.built_views
INFO  [main] 2023-07-24 05:23:42,687 ColumnFamilyStore.java:385 - Initializing system.prepared_statements
INFO  [main] 2023-07-24 05:23:42,691 ColumnFamilyStore.java:385 - Initializing system.repairs
INFO  [main] 2023-07-24 05:23:42,742 QueryProcessor.java:106 - Initialized prepared statement caches with 10 MB
ERROR [main] 2023-07-24 05:23:42,919 CassandraDaemon.java:909 - Exception encountered during startup
org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/nb-6-big-Data.db
	at org.apache.cassandra.io.util.CompressedChunkReader$Mmap.readChunk(CompressedChunkReader.java:229)
	at org.apache.cassandra.io.util.BufferManagingRebufferer.rebuffer(BufferManagingRebufferer.java:79)
	at org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:68)
	at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:210)
	at org.apache.cassandra.io.util.FileHandle.createReader(FileHandle.java:151)
	at org.apache.cassandra.io.sstable.format.SSTableReader.getFileDataInput(SSTableReader.java:1585)
	at org.apache.cassandra.db.columniterator.AbstractSSTableIterator.<init>(AbstractSSTableIterator.java:96)
	at org.apache.cassandra.db.columniterator.SSTableIterator.<init>(SSTableIterator.java:48)
	at org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:75)
	at org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:67)
	at org.apache.cassandra.db.StorageHook$1.makeRowIterator(StorageHook.java:87)
	at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndSSTablesInTimestampOrder(SinglePartitionReadCommand.java:888)
	at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:596)
	at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:569)
	at org.apache.cassandra.db.SinglePartitionReadCommand.queryStorage(SinglePartitionReadCommand.java:403)
	at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:377)
	at org.apache.cassandra.db.SinglePartitionReadQuery$Group.executeLocally(SinglePartitionReadQuery.java:242)
	at org.apache.cassandra.db.SinglePartitionReadQuery$Group.executeInternal(SinglePartitionReadQuery.java:216)
	at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:447)
	at org.apache.cassandra.cql3.statements.SelectStatement.executeLocally(SelectStatement.java:431)
	at org.apache.cassandra.cql3.statements.SelectStatement.executeLocally(SelectStatement.java:88)
	at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:323)
	at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:973)
	at org.apache.cassandra.service.StartupChecks$10.execute(StartupChecks.java:442)
	at org.apache.cassandra.service.StartupChecks.verify(StartupChecks.java:132)
	at org.apache.cassandra.service.CassandraDaemon.runStartupChecks(CassandraDaemon.java:487)
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:262)
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/nb-6-big-Data.db): corruption detected, chunk at 0 of length 627.
	at org.apache.cassandra.io.util.CompressedChunkReader$Mmap.readChunk(CompressedChunkReader.java:221)
	... 28 common frames omitted
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/nb-6-big-Data.db): corruption detected, chunk at 0 of length 627.
	at org.apache.cassandra.io.util.CompressedChunkReader$Mmap.readChunk(CompressedChunkReader.java:209)
	... 28 common frames omitted
```