scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Compacting of sstables interrupted due to `sstables::malformed_sstable_exception`: Checksummed chunk of size 0 at file offset #21650

Open juliayakovlev opened 14 hours ago

juliayakovlev commented 14 hours ago

Packages

Scylla version: 6.3.0~dev-20241119.733a4f94c7b2 with build-id 89a16d00ef82b3c15eb73e944afc60f15c478452
Kernel Version: 6.8.0-1019-aws

Issue description

`sstables::malformed_sstable_exception` was raised during compaction under write load:

Nov 20 14:50:49.602174 perf-latency-nemesis-ubuntu-db-node-1621284f-3 scylla[6311]:  [shard 4:comp] compaction - [Compact keyspace1.standard1 d5155f10-a74e-11ef-88af-0a57f7f33336] Compacting
 [/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_154y_4i1c022tu7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756
120a74c11ef9d200a5bf7f33336/me-3gle_1515_11q0w22tu7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_152e_5wels22
tu7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_153o_2n5xc22tu7dhshimgm-big-Data.db:level=0:origin=memtable,
/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14zu_3r14022tu7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-9775612
0a74c11ef9d200a5bf7f33336/me-3gle_156a_4d3v422tu7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14yh_3uw0022tu
7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14uj_2j3bk22tu7dhshimgm-big-Data.db:level=0:origin=memtable,/v
ar/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14t9_5az0022tu7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a
74c11ef9d200a5bf7f33336/me-3gle_14vu_0rndc22tu7dhshimgm-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14x5_27ink22tu7d
hshimgm-big-Data.db:level=0:origin=memtable]
Nov 20 14:51:35.009059 perf-latency-nemesis-ubuntu-db-node-1621284f-3 scylla[6311]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 af7f7bf0-a74e-11ef-b015-0a59f7f33336] Compacting of 2 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14vg_11alc2oebpia1otz5i-big-Data.db due to Checksummed chunk of size 0 at file offset 1415774208 failed checksum: expected=956923229, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
                                                                                       --------
                                                                                       --------
                                                                                       seastar::internal::coroutine_traits_base<sstables::compaction_result>::promise_type
                                                                                       --------
                                                                                       seastar::internal::coroutine_traits_base<std::optional<sstables::compaction_stats> >::promise_type
                                                                                       --------
                                                                                       seastar::shared_future<std::optional<sstables::compaction_stats> >::shared_state
Nov 20 14:51:35.010051 perf-latency-nemesis-ubuntu-db-node-1621284f-3 scylla[6311]:  [shard 6:comp] compaction_manager - Compaction task 0x6060043ad400 for table keyspace1.standard1 compaction_group=0 [0x606004d51a20]: failed: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14vg_11alc2oebpia1otz5i-big-Data.db due to Checksummed chunk of size 0 at file offset 1415774208 failed checksum: expected=956923229, actual=0). Will retry in 5 seconds

2024-11-20T14:51:35.270+00:00 perf-latency-nemesis-ubuntu-db-node-1621284f-3     !INFO | scylla[6311]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 af7f7bf0-a74e-11ef-b015-0a59f7f33336] 
Compacting of 2 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14vg_11alc2oebpia1otz5i-big-Data.db due to Checksummed chunk of size 0 at file offset 1415774208 failed checksum: expected=956923229, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
Nov 20 14:49:29.917775 perf-latency-nemesis-ubuntu-db-node-1621284f-2 scylla[6305]:  [shard 3:comp] compaction - [Compact keyspace1.standard1 a59667c0-a74e-11ef-8621-d2c47dce7d22] Compacting
 [/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_1531_2kt1s21fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756
120a74c11ef9d200a5bf7f33336/me-3gle_14yz_2fvkw21fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_151o_4lgsg21
fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14rg_21qbk21fjsxjwgrdia-big-Data.db:level=0:origin=memtable,
/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14xp_3xob421fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-9775612
0a74c11ef9d200a5bf7f33336/me-3gle_150b_4vjg021fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_154d_12sls21fj
sxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14sn_1mihc21fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/v
ar/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14wg_2b5ts21fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a
74c11ef9d200a5bf7f33336/me-3gle_14v6_17q2o21fjsxjwgrdia-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14tw_1sq8w21fjsx
jwgrdia-big-Data.db:level=0:origin=memtable]
Nov 20 14:50:01.552497 perf-latency-nemesis-ubuntu-db-node-1621284f-2 scylla[6305]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 82b851a0-a74e-11ef-8630-d2c07dce7d22] Compacting of 2 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14s6_5ih2821gpdjlapa90i-big-Data.db due to Checksummed chunk of size 0 at file offset 1049427968 failed checksum: expected=3787296288, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
                                                                                       --------
                                                                                       seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0>(seastar::thread_attributes, sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0&&)::{lambda()#2}, seastar::future<void>::then_impl_nrvo<seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0>(seastar::thread_attributes, sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0&&)::{lambda()#2}, seastar::future<sstables::compaction_result> >(sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0&&)::{lambda(seastar::internal::promise_base_with_type<sstables::compaction_result>&&, seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0>(seastar::thread_attributes, auto:1&&, (auto:2&&)...)::{lambda()#2}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
                                                                                       --------
                                                                                       seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_result>::finally_body<seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0>(seastar::thread_attributes, sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0&&)::{lambda()#3}, false>, seastar::future<sstables::compaction_result>::then_wrapped_nrvo<seastar::future<sstables::compaction_result>, seastar::future<sstables::compaction_result>::finally_body<seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0>(seastar::thread_attributes, sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0&&)::{lambda()#3}, false> >(seastar::future<sstables::compaction_result>::finally_body<seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0>(seastar::thread_attributes, sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0&&)::{lambda()#3}, false>&&)::{lambda(seastar::internal::promise_base_with_type<sstables::compaction_result>&&, seastar::future<sstables::compaction_result>::finally_body<seastar::async<sstables::compaction::run(std::unique_ptr<sstables::compaction, std::default_delete<sstables::compaction> >)::$_0>(seastar::thread_attributes, auto:1&&, (auto:2&&)...)::{lambda()#3}, false>&, seastar::future_state<sstables::compaction_result>&&)#1}, sstables::compaction_result>
                                                                                       --------
                                                                                       seastar::internal::coroutine_traits_base<sstables::compaction_result>::promise_type
                                                                                       --------
                                                                                       seastar::internal::coroutine_traits_base<std::optional<sstables::compaction_stats> >::promise_type
                                                                                       --------
                                                                                       seastar::shared_future<std::optional<sstables::compaction_stats> >::shared_state
Nov 20 14:50:01.553234 perf-latency-nemesis-ubuntu-db-node-1621284f-2 scylla[6305]:  [shard 6:comp] compaction_manager - Compaction task 0x606003ee3800 for table keyspace1.standard1 compaction_group=0 [0x606004b54a20]: failed: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14s6_5ih2821gpdjlapa90i-big-Data.db due to Checksummed chunk of size 0 at file offset 1049427968 failed checksum: expected=3787296288, actual=0). Will retry in 5 seconds
Nov 20 14:50:46.054386 perf-latency-nemesis-ubuntu-db-node-1621284f-2 scylla[6305]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 82b89fc0-a74e-11ef-8630-d2c07dce7d22] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_154v_2rg8w21gpdjlapa90i-big-Data.db:level=0]. 4GB to 4GB (~100% of original) in 134620ms = 30MB/s. ~3594880 total partitions merged to 3594803.

2024-11-20T14:50:02.042+00:00 perf-latency-nemesis-ubuntu-db-node-1621284f-2     !INFO | scylla[6305]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 82b851a0-a74e-11ef-8630-d2c07dce7d22] 
Compacting of 2 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-97756120a74c11ef9d200a5bf7f33336/me-3gle_14s6_5ih2821gpdjlapa90i-big-Data.db due to Checksummed chunk of size 0 at file offset 1049427968 failed checksum: expected=3787296288, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6

The corrupted files were collected and can be downloaded.

Impact

Compacting failed, data corruption

How frequently does it reproduce?

Last successful run with Scylla version 6.3.0~dev-20241115.5bc03da0c4c1: https://argus.scylladb.com/tests/scylla-cluster-tests/378c37dd-51c6-4623-87f7-f5419a5d976f

Commits between 733a4f94c7b2 and 5bc03da0c4c1:

git log --oneline --first-parent

733a4f94c7 Merge 'test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime' from Dawid Mędrek
7607f5e33e alternator: fix "/localnodes" to not return down nodes
980f6a48ab .github/scripts/auto-backport.py: validate backport candidate with `Fixes` prefix
165902b951 conf/scylla.yaml: update documentation for enable_tablets
36870feb29 Merge 'test: route S3 Proxy server messages through logger' from Kefu Chai
b14871ad3f Merge 'code cleanup: remove "sstring_view" and replace its usages by std::string_view' from Nadav Har'El
06d478793d Merge 'mutation: switch from boost ranges to std ranges' from Avi Kivity
34d7a4401d ./github/workflows/conflict_reminder.yaml: fix assignee object
572b005774 repair: implement tablet_repair_task_impl::release_resources
bef015da0d Revert ".github/scripts/auto-backport.py: validate backport candidate with `Fixes` prefix"
3a6c0a9b36 Merge 'compaction: Perform integrity checks on compacting SSTables' from Nikos Dragazis
f23800181a Merge 'Align Metric Family Descriptions' from Amnon Heiman
5bc03da0c4 tools/scylla-nodetool: rename estimated_row_count to estimated_partition_count

Installation details

Cluster size: 3 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0d782b5935da36d59 (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-with-nemesis
Test id: 1621284f-c8ee-45d7-a758-2e9088b28e8f
Test name: scylla-master/perf-regression/scylla-master-perf-regression-latency-650gb-with-nemesis
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_write_with_nemesis
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 1621284f-c8ee-45d7-a758-2e9088b28e8f`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=1621284f-c8ee-45d7-a758-2e9088b28e8f)
- Show all stored logs command: `$ hydra investigate show-logs 1621284f-c8ee-45d7-a758-2e9088b28e8f`

Logs:

- **db-cluster-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/db-cluster-1621284f.tar.gz
- **sct-runner-events-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/sct-runner-events-1621284f.tar.gz
- **sct-1621284f.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/sct-1621284f.log.tar.gz
- **loader-set-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/loader-set-1621284f.tar.gz
- **monitor-set-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/monitor-set-1621284f.tar.gz
- **corrupt files** - https://cloudius-jenkins-test.s3.us-east-1.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_153138/corrupfiles_1621284f.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/perf-regression/job/scylla-master-perf-regression-latency-650gb-with-nemesis/31/)
[Argus](https://argus.scylladb.com/test/4f088858-cbf2-4984-9d44-d319c207f521/runs?additionalRuns[]=1621284f-c8ee-45d7-a758-2e9088b28e8f)

juliayakovlev commented 14 hours ago

The same failure also occurred in these runs:

https://argus.scylladb.com/tests/scylla-cluster-tests/a07f086c-829f-422c-81bf-be872e7f2905
https://argus.scylladb.com/tests/scylla-cluster-tests/e1cb4e73-0f39-4bf6-b672-490597716033

mykaul commented 14 hours ago
> 3a6c0a9b36 Merge 'compaction: Perform integrity checks on compacting SSTables' from Nikos Dragazis

Sounds like a possible suspect. @ndragazis ?

ndragazis commented 12 hours ago

Looks like a premature EOF. The input stream expects to find more SSTable data than what is actually available. Not sure how we get there, will have to take a closer look.

How urgent is this? Also note that I don't have access to the corrupted files. Could you tell me what the file lengths are for the data and CRC components of the problematic SSTables?
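
For illustration, here is a minimal sketch of how a premature EOF could produce a message of this shape. It is not Scylla's reader: `read_checksummed` is a hypothetical helper, the CRC32 from Python's `zlib` stands in for the real checksum, and the 4-byte chunk size is chosen only to keep the example small. The point is that an empty read has a checksum of 0, so "size 0 ... actual=0" is what a reader would report if the stream ended while the checksum table still had entries.

```python
import io
import zlib


def read_checksummed(stream, checksums, chunk_size):
    """Read `stream` chunk by chunk and verify one CRC32 per chunk.

    Hypothetical helper for illustration; returns the number of bytes read.
    """
    offset = 0
    for expected in checksums:
        chunk = stream.read(chunk_size)
        actual = zlib.crc32(chunk)  # crc32 of an empty read is 0
        if actual != expected:
            raise ValueError(
                f"Checksummed chunk of size {len(chunk)} at file offset {offset} "
                f"failed checksum: expected={expected}, actual={actual}"
            )
        offset += len(chunk)
    return offset


# The checksum table describes two chunks, but the stream ends after the first.
chunk_size = 4
data = b"aaaabbbb"
checksums = [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]
truncated = io.BytesIO(data[:chunk_size])

try:
    read_checksummed(truncated, checksums, chunk_size)
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```

The failure is reported at the offset where the stream ran dry, which is always a chunk boundary in this model.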

juliayakovlev commented 12 hours ago

> Looks like a premature EOF. The input stream expects to find more SSTable data than what is actually available. Not sure how we get there, will have to take a closer look.
>
> How urgent is this? Also note that I don't have access to the corrupted files. Could you tell me what the file lengths are for the data and CRC components of the problematic SSTables?

@ndragazis I changed the permissions for this tarball, so it should be available now. Please try again.

ndragazis commented 12 hours ago

Yes, I have access now. I see that we only collect the data component, not the whole SSTable. This is a problem in this case because uncompressed SSTables store the checksums and data in separate files. So, I'll have to reproduce it.

mykaul commented 12 hours ago

> Yes, I have access now. I see that we only collect the data component, not the whole SSTable. This is a problem in this case because uncompressed SSTables store the checksums and data in separate files. So, I'll have to reproduce it.

Are those uncompressed sstables?

juliayakovlev commented 12 hours ago

> Yes, I have access now. I see that we only collect the data component, not the whole SSTable. This is a problem in this case because uncompressed SSTables store the checksums and data in separate files. So, I'll have to reproduce it.

@ndragazis I can re-run the test for you and keep the cluster

juliayakovlev commented 12 hours ago

> > Yes, I have access now. I see that we only collect the data component, not the whole SSTable. This is a problem in this case because uncompressed SSTables store the checksums and data in separate files. So, I'll have to reproduce it.
>
> Are those uncompressed sstables?

I collected the SSTables that are named in the error message.

denesb commented 11 hours ago

> > > Yes, I have access now. I see that we only collect the data component, not the whole SSTable. This is a problem in this case because uncompressed SSTables store the checksums and data in separate files. So, I'll have to reproduce it.
> >
> > Are those uncompressed sstables?
>
> I collected the SSTables that are named in the error message.

When collecting sstables, please always collect all components, not just the Data one. The Data component alone is never enough to diagnose a problem related to sstables.
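
As a sketch of what "all components" could mean in a collection script: the helper below gathers every file that shares the SSTable's name prefix with the Data file from an error message. `sibling_components` is a hypothetical name, and the `<prefix>-Data.db` / `<prefix>-<Component>` naming is assumed from the file names that appear in this thread, not taken from Scylla's code.

```python
from pathlib import Path

DATA_SUFFIX = "-Data.db"


def sibling_components(data_file):
    """Return all files in the same directory that belong to the same SSTable.

    Assumes every component is named `<prefix>-<Component>`, where `<prefix>`
    is the Data file name without its trailing `-Data.db`.
    """
    path = Path(data_file)
    if not path.name.endswith(DATA_SUFFIX):
        raise ValueError(f"not a Data component: {path.name}")
    prefix = path.name[: -len(DATA_SUFFIX)] + "-"
    return sorted(
        p for p in path.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
```

Passing the path from the error message would then return the Data file together with its CRC, Digest, Index, and other siblings, so they can be archived as one set.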

ndragazis commented 11 hours ago

> Are those uncompressed sstables?

Yes, we can tell from the fact that the "checksummed" data source is used. For compressed SSTables we use the "compressed" data source.

raphaelsc commented 11 hours ago

> > Are those uncompressed sstables?
>
> Yes, we can tell from the fact that the "checksummed" data source is used. For compressed SSTables we use the "compressed" data source.

@ndragazis I'd look for some logical error like memory corruption, given that it has been reproduced more than once. I don't think it's an actual corruption (but who knows...), I think we can verify that by checking if CRC indeed has zeroed chunk size. Have we modified the checksummed data sink in your series? Just the read path, right?

raphaelsc commented 11 hours ago

@ndragazis why are we reading checksum metadata on every reader? I think it would be the best to read it once when loading the sstable and reuse it for every read, since metadata is immutable. that can be the cause of perf drop for reading uncompressed files.

as for the corruption, I have a theory: the data source is outliving the reader somehow (we have seen such problems before), and at that point, the former holds a stale reference to checksum. but let's confirm we don't have a corruption first, by verifying the sstable reported as bad is actually bad (scylla sstable tool can be used to verify the integrity of files, there's a checksum verification command).

+    integrity_check _integrity;
+    lw_shared_ptr<checksum> _checksum;

     // For reversed (single partition) reads, points to the current position in the sstable
     // of the reversing data source used underneath (see `partition_reversing_data_source`).
@@ -1279,7 +1281,8 @@ class mx_sstable_mutation_reader : public mp_row_consumer_reader_mx {
                             tracing::trace_state_ptr trace_state,
                             streamed_mutation::forwarding fwd,
                             mutation_reader::forwarding fwd_mr,
-                            read_monitor& mon)
+                            read_monitor& mon,
+                            integrity_check integrity)
             : mp_row_consumer_reader_mx(std::move(schema), permit, std::move(sst))
             , _slice_holder(std::move(slice))
             , _slice(_slice_holder.get())
@@ -1291,7 +1294,8 @@ class mx_sstable_mutation_reader : public mp_row_consumer_reader_mx {
             , _pr(pr)
             , _fwd(fwd)
             , _fwd_mr(fwd_mr)
-            , _monitor(mon) {
+            , _monitor(mon)
+            , _integrity(integrity) {
         if (reversed()) {
             if (!_single_partition_read) {
                 on_internal_error(sstlog, format(
@@ -1543,21 +1547,33 @@ class mx_sstable_mutation_reader : public mp_row_consumer_reader_mx {

         sstlog.trace("sstable_reader: {}: data file range [{}, {})", fmt::ptr(this), begin, *end);

+        if (_integrity) {
+            // Caller must retain a reference to checksum component while in use by the stream.
+            _checksum = co_await _sst->read_checksum();

mykaul commented 10 hours ago

@juliayakovlev - can you check if the test actually creates uncompressed sstables?

juliayakovlev commented 10 hours ago

@ndragazis

The issue reproduced in https://argus.scylladb.com/tests/scylla-cluster-tests/2984a530-6877-49a0-9b7c-79aeaa665cbf and the cluster is still alive.

The failure is on the perf-latency-nemesis-ubuntu-db-node-2984a530-3 (54.246.35.44) node. The SSTable is /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db

2024-11-21T13:01:20.379+00:00 perf-latency-nemesis-ubuntu-db-node-2984a530-3     !INFO | scylla[6338]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 a05478f0-a808-11ef-a484-43bf07a89f61] 
Compacting of 3 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db due to Checksummed chunk of size 0 at file offset 332791808 failed checksum: expected=3035332331, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6

2024-11-21T13:00:37.879+00:00 perf-latency-nemesis-ubuntu-db-node-2984a530-3     !INFO | scylla[6338]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 8652c1f0-a808-11ef-a484-43bf07a89f61] 
Compacting of 3 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db due to Checksummed chunk of size 0 at file offset 332791808 failed checksum: expected=3035332331, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6

2024-11-21T12:59:59.131+00:00 perf-latency-nemesis-ubuntu-db-node-2984a530-3     !INFO | scylla[6338]:  [shard 6:comp] compaction - [Compact keyspace1.standard1 6911be70-a808-11ef-a484-43bf07a89f61] 
Compacting of 3 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db due to Checksummed chunk of size 0 at file offset 332791808 failed checksum: expected=3035332331, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
ndragazis commented 10 hours ago

> Have we modified the checksummed data sink in your series? Just the read path, right?

No, changes affect only the read path.

> @ndragazis why are we reading checksum metadata on every reader? I think it would be the best to read it once when loading the sstable and reuse it for every read, since metadata is immutable.

Not sure what you mean by "every reader" but it's not used in all read paths. I first injected this reader into the validation path (https://github.com/scylladb/scylladb/pull/20207), which is used only by scrub, and later on in compaction as well (https://github.com/scylladb/scylladb/pull/21153). Any other read operation on uncompressed SSTables will still use a raw input stream. So, I don't think there is any merit in caching the checksum metadata.

> as for the corruption, I have a theory: the data source is outliving the reader somehow (we have seen such problems before), and at that point, the former holds a stale reference to checksum.

Well, from the logs we see the reader reaching file offset 1415774208. Assuming a 64 KiB chunk size, that is 21603 chunks that have passed checksum verification. So far so good, and then we hit an unexpected EOF, which also happens to be perfectly aligned with a chunk boundary.
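
A quick way to redo this arithmetic for the failing offsets reported in this thread. The 64 KiB chunk size is an assumption at this point in the discussion, and `locate` is just an illustrative helper name.

```python
CHUNK_SIZE = 64 * 1024  # assumed chunk size in bytes


def locate(offset, chunk_size=CHUNK_SIZE):
    """Return (number of whole chunks before `offset`, bytes into the next chunk)."""
    return divmod(offset, chunk_size)


# File offsets taken from the error messages quoted in this issue.
for offset in (1415774208, 1049427968, 332791808):
    chunks, remainder = locate(offset)
    print(f"offset={offset} chunks={chunks} remainder={remainder}")
```

A remainder of 0 means the failure sits exactly on a chunk boundary.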

raphaelsc commented 9 hours ago

> > Have we modified the checksummed data sink in your series? Just the read path, right?
>
> No, changes affect only the read path.
>
> > @ndragazis why are we reading checksum metadata on every reader? I think it would be the best to read it once when loading the sstable and reuse it for every read, since metadata is immutable.
>
> Not sure what you mean by "every reader" but it's not used in all read paths. I first injected this reader into the validation path (#20207), which is used only by scrub, and later on in compaction as well (#21153). Any other read operation on uncompressed SSTables will still use a raw input stream. So, I don't think there is any merit in caching the checksum metadata.

That's what I missed, thanks for the explanation.

juliayakovlev commented 9 hours ago

@ndragazis Do you still need this cluster, or is it no longer relevant?

ndragazis commented 9 hours ago

$ ls -l me-3glf_0zvj_1fva82i2dvfg17nn1d-*
-rw-r--r-- 2 scylla scylla     282460 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-CRC.db
-rw-r--r-- 2 scylla scylla 4627695442 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db
-rw-r--r-- 2 scylla scylla         10 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Digest.crc32
-rw-r--r-- 2 scylla scylla    5217136 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Filter.db
-rw-r--r-- 2 scylla scylla   74872507 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Index.db
-rw-r--r-- 2 scylla scylla      90479 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Scylla.db
-rw-r--r-- 2 scylla scylla       4929 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Statistics.db
-rw-r--r-- 2 scylla scylla    2545298 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Summary.db
-rw-r--r-- 2 scylla scylla         90 Nov 21 12:54 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-TOC.txt
$ python3 -c \
'with open("me-3glf_0zvj_1fva82i2dvfg17nn1d-big-CRC.db", "rb") as f:'\
'    print(int.from_bytes(f.read(4), "big"))'
65536

We have 4627695442 bytes of data, but the stream returns EOF at offset 332791808 ??? This is regular compaction that scans the whole file, right? Could this be a bug in read_exactly()?

The CRC contains 282460 bytes, which means one 4-byte integer for the chunk size, and 70614 4-byte checksums. This means we expect 70614 chunks in the Data file. The chunk size is 65536 bytes. Given that the last chunk can be smaller than the chunk size, the expected Data length can be between 4627693569 and 4627759104 bytes, which is true in our case.
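
The same size check as a small sketch, in case it helps to repeat it for other SSTables. It assumes the CRC layout described above (one 4-byte chunk size followed by one 4-byte checksum per chunk); `expected_data_range` is a hypothetical helper name.

```python
def expected_data_range(crc_file_size, chunk_size):
    """Return (chunk count, min Data length, max Data length) implied by a CRC file.

    Assumes the CRC file holds a 4-byte chunk size followed by one 4-byte
    checksum per chunk, and that only the last chunk may be short.
    """
    if crc_file_size < 4 or crc_file_size % 4 != 0:
        raise ValueError("CRC file size must be a positive multiple of 4")
    chunks = crc_file_size // 4 - 1
    if chunks == 0:
        return 0, 0, 0
    smallest = (chunks - 1) * chunk_size + 1  # last chunk holds a single byte
    largest = chunks * chunk_size             # last chunk is full
    return chunks, smallest, largest


# Sizes taken from the `ls -l` listing above.
chunks, smallest, largest = expected_data_range(282460, 65536)
data_size = 4627695442
print(chunks, smallest, largest, smallest <= data_size <= largest)
```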

ndragazis commented 8 hours ago

The SSTable is definitely not corrupted.

$ scylla sstable validate-checksums ./me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db | jq .
{
  "sstables": {
    "/var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db": {
      "has_checksums": true,
      "valid_checksums": true
    }
  }
}
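
For readers without the tool at hand, a rough sketch of what a per-chunk check could look like for an uncompressed SSTable. This is not how `scylla sstable validate-checksums` is implemented. It assumes a big-endian 4-byte chunk size followed by big-endian 4-byte checksums in the CRC file, and it assumes the checksum is the CRC32 computed by Python's `zlib`; both are assumptions that should be confirmed against the SSTable format before relying on the result. `validate_checksums` is a hypothetical helper name.

```python
import struct
import zlib


def validate_checksums(data_path, crc_path):
    """Compare each Data chunk against the checksum stored in the CRC file.

    Returns (True, None) if all chunks match, otherwise (False, index of the
    first chunk that is missing, mismatching, or unexpected).
    """
    with open(crc_path, "rb") as crc_file:
        header = crc_file.read(4)
        raw = crc_file.read()
    if len(header) != 4 or len(raw) % 4 != 0:
        raise ValueError("unexpected CRC file layout")
    (chunk_size,) = struct.unpack(">I", header)
    expected = struct.unpack(f">{len(raw) // 4}I", raw)

    with open(data_path, "rb") as data_file:
        for index, want in enumerate(expected):
            chunk = data_file.read(chunk_size)
            # An empty read here means the Data file ended before the checksums did.
            if not chunk or zlib.crc32(chunk) != want:
                return False, index
        if data_file.read(1):
            # Data continues past the last checksummed chunk.
            return False, len(expected)
    return True, None
```

A result of `(False, n)` with an empty read at chunk `n` would correspond to a Data file that is genuinely shorter than its CRC file claims, which is the on-disk corruption case this issue rules out.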
mykaul commented 8 hours ago

Perhaps something with encryption?