juliayakovlev opened 14 hours ago
> 3a6c0a9b36 Merge 'compaction: Perform integrity checks on compacting SSTables' from Nikos Dragazis

Sounds like a possible suspect. @ndragazis?
Looks like a premature EOF. The input stream expects to find more SSTable data than what is actually available. Not sure how we get there, will have to take a closer look.
How urgent is this? Also note that I don't have access to the corrupted files. Could you tell me what the file lengths are for the data and CRC components of the problematic SSTables?
@ndragazis I changed permissions for this tar. It should be available now. Please try again.
Yes, I have access now. I see that we only collect the data component, not the whole SSTable. This is a problem in this case because uncompressed SSTables store the checksums and data in separate files. So, I'll have to reproduce it.

Are those uncompressed sstables?

@ndragazis I can re-run the test for you and keep the cluster.

> I see that we only collect the data component, not the whole SSTable.

I collected the SSTables that appear in the error message.

> I collected the SSTables that appear in the error message.
When collecting sstables, please always collect all components, not just the Data one. The Data component alone is never enough to diagnose a problem related to sstables.
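A minimal sketch of that collection step, assuming a Python helper run from the table directory. The `prefix` below is the generation from the error message; the helper itself and the tarball layout are illustrative, not part of SCT:

```python
# Hypothetical helper: archive *all* components of one sstable generation.
import glob
import tarfile

# Generation reported in the error message (assumption: cwd is the table dir).
prefix = "me-3glf_0zvj_1fva82i2dvfg17nn1d-big"

with tarfile.open(f"{prefix}-components.tar.gz", "w:gz") as tar:
    # Matches Data.db, CRC.db, Digest.crc32, Index.db, Statistics.db, TOC.txt, ...
    for path in sorted(glob.glob(f"{prefix}-*")):
        tar.add(path)
        print("collected", path)
```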
> Are those uncompressed sstables?

Yes, we can tell from the fact that the "checksummed" data source is used. For compressed SSTables we use the "compressed" data source.
@ndragazis I'd look for some logical error like memory corruption, given that it has been reproduced more than once. I don't think it's an actual corruption (but who knows...); we can verify that by checking whether the CRC component indeed has a zeroed chunk size. Have we modified the checksummed data sink in your series? Just the read path, right?

@ndragazis why are we reading checksum metadata on every reader? I think it would be best to read it once when loading the sstable and reuse it for every read, since the metadata is immutable. That could be the cause of the perf drop for reading uncompressed files.

As for the corruption, I have a theory: the data source is somehow outliving the reader (we have seen such problems before), and at that point the former holds a stale reference to the checksum. But let's first confirm we don't have an actual corruption, by verifying that the sstable reported as bad really is bad (the scylla sstable tool can verify file integrity; there's a checksum verification command).
```diff
+    integrity_check _integrity;
+    lw_shared_ptr<checksum> _checksum;
     // For reversed (single partition) reads, points to the current position in the sstable
     // of the reversing data source used underneath (see `partition_reversing_data_source`).
@@ -1279,7 +1281,8 @@ class mx_sstable_mutation_reader : public mp_row_consumer_reader_mx {
             tracing::trace_state_ptr trace_state,
             streamed_mutation::forwarding fwd,
             mutation_reader::forwarding fwd_mr,
-            read_monitor& mon)
+            read_monitor& mon,
+            integrity_check integrity)
         : mp_row_consumer_reader_mx(std::move(schema), permit, std::move(sst))
         , _slice_holder(std::move(slice))
         , _slice(_slice_holder.get())
@@ -1291,7 +1294,8 @@ class mx_sstable_mutation_reader : public mp_row_consumer_reader_mx {
         , _pr(pr)
         , _fwd(fwd)
         , _fwd_mr(fwd_mr)
-        , _monitor(mon) {
+        , _monitor(mon)
+        , _integrity(integrity) {
         if (reversed()) {
             if (!_single_partition_read) {
                 on_internal_error(sstlog, format(
@@ -1543,21 +1547,33 @@ class mx_sstable_mutation_reader : public mp_row_consumer_reader_mx {
         sstlog.trace("sstable_reader: {}: data file range [{}, {})", fmt::ptr(this), begin, *end);
+        if (_integrity) {
+            // Caller must retain a reference to checksum component while in use by the stream.
+            _checksum = co_await _sst->read_checksum();
```
@juliayakovlev - can you check if the test actually creates uncompressed sstables?
@ndragazis
The issue is reproduced: https://argus.scylladb.com/tests/scylla-cluster-tests/2984a530-6877-49a0-9b7c-79aeaa665cbf. The cluster is alive.

The failure is on the perf-latency-nemesis-ubuntu-db-node-2984a530-3 (54.246.35.44) node. The SSTable is /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db

```
2024-11-21T13:01:20.379+00:00 perf-latency-nemesis-ubuntu-db-node-2984a530-3 !INFO | scylla[6338]: [shard 6:comp] compaction - [Compact keyspace1.standard1 a05478f0-a808-11ef-a484-43bf07a89f61] Compacting of 3 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db due to Checksummed chunk of size 0 at file offset 332791808 failed checksum: expected=3035332331, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
2024-11-21T13:00:37.879+00:00 perf-latency-nemesis-ubuntu-db-node-2984a530-3 !INFO | scylla[6338]: [shard 6:comp] compaction - [Compact keyspace1.standard1 8652c1f0-a808-11ef-a484-43bf07a89f61] Compacting of 3 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db due to Checksummed chunk of size 0 at file offset 332791808 failed checksum: expected=3035332331, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
2024-11-21T12:59:59.131+00:00 perf-latency-nemesis-ubuntu-db-node-2984a530-3 !INFO | scylla[6338]: [shard 6:comp] compaction - [Compact keyspace1.standard1 6911be70-a808-11ef-a484-43bf07a89f61] Compacting of 3 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db due to Checksummed chunk of size 0 at file offset 332791808 failed checksum: expected=3035332331, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
```
> Have we modified the checksummed data sink in your series? Just the read path, right?

No, the changes affect only the read path.

> @ndragazis why are we reading checksum metadata on every reader? I think it would be the best to read it once when loading the sstable and reuse it for every read, since metadata is immutable.

Not sure what you mean by "every reader", but it's not used in all read paths. I first injected this reader into the validation path (https://github.com/scylladb/scylladb/pull/20207), which is used only by scrub, and later into compaction as well (https://github.com/scylladb/scylladb/pull/21153). Any other read operation on uncompressed SSTables still uses a raw input stream. So I don't think there is any merit in caching the checksum metadata.

> as for the corruption, I have a theory: the data source is outliving the reader somehow (we have seen such problems before), and at that point, the former holds a stale reference to checksum.

Well, from the logs we see the reader reaching file offset 1415774208. Assuming a 64 KiB chunk size, that is 21603 chunks that have passed checksum verification. So far so good, and then we hit an unexpected EOF, which also happens to be perfectly aligned with a chunk boundary.
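A quick sanity check of that arithmetic (a sketch; the 64 KiB chunk size is the value later read from the CRC component):

```python
# Is the EOF offset exactly aligned to a checksummed-chunk boundary?
chunk_size = 64 * 1024      # 64 KiB, per the CRC component
offset = 1415774208         # file offset the reader reached before EOF

full_chunks, remainder = divmod(offset, chunk_size)
print(full_chunks, remainder)   # -> 21603 0: 21603 fully verified chunks
```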
> Not sure what you mean by "every reader" but it's not used in all read paths. I first injected this reader into the validation path (#20207), which is used only by scrub, and later on in compaction as well (#21153). Any other read operation on uncompressed SSTables will still use a raw input stream. So, I don't think there is any merit in caching the checksum metadata.
That's what I missed, thanks for the explanation.
> The issue is reproduced https://argus.scylladb.com/tests/scylla-cluster-tests/2984a530-6877-49a0-9b7c-79aeaa665cbf The cluster is alive.
@ndragazis Do you need this cluster, or is it no longer relevant?
> @ndragazis The issue is reproduced https://argus.scylladb.com/tests/scylla-cluster-tests/2984a530-6877-49a0-9b7c-79aeaa665cbf The cluster is alive.
>
> 2024-11-21T13:01:20.379+00:00 perf-latency-nemesis-ubuntu-db-node-2984a530-3 !INFO | scylla[6338]: [shard 6:comp] compaction - [Compact keyspace1.standard1 a05478f0-a808-11ef-a484-43bf07a89f61] Compacting of 3 sstables interrupted due to: sstables::malformed_sstable_exception (Failed to read partition from SSTable /var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db due to Checksummed chunk of size 0 at file offset 332791808 failed checksum: expected=3035332331, actual=0), at 0x6217fae 0x62185d0 0x62188d8 0x2574dda 0x257312e 0x60ecfe6
```
$ ls -l me-3glf_0zvj_1fva82i2dvfg17nn1d-*
-rw-r--r-- 2 scylla scylla     282460 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-CRC.db
-rw-r--r-- 2 scylla scylla 4627695442 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db
-rw-r--r-- 2 scylla scylla         10 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Digest.crc32
-rw-r--r-- 2 scylla scylla    5217136 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Filter.db
-rw-r--r-- 2 scylla scylla   74872507 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Index.db
-rw-r--r-- 2 scylla scylla      90479 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Scylla.db
-rw-r--r-- 2 scylla scylla       4929 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Statistics.db
-rw-r--r-- 2 scylla scylla    2545298 Nov 21 12:57 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Summary.db
-rw-r--r-- 2 scylla scylla         90 Nov 21 12:54 me-3glf_0zvj_1fva82i2dvfg17nn1d-big-TOC.txt

$ python3 -c \
'with open("me-3glf_0zvj_1fva82i2dvfg17nn1d-big-CRC.db", "rb") as f:'\
' print(int.from_bytes(f.read(4), "big"))'
65536
```
We have 4627695442 bytes of data, but the stream returns EOF at offset 332791808??? This is regular compaction that scans the whole file, right? Could this be a bug in read_exactly()?
The CRC contains 282460 bytes, which means one 4-byte integer for the chunk size and 70614 4-byte checksums. So we expect 70614 chunks in the Data file. The chunk size is 65536 bytes. Given that the last chunk can be smaller than the chunk size, the expected Data length must be between 4627693569 and 4627759104 bytes; the actual length, 4627695442 bytes, falls within this range.
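The same arithmetic as a small Python sketch (values taken from the ls and chunk-size outputs above):

```python
# Derive the expected Data file length from the CRC component size.
crc_len = 282460                            # size of me-...-big-CRC.db
chunk_size = 65536                          # first 4 bytes of the CRC component

n_chunks = (crc_len - 4) // 4               # one 4-byte CRC per chunk -> 70614
max_len = n_chunks * chunk_size             # every chunk full -> 4627759104
min_len = (n_chunks - 1) * chunk_size + 1   # last chunk >= 1 byte -> 4627693569

data_len = 4627695442                       # actual size of me-...-big-Data.db
assert min_len <= data_len <= max_len       # the Data file is not truncated
```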
The SSTable is definitely not corrupted.
```
$ scylla sstable validate-checksums ./me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db | jq .
{
  "sstables": {
    "/var/lib/scylla/data/keyspace1/standard1-a5a29320a80611ef984ae2d13baa59f4/me-3glf_0zvj_1fva82i2dvfg17nn1d-big-Data.db": {
      "has_checksums": true,
      "valid_checksums": true
    }
  }
}
```
Perhaps something with encryption?
## Packages

Scylla version: 6.3.0~dev-20241119.733a4f94c7b2 with build-id 89a16d00ef82b3c15eb73e944afc60f15c478452

Kernel Version: 6.8.0-1019-aws

## Issue description

sstables::malformed_sstable_exception while write load. Corrupted files are collected and can be downloaded.

## Impact

Compaction failed; data corruption.

## How frequently does it reproduce?

Last successful run with Scylla version 6.3.0~dev-20241115.5bc03da0c4c1: https://argus.scylladb.com/tests/scylla-cluster-tests/378c37dd-51c6-4623-87f7-f5419a5d976f

Commits between 733a4f94c7b2 and 5bc03da0c4c1:
## Installation details

Cluster size: 3 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0d782b5935da36d59 (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-with-nemesis
Test id: 1621284f-c8ee-45d7-a758-2e9088b28e8f
Test name: scylla-master/perf-regression/scylla-master-perf-regression-latency-650gb-with-nemesis
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_write_with_nemesis
Test config file(s):

## Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 1621284f-c8ee-45d7-a758-2e9088b28e8f`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=1621284f-c8ee-45d7-a758-2e9088b28e8f)
- Show all stored logs command: `$ hydra investigate show-logs 1621284f-c8ee-45d7-a758-2e9088b28e8f`

## Logs

- **db-cluster-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/db-cluster-1621284f.tar.gz
- **sct-runner-events-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/sct-runner-events-1621284f.tar.gz
- **sct-1621284f.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/sct-1621284f.log.tar.gz
- **loader-set-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/loader-set-1621284f.tar.gz
- **monitor-set-1621284f.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_145419/monitor-set-1621284f.tar.gz
- **corrupt files** - https://cloudius-jenkins-test.s3.us-east-1.amazonaws.com/1621284f-c8ee-45d7-a758-2e9088b28e8f/20241120_153138/corrupfiles_1621284f.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/perf-regression/job/scylla-master-perf-regression-latency-650gb-with-nemesis/31/)
[Argus](https://argus.scylladb.com/test/4f088858-cbf2-4984-9d44-d319c207f521/runs?additionalRuns[]=1621284f-c8ee-45d7-a758-2e9088b28e8f)