bug: dedicated cdc source writes the same full key more than once in an epoch

hzxa21 commented 5 months ago

Describe the bug

Recently, two users reported the following assertion being triggered in compaction, in both cases on data related to the cdc source state table: https://github.com/risingwavelabs/risingwave/blob/9f2ac7e06d82f03c94dcb3db549cd1c8a9ccdb8d/src/storage/hummock_sdk/src/key.rs#L1069

Here is some information about the two user reports:

v..7.2: brand new cluster and compactor panics after table creation. No more information provided.

v1.8.0: brand new cluster and compactor panics after table creation (using dedicated cdc source).

Log:

key UserKey { 13, TableKey { 00000001313200000000000002 } } epoch EpochWithGap(6258454417244160) >= prev epoch EpochWithGap(6258454417244160)
stack backtrace:
thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1066:21:
key UserKey { 13, TableKey { 00000001313200000000000002 } } epoch EpochWithGap(6258387423461376) >= prev epoch EpochWithGap(6258387423461376)
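
The panic message suggests that compaction expects the epochs of a given user key to arrive in strictly decreasing order, so two entries with the same user key and the same epoch trip the check. The following is a minimal Rust sketch of that kind of check, not the actual code in `key.rs`. `FullKey`, `KeyOrderChecker`, and `check_stream` are simplified, hypothetical stand-ins, and the ascending user-key ordering is an assumption.

```rust
use std::cmp::Ordering;

/// Simplified stand-in for a full key: table id, table key bytes, epoch.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FullKey {
    table_id: u32,
    table_key: Vec<u8>,
    epoch: u64,
}

/// Hypothetical checker that remembers the previously observed key.
#[derive(Default)]
struct KeyOrderChecker {
    prev: Option<FullKey>,
}

impl KeyOrderChecker {
    /// Assumed invariant: user keys ascend, and for an equal user key
    /// the epoch must be strictly smaller than the previous one.
    fn observe(&mut self, key: FullKey) -> Result<(), String> {
        if let Some(prev) = &self.prev {
            let order = (prev.table_id, &prev.table_key).cmp(&(key.table_id, &key.table_key));
            match order {
                Ordering::Greater => {
                    return Err(format!("user key out of order: {:?} after {:?}", key, prev));
                }
                Ordering::Equal if key.epoch >= prev.epoch => {
                    return Err(format!(
                        "key {:?} epoch {} >= prev epoch {}",
                        (key.table_id, &key.table_key),
                        key.epoch,
                        prev.epoch
                    ));
                }
                _ => {}
            }
        }
        self.prev = Some(key);
        Ok(())
    }
}

/// Feeds a merged key stream through the checker and stops at the first violation.
fn check_stream(keys: &[FullKey]) -> Result<(), String> {
    let mut checker = KeyOrderChecker::default();
    for key in keys {
        checker.observe(key.clone())?;
    }
    Ok(())
}

fn main() {
    // Table id, table key bytes, and epoch taken from the panic log above.
    let entry = FullKey {
        table_id: 13,
        table_key: vec![0x00, 0x00, 0x00, 0x01, 0x31, 0x32, 0, 0, 0, 0, 0, 0, 0x02],
        epoch: 6258387423461376,
    };
    // The same full key seen twice, as if read from two different SSTs.
    match check_stream(&[entry.clone(), entry]) {
        Ok(()) => println!("key stream is ordered"),
        Err(msg) => println!("violation: {msg}"),
    }
}
```

Under this reading, the check itself is behaving as designed; the question is why the same full key was written twice.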

rw_internal_tables:


select * from rw_internal_tables where id = 13;
-[ RECORD 1 ]------------------+-------------------------------------------------------------------------------
id                             | 13
name                           | __internal_****_2_source_0
schema_id                      | 7
owner                          | 1
definition                     | 
acl                            | {****}
initialized_at                 | 2024-04-09 13:45:42+00:00
created_at                     | 2024-04-09 13:45:42+00:00
initialized_at_cluster_version | PostgreSQL 13.14.0-RisingWave-1.8.0 (96c76cae54de990d310d243018dfd4b054118e3e)
created_at_cluster_version     | PostgreSQL 13.14.0-RisingWave-1.8.0 (96c76cae54de990d310d243018dfd4b054118e3e)

- rw_hummock_sstables. All SSTs related to table 13 are in level 0, and only sub_level `6258454417244160` (the epoch in the panic log) has two SSTs. **That means two CNs wrote files in the same checkpoint epoch for this table id, which is strange because a dedicated cdc source has a parallelism of one, so only one CN should be writing data to table 13 in this epoch.**

 compaction_group_id | level_id |   sub_level_id   | sstable_id | file_size
---------------------+----------+------------------+------------+-----------
 ....
                   2 |        0 | 6258387423461376 |      23250 |       788
                   2 |        0 | 6258387423461376 |      23240 |       788
 ....


- The sst dump of SSTs `23250` and `23240` shows that each of the two SSTs contains only a single entry: `FullKey { UserKey { 13, TableKey { 00000001313200000000000002 } }, epoch: 6258387423461376, epoch_with_gap: 6258387423461376, spill_offset: 0}, len=25`. The full sst dump result can be found [here](https://risingwave-labs.slack.com/archives/C064SBT0ASF/p1712741584405339?thread_ts=1712737224.141439&cid=C064SBT0ASF).
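
The sub-level check described above can be sketched in Rust as a grouping over rows of the `rw_hummock_sstables` listing. This is only an illustration: `SstRow` and `crowded_sub_levels` are hypothetical names, the fields mirror the column headers shown (with `file_size` left out because it is not needed), and the rows are assumed to be already filtered to SSTs that contain the table of interest.

```rust
use std::collections::BTreeMap;

/// One row of the listing; field names mirror the column headers.
#[derive(Debug, Clone, Copy)]
struct SstRow {
    compaction_group_id: u64,
    level_id: u32,
    sub_level_id: u64,
    sstable_id: u64,
}

/// Hypothetical helper: returns every (group, level, sub-level) that holds
/// more than one SST, together with the sorted SST ids found there.
fn crowded_sub_levels(rows: &[SstRow]) -> BTreeMap<(u64, u32, u64), Vec<u64>> {
    let mut by_sub_level: BTreeMap<(u64, u32, u64), Vec<u64>> = BTreeMap::new();
    for row in rows {
        by_sub_level
            .entry((row.compaction_group_id, row.level_id, row.sub_level_id))
            .or_default()
            .push(row.sstable_id);
    }
    // Keep only sub-levels with at least two SSTs.
    by_sub_level.retain(|_, ssts| ssts.len() > 1);
    for ssts in by_sub_level.values_mut() {
        ssts.sort_unstable();
    }
    by_sub_level
}

fn main() {
    // The two rows shown in the listing above.
    let rows = [
        SstRow { compaction_group_id: 2, level_id: 0, sub_level_id: 6258387423461376, sstable_id: 23250 },
        SstRow { compaction_group_id: 2, level_id: 0, sub_level_id: 6258387423461376, sstable_id: 23240 },
    ];
    for ((group, level, sub_level), ssts) in crowded_sub_levels(&rows) {
        println!("group {group} level {level} sub_level {sub_level}: SSTs {ssts:?}");
    }
}
```

For a table written by a single actor, any sub-level reported here would be worth inspecting with an sst dump.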

hzxa21 commented 5 months ago

@yezizp2012 @StrikeW We may need to check whether this is a bug in cdc source or in meta. If there is a race in actor assignment in meta, it may affect other use cases as well.

tcodehuber commented 4 months ago

I hit the same issue after rebooting the RisingWave cluster. The following error log was found in the compactor pod:

thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1066:21:
key UserKey { 246, TableKey { 00000001312d30000000000003 } } epoch EpochWithGap(6317061581963264) >= prev epoch EpochWithGap(6317061581963264)
2024-04-20T15:34:02.883393802Z INFO risingwave_storage::hummock::compactor::compactor_runner: Ready to handle compaction group 2 task: 147635 compact_task_statistics CompactTaskStatistics { total_file_count: 44, total_key_count: 46, total_file_size: 25004, total_uncompressed_file_size: 24590 } target_level 0 compression_algorithm 0 table_ids [76, 86, 96, 106, 111, 131, 141, 196, 201, 226, 231, 246, 256] parallelism 1

hzxa21 commented 2 months ago

Another occurrence that is probably related to this issue: https://risingwave-community.slack.com/archives/C03BW71523T/p1719592780560509

zwang28 commented 1 month ago

Another occurrence in v1.10.0-rc3. Because the cluster has already been reset, we don't know what kind of table was affected.

hzxa21 commented 1 month ago

Another occurrence: https://risingwave-community.slack.com/archives/C03BW71523T/p1722607407932359?thread_ts=1719592780.560509&cid=C03BW71523T
