Open kwannoel opened 1 month ago
@wenym1 Can you help take a look? I checked the relevant code paths and didn't find a clue on how this can happen.
meta: MV creation finishes at around 16:36:06
2024-08-07T16:35:47.098154218Z INFO risingwave_meta::rpc::ddl_controller: finish stream job succeeded id=1
2024-08-07T16:36:00.777084324Z INFO risingwave_meta::rpc::ddl_controller: finish stream job succeeded id=2
2024-08-07T16:36:06.114203651Z INFO risingwave_meta::rpc::ddl_controller: finish stream job succeeded id=4
frontend: Query fail at around 16:36:08
2024-08-07T16:36:08.819965654Z ERROR risingwave_frontend::scheduler::distributed::stage: Task TaskId { query_id: "ffaf7aeb-8a3e-4035-89c5-747bb7d5f707", stage_id: 1, task_id: 8 } failed, reason: "Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488"
2024-08-07T16:36:08.820039093Z ERROR risingwave_frontend::scheduler::distributed::stage: Task TaskId { query_id: "ffaf7aeb-8a3e-4035-89c5-747bb7d5f707", stage_id: 2, task_id: 4 } failed, reason: "Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488"
2024-08-07T16:36:08.820048701Z ERROR risingwave_frontend::scheduler::distributed::query: query stage failed error=Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488 query_id=QueryId { id: "ffaf7aeb-8a3e-4035-89c5-747bb7d5f707" } stage_id=1
2024-08-07T16:36:08.820286629Z ERROR pgwire::pg_protocol: error when process message error=Failed to run the query: Storage error: Hummock error: Expired Epoch: watermark 6934584757125120, epoch 6934584644927488
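For context, the "Expired Epoch" error above fires when a batch read's epoch falls below the version's safe-epoch watermark. A minimal sketch of that check (illustrative names, not RisingWave's actual code):

```rust
// Hedged sketch: a read at `read_epoch` against a Hummock version whose
// safe-epoch watermark has already advanced past it must be rejected,
// producing the "Expired Epoch" message seen in the frontend logs.
fn check_read_epoch(safe_epoch_watermark: u64, read_epoch: u64) -> Result<(), String> {
    if read_epoch < safe_epoch_watermark {
        Err(format!(
            "Expired Epoch: watermark {safe_epoch_watermark}, epoch {read_epoch}"
        ))
    } else {
        Ok(())
    }
}
```

Plugging in the values from the logs (watermark 6934584757125120, epoch 6934584644927488) reproduces exactly this failure.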
compactor: safe epoch is bumped to 6934584757125120 by compact task 23, which finished at 16:36:08
2024-08-07T16:36:07.022009424Z INFO risingwave_storage::hummock::compactor::compactor_runner: Ready to handle task: 23 compact_task_statistics CompactTaskStatistics { total_file_count: 8, total_key_count: 674, total_file_size: 21770, total_uncompressed_file_size: 21698 } compression_algorithm 0 parallelism 1 task_memory_capacity_with_parallelism 17342096, enable fast runner: false, Compaction task id: 23, group-id: 2, type: Dynamic, target level: 0, target sub level: 6934584478662656 watermark: 6934584757125120, target_file_size: 33554432, splits: 1
Input:
Level 0 ["[id: 157, obj_id: 138 0KB stale_ratio 0]", "[id: 158, obj_id: 140 0KB stale_ratio 0]"]
Level 0 ["[id: 155, obj_id: 90 5KB stale_ratio 80]"]
Level 0 ["[id: 147, obj_id: 75 2KB stale_ratio 0]", "[id: 148, obj_id: 77 0KB stale_ratio 0]", "[id: 150, obj_id: 96 2KB stale_ratio 0]", "[id: 151, obj_id: 98 0KB stale_ratio 0]", "[id: 153, obj_id: 88 7KB stale_ratio 66]"]
2024-08-07T16:36:08.021001239Z INFO risingwave_storage::hummock::compactor: running_parallelism_count=2 pull_task_ack=false pending_pull_task_count=4
2024-08-07T16:36:08.047053706Z INFO risingwave_storage::hummock::compactor: Finish Task Some(23) split_index Some(0) sst count 1
2024-08-07T16:36:08.047124571Z INFO risingwave_storage::hummock::compactor::compactor_runner: Finished compaction task in 1025.137239ms: Compaction task id: 23, group-id: 2, type: Dynamic, target level: 0, target sub level: 6934584478662656 watermark: 6934584757125120, target_file_size: 33554432, splits: 1, status: Success
Output:
SstableInfo: object id=128, SST id=128, KeyRange=["0000000500000080000018a2f7c09b0000","0000000500ff0080ff0018a2f7c09b0000"], table_ids: [5], size=8KB, stale_ratio=0%, bloom_filter_kind Sstable
It seems that after PR #5832, storage iterator creation was moved from `new_boxed_executor` to `do_execute`. The frontend unpins the epoch once the table scan stage becomes Scheduled. Does the table scan stage of a distributed query become Scheduled after `new_boxed_executor`, or after `do_execute`? If it is the former, the epoch can be unpinned before the storage version is pinned, since pinning happens during storage iterator creation. cc @BugenZhao @chenzl25
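The suspected race can be modeled with a toy pin manager (all names here are illustrative, not RisingWave's real API): the frontend pins an epoch, schedules the scan stage, and unpins; if a compact task bumps the watermark before `do_execute` re-pins on the compute side, the read fails.

```rust
use std::collections::HashMap;

// Hypothetical pin manager demonstrating the unpin-before-pin window.
#[derive(Default)]
struct PinManager {
    pins: HashMap<u64, usize>, // epoch -> pin count
    safe_epoch: u64,
}

impl PinManager {
    fn pin(&mut self, epoch: u64) -> Result<(), String> {
        if epoch < self.safe_epoch {
            return Err(format!(
                "Expired Epoch: watermark {}, epoch {}",
                self.safe_epoch, epoch
            ));
        }
        *self.pins.entry(epoch).or_insert(0) += 1;
        Ok(())
    }

    fn unpin(&mut self, epoch: u64) {
        if let Some(n) = self.pins.get_mut(&epoch) {
            *n -= 1;
            if *n == 0 {
                self.pins.remove(&epoch);
            }
        }
    }

    /// Compaction may only advance the safe epoch past epochs nobody pins.
    fn bump_safe_epoch(&mut self, watermark: u64) {
        let min_pinned = self.pins.keys().min().copied().unwrap_or(u64::MAX);
        self.safe_epoch = self.safe_epoch.max(watermark.min(min_pinned));
    }
}
```

Replaying the timeline from the logs with this model: frontend pins 6934584644927488, unpins at Scheduled, compact task 23 bumps the watermark to 6934584757125120, and the compute node's later pin attempt fails with Expired Epoch.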
I remember our `SeqScanExecutor` creates iterators sequentially during `execute()`, which means we can't unpin the epoch right after executor creation. Instead, we should keep the snapshot pinned until the end of query execution, the same way we handle lookup join.
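One way to sketch that proposal is an RAII guard held for the query's whole lifetime, so the snapshot is released only when execution finishes (the types here are hypothetical, not the actual frontend code):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hedged sketch of the proposed fix: instead of unpinning when the scan
// stage is Scheduled, the query owns a guard that keeps the snapshot
// pinned until the query is dropped, as is already done for lookup join.
struct SnapshotGuard {
    pinned_epochs: Rc<RefCell<Vec<u64>>>,
    epoch: u64,
}

impl SnapshotGuard {
    fn pin(pinned_epochs: Rc<RefCell<Vec<u64>>>, epoch: u64) -> Self {
        pinned_epochs.borrow_mut().push(epoch);
        SnapshotGuard { pinned_epochs, epoch }
    }
}

impl Drop for SnapshotGuard {
    // Runs only when the whole query finishes, releasing the pin.
    fn drop(&mut self) {
        let mut pins = self.pinned_epochs.borrow_mut();
        if let Some(i) = pins.iter().position(|&e| e == self.epoch) {
            pins.swap_remove(i);
        }
    }
}
```

Because the compactor can only advance the safe epoch past unpinned epochs, holding the guard across both executor creation and `execute()` closes the window described above.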
buildkite: https://buildkite.com/risingwavelabs/main-cron/builds/3046#01912d92-7eef-47eb-ae2a-9679177ef2a9
Error:
This occurs when running the following batch query after no-shuffle backfill: