zwang28 opened 6 months ago
I found several suspicious things:
- In ddl_controller_v2.rs drop_object, the catalog is dropped first (here) before unregister_table_ids_fail_fast is called (here).
- In drop_streaming_jobs_v2, unregister_table_ids_fail_fast will not be called if removed_actors is empty.
- In drop_streaming_job_v1, the catalog is dropped first (here) before unregister_table_fragments_vec is called (here).
- unregister_table_fragments_vec is called at the end of drop_streaming_jobs_impl. This means that if an error happens earlier (e.g. barrier_scheduler.run_command(Command::DropStreamingJobs) fails due to CN failure or long barrier latency), we may miss unregistering the table ids; see the sketch after this list.
The recovery process should always take care of the clearing, e.g. even if drop_streaming_jobs_impl returns early.
However, recovery does the clearing based on dirty fragments. If the set of dirty fragments is empty (i.e. they were cleared before recovery), then the dirty table is not cleared in Hummock. :thinking:
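For illustration, a small sketch of why fragment-driven cleanup can miss this case. This is hypothetical code, not the actual recovery logic; tables_to_unregister is an invented helper showing that diffing Hummock's registered table IDs against the catalog would still find the leftover table even after the dirty fragments are gone:

```rust
use std::collections::HashSet;

// Hypothetical helper: tables Hummock still tracks but the catalog no longer
// knows about, i.e. the leftovers that fragment-based cleanup can miss.
fn tables_to_unregister(
    hummock_registered: &HashSet<u32>,
    catalog_tables: &HashSet<u32>,
) -> Vec<u32> {
    hummock_registered
        .difference(catalog_tables)
        .copied()
        .collect()
}

fn main() {
    let hummock: HashSet<u32> = HashSet::from([1, 2, 3]);
    let catalog: HashSet<u32> = HashSet::from([1, 3]); // table 2 was dropped
    let dirty_fragments: Vec<u32> = vec![]; // already cleared before recovery

    // Fragment-driven recovery sees nothing to do ...
    assert!(dirty_fragments.is_empty());
    // ... yet table 2 is still registered in Hummock.
    assert_eq!(tables_to_unregister(&hummock, &catalog), vec![2]);
}
```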
refactored in #15852
The issue still occurs in v1.8.
Failed to fetch filter key extractor tables [1,2,3]. [2] may be removed by meta-service
One cause is explained in #16511
Describe the bug
We've encountered this bug several times.
In the meta node, whenever a table is removed from the catalog, the hummock manager must be notified as well; otherwise, it can lead to errors like the Failed to fetch filter key extractor tables one quoted above. So I suspect there is a corner case during the dropping of a streaming job that doesn't follow this rule.
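As a sketch of that rule (hypothetical names, not the actual meta-node API), the two steps could be paired so that once the catalog entry is removed, the hummock notification runs even when an intermediate teardown step fails:

```rust
// Hypothetical stand-ins, not the actual meta-node API.
async fn drop_from_catalog(_table_id: u32) -> anyhow::Result<()> { Ok(()) }
async fn run_teardown_barrier(_table_id: u32) -> anyhow::Result<()> { Ok(()) }
async fn unregister_from_hummock(_table_id: u32) -> anyhow::Result<()> { Ok(()) }

async fn drop_table(table_id: u32) -> anyhow::Result<()> {
    drop_from_catalog(table_id).await?;
    // Once the catalog entry is gone, the hummock manager must hear about it
    // even if intermediate teardown fails, so capture that result instead of
    // returning early with `?`.
    let barrier_result = run_teardown_barrier(table_id).await;
    unregister_from_hummock(table_id).await?;
    barrier_result
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    drop_table(42).await
}
```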
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
No response