risingwavelabs / risingwave


bug: inconsistent table info between catalog and hummock in meta node #15144

Open zwang28 opened 6 months ago

zwang28 commented 6 months ago

Describe the bug

We've encountered this bug several times:

  1. Compactor nodes keep failing tasks because of inconsistent table info between the catalog and the hummock manager.
  2. After restarting the meta node, the issue is gone. That works because of the rectification process during meta node startup.

In the meta node, whenever a table is removed from the catalog, the hummock manager must be notified as well; otherwise it can lead to the issue above. So I suspect there is a corner case during dropping a stream job that doesn't follow this rule.
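
As a minimal sketch of that invariant, using hypothetical, simplified types rather than the real meta-node API: any drop path that removes the entry from the catalog but skips (or returns before) the unregister step leaves a stale id registered in hummock.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the meta-node catalog and hummock
// manager state; not the real RisingWave types.
struct Catalog {
    tables: HashSet<u32>,
}

struct HummockManager {
    registered_table_ids: HashSet<u32>,
}

impl Catalog {
    fn drop_table(&mut self, table_id: u32) {
        self.tables.remove(&table_id);
    }
}

impl HummockManager {
    fn unregister_table_ids(&mut self, table_ids: &[u32]) {
        for id in table_ids {
            self.registered_table_ids.remove(id);
        }
    }
}

// The invariant: whenever a table leaves the catalog, its id must also be
// unregistered from the hummock manager; otherwise the two views diverge and
// compactors see table ids for which no catalog info exists.
fn drop_table(catalog: &mut Catalog, hummock: &mut HummockManager, table_id: u32) {
    catalog.drop_table(table_id);
    hummock.unregister_table_ids(&[table_id]);
}
```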

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

hzxa21 commented 6 months ago

I found several suspicious things:

  1. In ddl_controller_v2.rs drop_object, the catalog is dropped first (here) before unregister_table_ids_fail_fast is called (here).
  2. In drop_streaming_jobs_v2, unregister_table_ids_fail_fast will not be called if removed_actors is empty.
  3. In drop_streaming_job_v1, the catalog is dropped first (here) before unregister_table_fragments_vec is called (here).
  4. unregister_table_fragments_vec is called at the end of drop_streaming_jobs_impl. This means that if an error happens earlier (e.g. barrier_scheduler.run_command(Command::DropStreamingJobs) fails due to CN failure or long barrier latency), we may fail to unregister the table ids (see the sketch after this list).
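
To illustrate item 4, the sketch below shows only the control-flow shape (hypothetical closures, not the real drop_streaming_jobs_impl or barrier scheduler): cleanup placed after a fallible step is skipped on an early return, and a generic drop guard is one way to make it run unconditionally.

```rust
// Control-flow shape only; hypothetical, not the real drop_streaming_jobs_impl.
fn drop_streaming_jobs_shape(
    run_drop_barrier: impl FnOnce() -> Result<(), String>,
    unregister_table_ids: impl FnOnce(),
) -> Result<(), String> {
    // If this step fails (e.g. CN failure or long barrier latency), `?` returns
    // here ...
    run_drop_barrier()?;
    // ... and the hummock cleanup below is never reached.
    unregister_table_ids();
    Ok(())
}

// One generic way to make the cleanup unconditional is a drop guard that runs
// on every exit path, including early returns.
struct CleanupGuard<F: FnMut()>(F);

impl<F: FnMut()> Drop for CleanupGuard<F> {
    fn drop(&mut self) {
        (self.0)();
    }
}

fn drop_streaming_jobs_guarded(
    run_drop_barrier: impl FnOnce() -> Result<(), String>,
    mut unregister_table_ids: impl FnMut(),
) -> Result<(), String> {
    let _guard = CleanupGuard(|| unregister_table_ids());
    run_drop_barrier()?; // an early return no longer skips the unregister
    Ok(())
}
```
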
zwang28 commented 6 months ago

> I found several suspicious things:
>
>   1. In ddl_controller_v2.rs drop_object, the catalog is dropped first (here) before unregister_table_ids_fail_fast is called (here).
>   2. In drop_streaming_jobs_v2, unregister_table_ids_fail_fast will not be called if removed_actors is empty.
>   3. In drop_streaming_job_v1, the catalog is dropped first (here) before unregister_table_fragments_vec is called (here).
>   4. unregister_table_fragments_vec is called at the end of drop_streaming_jobs_impl. This means that if an error happens earlier (e.g. barrier_scheduler.run_command(Command::DropStreamingJobs) fails due to CN failure or long barrier latency), we may fail to unregister the table ids.

The recovery process should always take care of the clearing, e.g. even when drop_streaming_jobs_impl returns early.

However, recovery does the clearing based on dirty fragments. If the set of dirty fragments is empty (i.e. it was already cleared before recovery), then the dirty tables are not cleared in Hummock. :thinking:
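
A minimal sketch of what a catalog-driven rectification could look like, assuming hypothetical names rather than the actual recovery code:

```rust
use std::collections::HashSet;

// Hypothetical sketch of a catalog-driven rectification at meta-node startup:
// compare hummock-registered table ids directly against the catalog, instead of
// deriving the cleanup only from dirty fragments (which may already be empty).
fn stale_hummock_table_ids(
    catalog_table_ids: &HashSet<u32>,
    hummock_registered_ids: &HashSet<u32>,
) -> Vec<u32> {
    // Any id registered in hummock but absent from the catalog is stale and
    // should be unregistered, whether or not a dirty fragment still exists.
    hummock_registered_ids
        .difference(catalog_table_ids)
        .copied()
        .collect()
}
```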

zwang28 commented 5 months ago

refactored in #15852

zwang28 commented 5 months ago

The issue still occurs in v1.8.

  1. drop table succeeded.
  2. The compactor raised the error: Failed to fetch filter key extractor tables [1,2,3]. [2] may be removed by meta-service.

One cause is explained in #16511.
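
For context on why the inconsistency surfaces on the compactor, here is a hedged sketch (illustrative names only, not the actual compactor code) of the filter-key-extractor fetch failing when a compaction task references a table id that is no longer in the catalog:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative only; not the actual compactor code. The filter key extractor
// needs catalog info for every table id referenced by a compaction task, so a
// table dropped from the catalog but never unregistered from hummock makes the
// fetch fail with an error like the one quoted above.
fn fetch_filter_key_extractors(
    task_table_ids: &HashSet<u32>,
    catalog_tables: &HashMap<u32, String>, // table id -> table info (stand-in)
) -> Result<Vec<String>, String> {
    let missing: Vec<u32> = task_table_ids
        .iter()
        .copied()
        .filter(|id| !catalog_tables.contains_key(id))
        .collect();
    if !missing.is_empty() {
        return Err(format!(
            "Failed to fetch filter key extractor tables {:?}, may be removed by meta-service",
            missing
        ));
    }
    Ok(task_table_ids
        .iter()
        .map(|id| catalog_tables[id].clone())
        .collect())
}
```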

github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄