opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[BUG] Flint index stuck in refreshing state after refresh job failure #361

Closed: dai-chen closed this issue 3 weeks ago

dai-chen commented 4 weeks ago

What is the bug?

The index state remains stuck in "refreshing" even after the associated streaming job has failed. This can confuse users and impede monitoring systems that rely on accurate state information to trigger alerts or initiate recovery processes.

How can one reproduce the bug?

Steps to reproduce the behavior:

  1. Create a Flint index to start streaming data.
  2. Introduce a failure scenario that causes the streaming job to terminate unexpectedly (e.g., make the Flint data index read-only).
  3. Observe that the index state is still reported as "refreshing" by the SHOW FLINT INDEX statement (see the sketch after these steps).
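
A minimal reproduction sketch, assuming Spark SQL with the Flint extension enabled; the table and column names are placeholders and the Flint SQL statements are approximate, not a verified end-to-end script:

```scala
// Hypothetical reproduction sketch: table/column names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("flint-refresh-failure-repro")
  // Assumes the Flint Spark extension is on the classpath.
  .config("spark.sql.extensions", "org.opensearch.flint.spark.FlintSparkExtensions")
  .getOrCreate()

// 1. Create a Flint index with auto refresh, which starts a streaming job.
spark.sql(
  """CREATE SKIPPING INDEX ON spark_catalog.default.http_logs
    |(status VALUE_SET)
    |WITH (auto_refresh = true)""".stripMargin)

// 2. Introduce a failure, e.g. mark the Flint data index read-only in OpenSearch,
//    so the next micro-batch write fails and the streaming job terminates.

// 3. The state reported here stays "refreshing" even though the job has failed.
spark.sql("SHOW FLINT INDEX IN spark_catalog.default").show(truncate = false)
```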

What is the expected behavior?

The expected behavior is that, upon streaming job failure, the index state should automatically update to "failed". This is important both for user clarity, so that the reported status accurately reflects operational reality, and for automated monitoring systems that manage alerts and recovery procedures based on the actual system state.

Do you have any screenshots?

N/A

Do you have any additional context?

Actually, the system already incorporates logic to transition the index state upon job failure, located at https://github.com/opensearch-project/opensearch-spark/blob/main/flint-spark-integration/src/main/scala/org/opensearch/flint/spark/FlintSparkIndexMonitor.scala#L109. However, this logic is part of a scheduled task that runs every minute, and the delay between runs could be why the index remains stuck in the "refreshing" state even though the underlying job has already failed.
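
For illustration only, a rough approximation of that minute-interval monitoring pattern; this is not the actual FlintSparkIndexMonitor code, and `isStreamingJobActive` / `markIndexFailed` are hypothetical helpers standing in for the real logic:

```scala
// Illustrative sketch of a periodic state-transition check, assuming a plain
// ScheduledExecutorService; helper functions here are hypothetical.
import java.util.concurrent.{Executors, TimeUnit}

object IndexStateMonitorSketch {

  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def startMonitoring(indexName: String,
                      isStreamingJobActive: String => Boolean,
                      markIndexFailed: String => Unit): Unit = {
    val task = new Runnable {
      override def run(): Unit = {
        // If the streaming job has died, transition the index state to "failed".
        if (!isStreamingJobActive(indexName)) {
          markIndexFailed(indexName)
        }
      }
    }
    // Runs once per minute, so a job failure can go unreflected in the index
    // state for up to ~60 seconds; if the driver exits before the next tick,
    // the state may never be updated at all, matching the behavior reported above.
    scheduler.scheduleAtFixedRate(task, 0, 60, TimeUnit.SECONDS)
  }
}
```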