This PR addresses the issue where the index state incorrectly remains "refreshing" after a streaming job has failed. The fix transitions the index state, on a best-effort basis, before the Spark application exits. Ref: Flint index state transition diagram
Before the Changes
FlintJob waits on a global lock in StreamingQueryManager using the awaitAnyTermination API. The Spark StreamExecution first notifies all threads suspended on it, and only then triggers the listener and cleanup logic.
Consequently, it is possible for the main thread (FlintJob) to complete first, without waiting for the index monitor or for the listener and cleanup logic in the StreamExecution, since both run on daemon threads.
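The pre-fix race can be sketched with a minimal, hypothetical model (plain Java, not the actual Spark internals; all names are illustrative): the stream execution wakes its waiters before running its cleanup step, so a waiter that resumes in between observes the cleanup as unfinished.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical model of the pre-fix behavior: waiters are signaled (step 1)
// BEFORE listener/cleanup logic runs (step 2), so a non-daemon main thread
// can resume and exit while cleanup is still pending on a daemon thread.
public class TerminationRace {
    private final CountDownLatch terminated = new CountDownLatch(1);
    private volatile boolean cleanupDone = false;

    // Step 1: models StreamExecution notifying all suspended threads.
    public void signalWaiters() { terminated.countDown(); }

    // Step 2: models the listener and cleanup logic running afterwards.
    public void runCleanup() { cleanupDone = true; }

    // Models the main thread blocking on awaitAnyTermination: it may resume
    // between steps 1 and 2 and see cleanup as not yet done.
    public boolean awaitAndCheckCleanup() {
        try {
            terminated.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cleanupDone;
    }
}
```

Driving the two steps in order shows the window: a waiter released after `signalWaiters()` but before `runCleanup()` sees `cleanupDone == false`.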
After the Changes
A new awaitMonitor API in FlintSparkIndexMonitor has been introduced to suspend the caller thread (the main thread in FlintJob) and update the index state immediately upon resumption.
As a result, FlintJob now waits for a specific stream execution and is notified only after that StreamExecution completes all of its listener and cleanup logic.
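A minimal sketch of the awaitMonitor idea (plain Java; the class and method names are illustrative, not the actual FlintSparkIndexMonitor API): the latch is released only after the stream execution has finished all cleanup, and the resumed caller transitions the index state only on exceptional termination.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the stream thread signals completion only AFTER all
// listener/cleanup logic has run; the main thread then fixes up the state.
public class IndexMonitorSketch {
    private final CountDownLatch streamDone = new CountDownLatch(1);
    private final AtomicReference<String> indexState = new AtomicReference<>("refreshing");
    private volatile Throwable streamError = null;

    // Called by the stream execution thread once cleanup has fully completed.
    public void onStreamTerminated(Throwable error) {
        streamError = error;
        streamDone.countDown();
    }

    // Called by the main thread (the FlintJob role): blocks until the specific
    // stream finishes, then transitions the state if it failed.
    public void awaitMonitor() {
        try {
            streamDone.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        if (streamError != null) {
            indexState.set("failed");   // best-effort transition before exit
        }
        // On normal termination do nothing: the DROP/ALTER APIs own the transition.
    }

    public String getIndexState() { return indexState.get(); }
}
```

On exceptional termination the caller resumes and moves the state out of "refreshing"; on normal termination the state is left untouched for the calling API to manage.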
Sources that May Trigger the Termination of Stream Execution
Normal Termination: awaitMonitor does nothing upon termination to avoid conflicts. It is the calling API's responsibility to transition the index state in these cases:
a) DROP index API
b) ALTER index API (from auto to manual)
Exception Termination: it is possible that both the index monitor's scheduled task and awaitMonitor try to update the index state in case b) below. A retry has been added to ensure the transition succeeds.
a) Index monitor scheduled task (terminates the streaming execution when the OpenSearch cluster is unreachable)
b) Spark terminates streaming execution upon encountering an exception
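The retry added for the conflicting-writers case can be sketched as a simple bounded loop (hypothetical helper; `updateState` stands in for the real conditional index-state update, which returns false when another writer wins the race):

```java
import java.util.function.Supplier;

// Hypothetical sketch of retrying a state transition that can lose a race
// against a concurrent writer (e.g. the monitor's scheduled task vs.
// awaitMonitor both moving the index out of "refreshing").
public final class RetryingTransition {
    public static boolean transitionWithRetry(Supplier<Boolean> updateState, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (updateState.get()) {
                return true;            // transition applied
            }
            // Conflict: another writer updated the state first; back off briefly.
            try {
                Thread.sleep(50L * attempt);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;                   // give up after maxAttempts
    }
}
```

Bounding the attempts keeps the exit path best-effort: if every attempt conflicts, the application still terminates rather than blocking forever.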
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Testing
TODO
Issues Resolved
https://github.com/opensearch-project/opensearch-spark/issues/361