opensearch-project / opensearch-spark

Spark Accelerator framework: it enables secondary indices to remote data stores.
Apache License 2.0

[FEATURE] Reduce latency in writing query result index #375

Open dai-chen opened 2 weeks ago

Is your feature request related to a problem?

While investigating Issue 368, we observed a delay of approximately 5 seconds between the completion of a Spark SQL statement and the subsequent await monitor API call. The logic executed during this period consists primarily of writing the result document to the result index in OpenSearch.

The following suspects were identified:

  1. FlintJob writes to the result index by constructing a Spark data frame, even though successful DDL statements always yield an empty result.
  2. The data frame uses the Flint data source, although the destination index is a regular OpenSearch index.
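  3. FlintWriter uses the wait_for refresh policy, which blocks the write until the next index refresh makes the document visible. With the default 1-second refresh interval, this adds up to a 1-second delay when no refresh is already pending.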

What solution would you like?

  1. Write the result doc directly using the OpenSearch client, as it's always a single document with the result data pulled into driver memory.
  2. Force a refresh instead of using wait_for, provided the impact is acceptable after verification.
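
As a rough illustration of the two proposals above, the sketch below builds a single index request against the OpenSearch REST API with the refresh policy passed as a query parameter. It is not the Flint implementation: the helper name build_result_write, the index name query_results, the base URL, and the fields of the result document are all hypothetical, and only the Python standard library is used so that no client library is assumed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_result_write(base_url, index, doc, refresh="true"):
    """Build (but do not send) an index request for one result document.

    refresh="true" forces an immediate refresh of the affected shards;
    refresh="wait_for" blocks until the next scheduled refresh instead.
    """
    if refresh not in ("true", "false", "wait_for"):
        raise ValueError(f"unsupported refresh policy: {refresh}")
    query = urlencode({"refresh": refresh})
    url = f"{base_url.rstrip('/')}/{index}/_doc?{query}"
    return Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical result document for a successful DDL statement (empty result).
result_doc = {"status": "SUCCESS", "result": [], "schema": []}

# Assumed endpoint and index name, for illustration only.
request = build_result_write("http://localhost:9200", "query_results", result_doc)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it to a running cluster. Switching the refresh argument between "true" and "wait_for" is one way to measure how much of the delay comes from the refresh policy alone, before deciding whether a forced refresh is acceptable.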

What alternatives have you considered?

N/A

Do you have any additional context?

We need to verify whether a similar issue exists in FlintREPL, as the delay could have a more significant impact on direct REPL queries.