To implement the approach outlined above, we need to integrate the OpenSearch table concept into SparkSQL by:

- Supporting the `CREATE TABLE` statement for the OpenSearch catalog.
- Supporting the `INSERT` statement for OpenSearch tables.

In addition to these primary tasks, several supplementary efforts are required, such as handling data type mapping between SparkSQL and OpenSearch, as discussed in issue #699. Given the complexity of these tasks, I am currently exploring alternative approaches that avoid exposing the OpenSearch table concept for now, such as the `COPY` command previously proposed in https://github.com/opensearch-project/opensearch-spark/issues/129#issuecomment-2387009768.
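For illustration only, the sketch below shows roughly what these two statements might look like once an OpenSearch catalog is plugged into SparkSQL. The catalog name `dev`, the table `logs`, the columns, and the `USING opensearch` provider are hypothetical placeholders rather than settled syntax.

```sql
-- Hypothetical sketch: register an OpenSearch index as a table in an
-- OpenSearch catalog named `dev` (catalog, schema, and column names are illustrative).
CREATE TABLE dev.default.logs (
  ts     TIMESTAMP,
  level  STRING,
  msg    STRING
)
USING opensearch;

-- Hypothetical sketch: load data from a Spark source table into the OpenSearch table.
INSERT INTO dev.default.logs
SELECT ts, level, msg
FROM spark_catalog.default.raw_logs;
```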
Is your feature request related to a problem?
Currently, there is no way to load data from a source table to an OpenSearch index while controlling the number of rows in each batch. Users are forced to rely on covering indexes or materialized views for data loading, but these only allow controlling the number of files or the total byte size (Spark 4.0) per refresh.
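For context, a minimal sketch of how data loading looks today with a Flint covering index (table, column, and index names are illustrative). The only batching knobs available here are the underlying Spark streaming read limits, such as a maximum file count per micro-batch (or, in Spark 4.0, a byte-size cap), with no way to cap the number of rows.

```sql
-- Current approach (sketch): a Flint covering index with auto refresh.
-- Table, column, and index names are illustrative.
CREATE INDEX request_logs_idx ON glue.default.request_logs (
  request_time, client_ip, status
)
WITH (
  auto_refresh = true
);
-- Each refresh is bounded only by file-count or byte-size read limits of the
-- underlying streaming source, never by an explicit number of rows.
```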
What solution would you like?
To achieve row-based batching, I propose allowing users to utilize a low-level `INSERT` statement directly on the OpenSearch table. This would enable users to control the number of rows loaded in each batch by specifying row ranges, similar to the sketch below:
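As a rough illustration only (the table names, the monotonically increasing `id` column, and the range bounds are hypothetical, not proposed syntax), each `INSERT` could cover an explicit row range:

```sql
-- Hypothetical sketch: load a bounded range of rows per batch by filtering on
-- a monotonically increasing id column in the source table.
INSERT INTO dev.default.logs
SELECT ts, level, msg
FROM spark_catalog.default.raw_logs
WHERE id BETWEEN 1 AND 10000;

-- Next batch covers the following 10,000 rows, and so on.
INSERT INTO dev.default.logs
SELECT ts, level, msg
FROM spark_catalog.default.raw_logs
WHERE id BETWEEN 10001 AND 20000;
```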
What alternatives have you considered?
Under certain conditions, such as when no filtering is applied or filtering is limited to partitions, it may be possible to implement row-based control within covering indexes or materialized views. However, it's essential to evaluate whether this approach aligns with the intended behavior and design of Flint index refresh.
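If that direction were pursued, one could imagine a materialized view whose filtering touches only partition columns gaining a row-based batch limit. The sketch below is purely speculative: the option `max_rows_per_refresh` does not exist in Flint today, and all names are illustrative.

```sql
-- Speculative sketch: a Flint materialized view filtered only on a partition
-- column, extended with a row-based batch limit. The option
-- `max_rows_per_refresh` is hypothetical and not supported today.
CREATE MATERIALIZED VIEW request_logs_mv
AS
SELECT request_time, client_ip, status
FROM glue.default.request_logs
WHERE year = 2024                    -- filtering limited to a partition column
WITH (
  auto_refresh = true,
  max_rows_per_refresh = 10000       -- hypothetical row-count cap per refresh
);
```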
Do you have any additional context?