opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices to remote data stores.
Apache License 2.0

Abstracting source relations for enhanced covering index rewriting #391

Closed dai-chen closed 2 days ago

dai-chen commented 1 week ago

Description

This PR introduces new abstractions for the covering index query rewriter so that different kinds of source table relations can be matched and rewritten. This paves the way for future support of Iceberg table relations.

PR Planned

Changes

Added new FlintSparkSourceRelationProvider and FlintSparkSourceRelation abstractions. Please see the Scaladoc for details on their responsibilities.

ApplyFlintSparkSkippingIndex and FlintSparkValidationHelper.isTableProviderSupported will be refactored to use these abstractions in the future.
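For readers who have not opened the diff, the general shape of a provider/relation split can be sketched as below. This is a minimal, Spark-free illustration, not the code in this PR: `PlanNode`, `FileRelation`, `OtherNode`, `SourceRelation`, `SourceRelationProvider`, `FileSourceRelationProvider`, `SourceRelations` and every method name are hypothetical stand-ins chosen for the example. The actual traits operate on Spark logical plans and are documented in the Scaladoc.

```scala
// Hypothetical stand-in for a Spark logical plan node, so the sketch
// runs without a Spark dependency.
sealed trait PlanNode
final case class FileRelation(table: String, columns: Seq[String]) extends PlanNode
final case class OtherNode(description: String) extends PlanNode

// Uniform view of a source table, independent of how the table is stored.
trait SourceRelation {
  def tableName: String
  def output: Seq[String]
}

// One provider per kind of source table (file-based today, possibly Iceberg later).
trait SourceRelationProvider {
  def name: String
  def isSupported(plan: PlanNode): Boolean
  def getRelation(plan: PlanNode): SourceRelation
}

// Provider for file-based relations (assumed name, for illustration only).
object FileSourceRelationProvider extends SourceRelationProvider {
  val name = "file"

  def isSupported(plan: PlanNode): Boolean = plan.isInstanceOf[FileRelation]

  def getRelation(plan: PlanNode): SourceRelation = plan match {
    case FileRelation(t, cols) =>
      new SourceRelation {
        val tableName: String = t
        val output: Seq[String] = cols
      }
    case other =>
      throw new IllegalArgumentException(s"Unsupported plan: $other")
  }
}

object SourceRelations {
  // Registry of known providers; supporting a new table format means adding an entry here.
  val providers: Seq[SourceRelationProvider] = Seq(FileSourceRelationProvider)

  // Returns the relation from the first provider that can match the plan node, if any.
  def find(plan: PlanNode): Option[SourceRelation] =
    providers.find(_.isSupported(plan)).map(_.getRelation(plan))
}
```

With this shape, the rewriter only asks the registry whether some provider matches a sub plan, so adding another table format would not require touching the rewrite rule itself.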


Testing

spark-sql> CREATE INDEX all ON myglue.ds_tables.http_logs
         > (
         >   `@timestamp`,
         >   clientip,
         >   request,
         >   status,
         >   size
         > );

scala> sc.setLogLevel("INFO")
scala> sql("EXPLAIN SELECT clientip FROM myglue.ds_tables.http_logs WHERE status != 200").show

# Logging explains whether and why the index is applied
24/05/03 17:51:17 INFO FlintSparkSourceRelationProvider: Loaded source relation providers [file]
24/05/03 17:51:17 INFO ApplyFlintSparkCoveringIndex: Provider [file] can match sub plan LogicalRelation
24/05/03 17:51:18 INFO ApplyFlintSparkCoveringIndex: Found covering index [flint_myglue_ds_tables_http_logs_all_index] on table myglue.ds_tables.http_logs
24/05/03 17:51:18 INFO ApplyFlintSparkCoveringIndex:
 Is covering index flint_myglue_ds_tables_http_logs_all_index applicable: true
   Index state: Some(active)
   Index filter condition: None
   Columns required: Set(clientip, status)
   Columns indexed: Set(@timestamp, request, size, clientip, status)
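The fields in the log above suggest how the applicability decision could be read. The sketch below is a simplified illustration under stated assumptions, not the rule's actual implementation: `CoveringIndexInfo`, `CoveringIndexCheck` and `Example` are hypothetical names, and it assumes an index is usable only when its state is active, it has no filter condition, and every required column is indexed.

```scala
// Hypothetical summary of the index metadata printed in the log.
final case class CoveringIndexInfo(
    name: String,
    state: Option[String],
    filterCondition: Option[String],
    indexedColumns: Set[String])

object CoveringIndexCheck {
  // Assumed conditions: active state, no partial-index filter,
  // and the query's required columns are a subset of the indexed columns.
  def isApplicable(index: CoveringIndexInfo, requiredColumns: Set[String]): Boolean =
    index.state.contains("active") &&
      index.filterCondition.isEmpty &&
      requiredColumns.subsetOf(index.indexedColumns)
}

object Example {
  // Values copied from the log output above.
  val index: CoveringIndexInfo = CoveringIndexInfo(
    name = "flint_myglue_ds_tables_http_logs_all_index",
    state = Some("active"),
    filterCondition = None,
    indexedColumns = Set("@timestamp", "request", "size", "clientip", "status"))

  // The query selects clientip and filters on status.
  val applicable: Boolean =
    CoveringIndexCheck.isApplicable(index, Set("clientip", "status"))
}
```

Under these assumptions `Example.applicable` is `true`, matching the logged result; a query that also needed a column outside the indexed set would fall back to the source table.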

Issues Resolved

https://github.com/opensearch-project/opensearch-spark/issues/298

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.