[FEATURE] Enhance Alter index statement to support schema evolution

dai-chen commented 5 months ago

Is your feature request related to a problem?

Currently, the ALTER INDEX statement in Flint only supports changing index options, such as index refresh mode (refer to the documentation: ALTER INDEX Options). For table formats like Iceberg, which support schema evolution, there is no way for users to add more indexed columns without re-creating the index.

What solution would you like?

Introduce support for schema changes in the ALTER INDEX statement similar to the ALTER TABLE statement. This would include:

ALTER SKIPPING INDEX ON <tableName> ADD COLUMN ...
ALTER INDEX <indexName> ON <tableName> ADD COLUMN ...

What alternatives have you considered?

The only current alternative is for users to re-create the index whenever they need to add more indexed columns.

Do you have any additional context?

It is important to note that this change will impact the correctness of the query rewriter, as the new columns added to the Flint index will be missing in old data:

Skipping index: There is already null check to ensure that null fields do not skip anything. Need to double check if no impact on all skip types. Code: https://github.com/opensearch-project/opensearch-spark/blob/main/flint-spark-integration/src/main/scala/org/opensearch/flint/spark/skipping/valueset/ValueSetSkippingStrategy.scala#L57
Covering index: It's possible that a null column after covering index rewrite is non-null in the source data. This discrepancy needs to be addressed.

dai-chen commented 2 months ago

For MV, the typical use case is to support using new column in MV query when schema evolution on source table.

dai-chen commented 1 month ago

After a quick research, it appears that no major database vendors currently support altering the underlying query of a materialized view. As an alternative workaround, users can create a new materialized view with the updated query and preserve access to both old and new datasets by SQL UNION or OpenSearch index pattern.

Example:

-- Create the first materialized view
CREATE MATERIALIZED VIEW mv_1
AS
SELECT ...
GROUP BY TUMBLE(time, '1 Hour')
WITH (
  auto_refresh = true
);

-- Alter the materialized view to stop auto refresh (not necessary for manual refresh)
ALTER MATERIALIZED VIEW mv2
WITH (
  auto_refresh = false
);

/*
 * Identify where the old materialized view stops (e.g. last bucket is '2024-09-01 10:00:00').
 * Data after this point is either unprocessed or held in the state store as an open window pending output.
 * a) Stop right before '2024-09-01 11:00:00': the data will be processed by new materialized view
 * b) Stop in the middle of '2024-09-01 11:00:00': the data will be re-processed by new materialized view
 */

-- Create new materialized view from next time interval
CREATE MATERIALIZED VIEW mv_1
AS
SELECT ...
WHERE time > '2024-09-01 10:00:00'
GROUP BY TUMBLE(time, 'interval')
WITH (
  auto_refresh = true
);

-- Query both old and new materialized view data using index pattern in OpenSearch
POST /flint_mv_*/_search

opensearch-project / opensearch-spark

[FEATURE] Enhance Alter index statement to support schema evolution #387