Is your feature request related to a problem?
When a Spark streaming job reads from an Iceberg table, the stream terminates with an error if rows in the table have been updated or deleted. Specifically, Iceberg throws an exception when it encounters an overwrite snapshot, which causes the Spark streaming job to fail.
What solution would you like?
The Flint index refresh job should be able to handle updated or deleted rows gracefully by using the options streaming-skip-overwrite-snapshots=true and streaming-skip-delete-snapshots=true to avoid termination. These options should be set by default for use cases involving streaming and incremental updates, allowing the job to continue processing without manual intervention.
If we pursue this approach, we need to determine how to elegantly configure the source operator when creating a streaming job. Currently, we have the FlintSparkSourceRelationProvider, which is primarily used for query rewriting. Additionally, we should consider configuring other defaults, such as maxFilesPerTrigger, which can help speed up progress and generate results more quickly for Flint materialized view refreshes.
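One way to picture the proposed default behavior is a small helper that merges the two Iceberg options into whatever the user already supplied, with user values taking precedence. This is only a sketch: the helper name `with_streaming_defaults` is hypothetical and is not part of Flint or FlintSparkSourceRelationProvider, and where Flint would actually apply the options is left open.

```python
from typing import Dict, Optional

# The two Iceberg streaming read options named in this request.
# Treating them as defaults is the proposal, not current behavior.
ICEBERG_STREAMING_DEFAULTS: Dict[str, str] = {
    "streaming-skip-overwrite-snapshots": "true",
    "streaming-skip-delete-snapshots": "true",
}


def with_streaming_defaults(
    user_options: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return the proposed defaults overlaid with any user-supplied options.

    Hypothetical helper: user values win, so a user can still opt out
    by explicitly setting an option to "false".
    """
    merged = dict(ICEBERG_STREAMING_DEFAULTS)
    merged.update(user_options or {})
    return merged


# Possible usage with an existing SparkSession `spark` (the table name is
# an assumption for illustration):
#   spark.readStream.format("iceberg") \
#       .options(**with_streaming_defaults()) \
#       .load("db.iceberg_table")
```

Keeping the user-supplied values last in the merge means the defaults never override an explicit choice, which matters if someone wants the job to fail on overwrite snapshots.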
What alternatives have you considered?
Alternatively, ensure that these options are well documented and easily discoverable, so that users can set them manually through extraOptions in the index options of the create index statement and do not miss this critical step. Clear guidance would help users avoid job failures caused by unhandled overwrite or delete snapshots.
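For the manual route, documentation could show what the option value looks like. The sketch below builds an `extra_options` fragment for the WITH clause of a create index statement. It assumes that `extra_options` takes a JSON object keyed by the qualified source table name, which should be checked against the Flint index documentation. The helper name and the table name `glue.default.iceberg_logs` are made up for illustration.

```python
import json
from typing import Dict


def extra_options_clause(source_table: str, options: Dict[str, str]) -> str:
    """Render an extra_options fragment for the WITH clause.

    Assumption: the value is a JSON object mapping the qualified source
    table name to the options passed to the streaming source.
    """
    payload = json.dumps({source_table: options}, sort_keys=True)
    return f"extra_options = '{payload}'"


# Hypothetical table name, used only to show the shape of the result.
clause = extra_options_clause(
    "glue.default.iceberg_logs",
    {
        "streaming-skip-overwrite-snapshots": "true",
        "streaming-skip-delete-snapshots": "true",
    },
)
print(clause)
```

The printed fragment would then be placed alongside the other index options, such as the auto refresh setting, in the statement's WITH clause.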