risingwavelabs / risingwave

Scalable SQL engine for event streams and time series data. Build live dashboards, event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch processing. PostgreSQL compatible.

Feat: Batch ingest iceberg/file source #14742

Open chenzl25 opened 6 months ago

chenzl25 commented 6 months ago

Is your feature request related to a problem? Please describe.

According to the RFC "Combine Historical and Incremental Data", we need to support ingesting data from an external source (e.g. an Iceberg or file source) as historical data. This is a typical bulk-loading scenario, which is expected to be faster than streaming the same data from Kafka.
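To make the intended workflow concrete, below is a rough sketch of the "historical + incremental" pattern. The source name and the assumption that it can be scanned by a batch INSERT ... SELECT are illustrative only, not the final design.

```sql
-- Illustrative sketch only: the batch-readable external source and the exact
-- syntax for reading it are assumptions, not implemented behavior.

-- Target table that will hold both historical and incremental data.
CREATE TABLE orders (
    order_id BIGINT,
    amount   DOUBLE PRECISION,
    ts       TIMESTAMP
);

-- Bulk-load historical data from the external source (e.g. Iceberg or
-- Parquet/CSV files) with a one-shot batch read instead of replaying it
-- through Kafka.
INSERT INTO orders
SELECT order_id, amount, ts
FROM historical_source;  -- hypothetical batch-readable Iceberg/file source

-- Incremental data then continues to flow into the same table from the
-- existing streaming source (e.g. Kafka).
```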

To support this feature, we need to:

Performance Improvement:

Others:

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

lmatz commented 6 months ago

Just trying to provide the complete context: the details of the POC user request can be found here: https://www.notion.so/risingwave-labs/optimize-parquet-source-for-batch-load-dc498a043d504621bf56461690b14bd7?d=84ebdf5d7469412680278059c5898be8

In short, if implementing the batch Iceberg source takes too much time due to its complexity, a Parquet file source with decent performance is good enough to help move the POC forward. The user will consider switching to RW only if RW's Iceberg batch source is fast enough.

chenzl25 commented 6 months ago

In short, if implementing the batch Iceberg source takes too much time due to its complexity, a Parquet file source with decent performance is good enough to help move the POC forward. The user will consider switching to RW only if RW's Iceberg batch source is fast enough.

Must the file format be Parquet? Is it possible to use a CSV file, which is already supported by our file source? If it is OK to test with a CSV file first, we can support batch reads from the file source first and measure the performance. BTW, I tested INSERT ... SELECT from one RisingWave table into another last week. Can we just compare streaming load from Kafka into a table against INSERT ... SELECT from one table into another?
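For reference, a minimal sketch of the two setups being compared might look like the following; the schema, topic, and connector options are placeholders rather than the actual benchmark configuration.

```sql
-- (a) Streaming load: a table backed by a Kafka connector ingests rows
--     continuously (placeholder topic, broker, and options).
CREATE TABLE t_streaming (v BIGINT) WITH (
    connector = 'kafka',
    topic = 'bench_topic',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- (b) Batch load: copy already-ingested rows from one RisingWave table into
--     another with a single INSERT ... SELECT.
CREATE TABLE t_batch (v BIGINT);
INSERT INTO t_batch SELECT v FROM t_source;  -- t_source already holds the data
```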

lmatz commented 6 months ago

Is it possible to use a CSV file, which is already supported by our file source? If it is OK to test with a CSV file first,

Good point. I think it is OK, since CSV is a less efficient format than Parquet in terms of read and write performance. If we achieve decent enough performance with CSV files, we can be even faster with Parquet. I think that is a convincing argument for the POC user.
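For context, a CSV file source for such a test might be declared roughly as follows; the bucket, region, and option names here are assumptions for illustration, not a verified configuration.

```sql
-- Rough sketch of a CSV file source on S3 (option names are assumptions).
CREATE SOURCE csv_bench_source (v BIGINT) WITH (
    connector = 's3',
    s3.region_name = 'us-east-1',
    s3.bucket_name = 'bench-bucket',
    match_pattern = '*.csv'
) FORMAT PLAIN ENCODE CSV (
    without_header = 'true',
    delimiter = ','
);
```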

BTW, I tested INSERT ... SELECT from one RisingWave table into another last week. Can we just compare streaming load from Kafka into a table against INSERT ... SELECT from one table into another?

I can try to communicate this first. The closer to the user's real use case, the better, but there is definitely nothing wrong with using what we have at the moment. Could you post the link to last week's results? 🙏 I am also thinking of adding this to the daily performance tests.

chenzl25 commented 6 months ago

I can try to communicate this first. The closer to the user's real use case, the better, but there is definitely nothing wrong with using what we have at the moment. Could you post the link to last week's results? 🙏 I am also thinking of adding this to the daily performance tests.

https://github.com/risingwavelabs/risingwave/pull/14630#issuecomment-1895954674

chenzl25 commented 5 days ago

https://github.com/risingwavelabs/risingwave/issues/15784