risingwavelabs / risingwave

Scalable SQL engine for event streams and time series data. Build live dashboards, event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch processing. PostgreSQL compatible.

Feat: Batch ingest iceberg/file source #14742

Open chenzl25 opened 6 months ago

chenzl25 commented 6 months ago

Is your feature request related to a problem? Please describe.

According to the RFC "Combine Historical and Incremental Data", we need to support ingesting data from an external source (e.g. an Iceberg or file source) as historical data. This is a typical bulk-loading scenario, which is expected to be faster than streaming the same data from Kafka.
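To make the intended workflow concrete, below is a rough sketch of the "historical + incremental" pattern. The source name and the assumption that it can be scanned by a batch INSERT ... SELECT are illustrative only, not the final design.

```sql
-- Illustrative sketch only: the batch-readable external source and the exact
-- syntax for reading it are assumptions, not implemented behavior.

-- Target table that will hold both historical and incremental data.
CREATE TABLE orders (
    order_id BIGINT,
    amount   DOUBLE PRECISION,
    ts       TIMESTAMP
);

-- Bulk-load historical data from the external source (e.g. Iceberg or
-- Parquet/CSV files) with a one-shot batch read instead of replaying it
-- through Kafka.
INSERT INTO orders
SELECT order_id, amount, ts
FROM historical_source;  -- hypothetical batch-readable Iceberg/file source

-- Incremental data then continues to flow into the same table from the
-- existing streaming source (e.g. Kafka).
```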

To support this feature, we need to:

Performance Improvement:

Others:

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

lmatz commented 6 months ago

Just trying to provide the complete context: the details of the POC user request can be found here: https://www.notion.so/risingwave-labs/optimize-parquet-source-for-batch-load-dc498a043d504621bf56461690b14bd7?d=84ebdf5d7469412680278059c5898be8

In short, if implementing the batch Iceberg source takes too much time due to its complexity, a Parquet file source with decent performance is good enough to help move the POC forward. The user will consider switching to RW only if RW's Iceberg batch source is fast enough.

chenzl25 commented 6 months ago

In short, if implementing the batch Iceberg source takes too much time due to its complexity, a Parquet file source with decent performance is good enough to help move the POC forward. The user will consider switching to RW only if RW's Iceberg batch source is fast enough.

Must the file format be Parquet? Is it possible to use a CSV file, which is already supported by our file source? If it is OK to test with a CSV file first, we can support batch reads from the file source first and measure the performance. BTW, I tested INSERT ... SELECT from one RisingWave table into another last week. Can we just compare streaming load from Kafka into a table against INSERT ... SELECT from one table into another?
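For reference, a minimal sketch of the two setups being compared might look like the following; the schema, topic, and connector options are placeholders rather than the actual benchmark configuration.

```sql
-- (a) Streaming load: a table backed by a Kafka connector ingests rows
--     continuously (placeholder topic, broker, and options).
CREATE TABLE t_streaming (v BIGINT) WITH (
    connector = 'kafka',
    topic = 'bench_topic',
    properties.bootstrap.server = 'broker:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;

-- (b) Batch load: copy already-ingested rows from one RisingWave table into
--     another with a single INSERT ... SELECT.
CREATE TABLE t_batch (v BIGINT);
INSERT INTO t_batch SELECT v FROM t_source;  -- t_source already holds the data
```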

lmatz commented 6 months ago

Is it possible to use a CSV file, which is already supported by our file source? If it is OK to test with a CSV file first,

Good point. I think it is OK, since CSV is a less efficient format than Parquet in terms of read and write performance. If we achieve decent enough performance with CSV files, we can be even faster with Parquet. I think that is a convincing argument for the POC user.
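For context, a CSV file source for such a test might be declared roughly as follows; the bucket, region, and option names here are assumptions for illustration, not a verified configuration.

```sql
-- Rough sketch of a CSV file source on S3 (option names are assumptions).
CREATE SOURCE csv_bench_source (v BIGINT) WITH (
    connector = 's3',
    s3.region_name = 'us-east-1',
    s3.bucket_name = 'bench-bucket',
    match_pattern = '*.csv'
) FORMAT PLAIN ENCODE CSV (
    without_header = 'true',
    delimiter = ','
);
```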

BTW, I tested INSERT ... SELECT from one RisingWave table into another last week. Can we just compare streaming load from Kafka into a table against INSERT ... SELECT from one table into another?

I can try to communicate this first. The closer to the user's real use case, the better, but there is definitely nothing wrong with using what we have at the moment. Could you post the link to last week's results? 🙏 I am also thinking of adding this to the daily performance tests.

chenzl25 commented 6 months ago

I can try to communicate this first. The closer to the user's real use case, the better, but there is definitely nothing wrong with using what we have at the moment. Could you post the link to last week's results? 🙏 I am also thinking of adding this to the daily performance tests.

https://github.com/risingwavelabs/risingwave/pull/14630#issuecomment-1895954674

chenzl25 commented 5 days ago

https://github.com/risingwavelabs/risingwave/issues/15784