Open chenzl25 opened 10 months ago
Just to provide the complete context: the details of the POC user request can be found here: https://www.notion.so/risingwave-labs/optimize-parquet-source-for-batch-load-dc498a043d504621bf56461690b14bd7?d=84ebdf5d7469412680278059c5898be8
In short, if implementing the batch Iceberg source takes too much time due to its complexity, a Parquet file source with decent performance is good enough to move the POC forward. The user will consider switching to RW only if RW's Iceberg batch source is fast enough.
Must the file format be Parquet? Is it possible to use CSV, which our file source already supports? If it is OK to test with a CSV file first, we can support batch reads for the file source first and measure the performance. BTW, I tested `insert select` from one RisingWave table to another last week. Can we just compare streaming load from Kafka into a table with `insert select` from one table to another?
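For reference, the two workloads being compared could be set up roughly like this (a sketch only; the table names, topic name, and broker address are made up for illustration):

```sql
-- Streaming path: a table continuously fed by a Kafka source.
CREATE TABLE kafka_load (id BIGINT, v DOUBLE PRECISION) WITH (
    connector = 'kafka',
    topic = 'poc_topic',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- Batch path: copy an existing table into a new one and time the statement.
CREATE TABLE target (id BIGINT, v DOUBLE PRECISION);
INSERT INTO target SELECT id, v FROM source_table;
```

Timing the `INSERT INTO ... SELECT` against the steady-state Kafka ingestion rate gives a rough bulk-load vs. streaming-load comparison.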
> Is it possible to use a CSV file that has been supported in our file source? If it is ok to test a CSV file first,
Good point. I think it is OK, since CSV is a less efficient format than Parquet in terms of read and write performance. If we achieve decent performance for CSV files, we can be even faster with Parquet. I think that is a convincing argument for the POC user.
> BTW, I tested insert select from a RisingWave table to another RisingWave table last week. Can we just compare the streaming load from Kafka to a table with insert select from a table to another table?
I can try to communicate this first. The closer to the user's real use case, the better, but there is definitely nothing wrong with using what we have at the moment. Could you post the link to last week's results? 🙏 I am also thinking of adding this to the daily performance tests.
https://github.com/risingwavelabs/risingwave/pull/14630#issuecomment-1895954674
Is your feature request related to a problem? Please describe.
According to the RFC: Combine Historical and Incremental Data, we need to support ingesting data from an external source (e.g. an Iceberg or file source) as historical data. This is a typical bulk-loading scenario, which is expected to be faster than streaming data from Kafka.
To support this feature, we need to:
- Performance improvement:
- Others:
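As a rough sketch of what the user-facing flow might look like once this lands (hypothetical syntax; the connector options and `ENCODE PARQUET` are illustrative, since Parquet batch read is exactly what this issue requests):

```sql
-- Hypothetical: a Parquet file source on S3 used for historical data.
CREATE SOURCE historical_data (id BIGINT, v DOUBLE PRECISION) WITH (
    connector = 's3',
    s3.bucket_name = 'poc-bucket',
    s3.region_name = 'us-east-1',
    match_pattern = '*.parquet'
) FORMAT PLAIN ENCODE PARQUET;

-- Bulk-load the historical data into the table; the Kafka streaming
-- job then takes over for incremental data.
INSERT INTO target SELECT * FROM historical_data;
```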
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response