Open jovezhong opened 10 months ago
/bounty $500
/attempt #505
with your implementation plan/claim #505
in the PR body to claim the bounty/livestream
once liveThank you for contributing to timeplus-io/proton!
Add a bounty • Share on socials
Attempt | Started (GMT+0) | Solution |
---|---|---|
🟢 @0xa48rx394r83e9 | May 19, 2024, 12:37:04 AM | WIP |
Do you have an example of External Streams/Tables for read/write to something?
Hi @tgprj , you can check this PR for external table to ClickHouse for read/write https://github.com/timeplus-io/proton/pull/546 This will be merged in next couple hours/days. Feel free to take that as a reference. For streaming read/write, we prefer using external stream
. For batch read/write, make them as external table
@jovezhong do you only insert to your external table / stream? or do you also update and delete?
Currently no UPDATE or DELETE in Proton external stream or external table.
The main use case to read all data or new data from other systems, or write Proton SQL results to target systems. No need to handle the case to send a DELETE, UPDATE, or UPSERT to target system. But if you can do so as an advanced feature, that'll be wonderful for sure.
I can either use the Python library or the Java library, both from Apache.
Which do you prefer?
We'd like to have a pure C++ implementation, not sure whether it's possible with the REST catalog. Proton is designed to be a single binary, without 3rd party dependencies.
Writing a client in C++ requires deep knowledge of iceberg, and it would take some amount of time to do correctly, definitely more time than the current bounty can afford.
Hi @tgprj , we understand having a pure C++ implementation of Iceberg client is not an easy task. At this moment, this is a nice-to-have feature, so we'd like to invite the community to have a try, with some bounty as a token to appreciate the effort.
Seems interesting. I have experience with c++ so should be doable.
/attempt #505
Look forward to your contribution, @0xa48rx394r83e9 Feel free to share any question/progress in this ticket. Or join our community slack https://timeplus.com/slack
Today I'm planning to get boiler stuff in place; tomorrow, I will provide a progress report and ask any clarifying questions that arise. (Mainly because I'll be able to work much more on this tomorrow.)
Okay, after doing preliminary research into the proton source and reading about Apache Iceberg, I have a few clarifying questions:
How would you like me to proceed with this in terms of project structure? I'm guessing you'd want this under '/src/iceberg' (I'm basing this on the project source because things seem to have changed since #546 was merged).
Apologies for the spew of questions.
Hi @0xa48rx394r83e9 , you can start with Parquet as the first data format to support read and write. For the rest of C++ coding questions, I defer this to @zliang-min or @chenziliang . Please note May 18-20 is a long weekend in Canada. We may reply your question on May 21.
Just checking in on the last two clarifications. As of now, I've created a minimal implementation for parquet, but I'm waiting for those final two clarifications, as they will have a notable impact on how I proceed.
Hi @0xa48rx394r83e9, sorry for the late reponse. I would suggest to put all iceberg related code in src/Storages/ExternalTable/Iceberg
for now, we can relocate the code whenever we need to in the future.
src/Formats
probably does not contain anything to help with this feature. src/Processors/Formats
contains code for the parquet input/output format implementation, which could help. IO
does contain useful code for handling IO, like the read / write buffers, it's worth taking a look at. src/Client
is not a generic TCP client library, thus I don't think it will be useful in this case.
Use case
Essentially we are building the best streaming SQL engine for live data. Apache Kafka and API-compatiable streaming platforms are well supported in Proton and Timeplus Cloud.
Apache Iceberg is a popular open table format for analytic datasets. It will be great to have Proton support Apache Iceberg out of box:
We welcome the community to contribute more integrations for Proton with other systems. This ticket is specifically about the integration with Apache Iceberg via Proton External Stream or External Table.
Technical Challenges At this moment, there is no solid C++ library for Iceberg read/write. Whenever possible, we prefer a pure C++ implementation, without starting a new Java process