timeplus-io / proton

A stream processing engine and database, and a fast and lightweight alternative to ksqlDB and Apache Flink, 🚀 powered by ClickHouse
https://timeplus.com
Apache License 2.0
1.57k stars, 69 forks

External Streams/Tables to read/write Apache Iceberg #505

Open jovezhong opened 10 months ago

jovezhong commented 10 months ago

Use case

Essentially, we are building the best streaming SQL engine for live data. Apache Kafka and API-compatible streaming platforms are well supported in Proton and Timeplus Cloud.

Apache Iceberg is a popular open table format for analytic datasets. It would be great to have Proton support Apache Iceberg out of the box.

We welcome the community to contribute more integrations for Proton with other systems. This ticket is specifically about the integration with Apache Iceberg via Proton External Stream or External Table.

Technical Challenges

At this moment, there is no solid C++ library for Iceberg read/write. Whenever possible, we prefer a pure C++ implementation, without starting a new Java process.

jovezhong commented 10 months ago

/bounty $500

algora-pbc[bot] commented 10 months ago

💎 $500 bounty • Timeplus

Steps to solve:

  1. Start working: Comment /attempt #505 with your implementation plan
  2. Submit work: Create a pull request including /claim #505 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to timeplus-io/proton!

| Attempt | Started (GMT+0) | Solution |
| --- | --- | --- |
| 🟢 @0xa48rx394r83e9 | May 19, 2024, 12:37:04 AM | WIP |

tgprj commented 9 months ago

Do you have an example of External Streams/Tables for read/write to something?

jovezhong commented 9 months ago

Hi @tgprj, you can check this PR, which adds an external table for reading from and writing to ClickHouse (https://github.com/timeplus-io/proton/pull/546). It will be merged in the next couple of hours or days, so feel free to use it as a reference. For streaming read/write, we prefer an external stream; for batch read/write, use an external table.

tgprj commented 9 months ago

@jovezhong do you only insert into your external table/stream, or do you also update and delete?

jovezhong commented 9 months ago

Currently there is no UPDATE or DELETE support in Proton external streams or external tables.

The main use case is to read all data or new data from other systems, or to write Proton SQL results to target systems. There is no need to handle sending a DELETE, UPDATE, or UPSERT to the target system. But if you can do so as an advanced feature, that would be wonderful.

tgprj commented 9 months ago

I can either use the Python library or the Java library, both from Apache.

tgprj commented 9 months ago

Which do you prefer?

jovezhong commented 9 months ago

We'd like to have a pure C++ implementation, though I'm not sure whether that's possible with the REST catalog. Proton is designed to be a single binary, without third-party dependencies.

tgprj commented 9 months ago

Writing a client in C++ requires deep knowledge of Iceberg, and it would take a fair amount of time to do correctly, definitely more time than the current bounty can afford.

jovezhong commented 9 months ago

Hi @tgprj, we understand that a pure C++ implementation of an Iceberg client is not an easy task. At this moment, this is a nice-to-have feature, so we'd like to invite the community to give it a try, with a bounty as a token of appreciation for the effort.

0xa48rx394r83e9 commented 6 months ago

Seems interesting. I have experience with C++, so it should be doable.

0xa48rx394r83e9 commented 6 months ago

/attempt #505

jovezhong commented 6 months ago

Looking forward to your contribution, @0xa48rx394r83e9. Feel free to share any questions or progress in this ticket, or join our community Slack: https://timeplus.com/slack

0xa48rx394r83e9 commented 6 months ago

Today I'm planning to get the boilerplate in place; tomorrow, I will provide a progress report and ask any clarifying questions that arise. (Mainly because I'll be able to work much more on this tomorrow.)

0xa48rx394r83e9 commented 6 months ago

Okay, after doing preliminary research into the Proton source and reading about Apache Iceberg, I have a few clarifying questions:

  • What is the intended format from which we will read/write (Parquet, ORC, Avro, etc.), given that Iceberg is a specification rather than a single file format? (Although, looking at 'FormatFactory.cpp', are you expecting all of them?)
  • How would you like me to proceed with this in terms of project structure? I'm guessing you'd want this under '/src/iceberg' (I'm basing this on the project source because things seem to have changed since #546 was merged).
  • Is there anything that has already been written that might be related to this (to avoid code redundancy)? I've noticed that '/Formats', '/IO', and '/Client' include elements related to this. Also, I am unable to verify if there is any documentation for these because the documentation site is currently not loading.

Apologies for the spew of questions.

jovezhong commented 6 months ago

Hi @0xa48rx394r83e9, you can start with Parquet as the first data format to support for reading and writing. For the remaining C++ coding questions, I'll defer to @zliang-min or @chenziliang. Please note that May 18-20 is a long weekend in Canada, so we may reply to your questions on May 21.
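
For context on what "Parquet first" could mean for a reader, here is a minimal sketch of how an Iceberg scan resolves data files: table metadata points at a current snapshot, the snapshot references manifests, and each manifest entry names a data file with its format. The struct and function names below (`TableMetadata`, `planParquetScan`, and so on) are hypothetical and are not Proton or Iceberg APIs. The model is deliberately simplified: it lives in memory, it embeds manifests directly in the snapshot instead of going through a manifest list file, and it ignores delete files, partitioning, and schemas.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified in-memory model of Iceberg metadata (illustration only).
enum class FileFormat { Parquet, Orc, Avro };

// Manifest entry status codes as defined by the Iceberg spec.
enum class EntryStatus { Existing = 0, Added = 1, Deleted = 2 };

struct DataFile
{
    std::string file_path;
    FileFormat file_format;
    std::int64_t record_count;
};

struct ManifestEntry
{
    EntryStatus status;
    DataFile data_file;
};

// In a real table a manifest is an Avro file; here it is just a list of entries.
struct Manifest
{
    std::string manifest_path;
    std::vector<ManifestEntry> entries;
};

// In a real table a snapshot points to a manifest list file; embedded here for brevity.
struct Snapshot
{
    std::int64_t snapshot_id;
    std::vector<Manifest> manifests;
};

struct TableMetadata
{
    std::string location;
    std::optional<std::int64_t> current_snapshot_id;
    std::vector<Snapshot> snapshots;
};

// Collect the live Parquet data files of the current snapshot.
// Entries marked Deleted are skipped, as are files in other formats.
std::vector<DataFile> planParquetScan(const TableMetadata & table)
{
    std::vector<DataFile> files;
    if (!table.current_snapshot_id)
        return files; // empty table: no snapshot yet

    for (const auto & snapshot : table.snapshots)
    {
        if (snapshot.snapshot_id != *table.current_snapshot_id)
            continue;

        for (const auto & manifest : snapshot.manifests)
            for (const auto & entry : manifest.entries)
            {
                if (entry.status == EntryStatus::Deleted)
                    continue;
                if (entry.data_file.file_format != FileFormat::Parquet)
                    continue;
                files.push_back(entry.data_file);
            }
    }
    return files;
}
```

Under these assumptions, a batch read through an external table would open each returned `file_path` with a Parquet reader, and ORC or Avro data files could be added later by relaxing the format filter.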

0xa48rx394r83e9 commented 5 months ago

Just checking in on the last two clarifications. As of now, I've created a minimal implementation for Parquet, but I'm waiting for those final two clarifications, as they will have a notable impact on how I proceed.

zliang-min commented 5 months ago

Hi @0xa48rx394r83e9, sorry for the late response. I would suggest putting all Iceberg-related code in src/Storages/ExternalTable/Iceberg for now; we can relocate the code later if we need to.

src/Formats probably does not contain anything that helps with this feature. src/Processors/Formats contains the Parquet input/output format implementation, which could help. IO does contain useful code for handling I/O, such as the read/write buffers, so it's worth taking a look at. src/Client is not a generic TCP client library, so I don't think it will be useful in this case.