timeplus-io / proton

A stream processing engine and database, and a fast and lightweight alternative to ksqlDB and Apache Flink, 🚀 powered by ClickHouse
https://timeplus.com
Apache License 2.0

Add ADBC driver support for Arrow Flight SQL #856

Closed vishwamartur closed 3 days ago

vishwamartur commented 1 week ago

Related to #276

Add support for ADBC (Arrow Database Connectivity) driver for Arrow Flight SQL.

/claim #276

CLAassistant commented 1 week ago

CLA assistant check
All committers have signed the CLA.

algora-pbc[bot] commented 1 week ago

💵 To receive payouts, sign up on Algora, link your Github account and connect with Stripe.

jovezhong commented 1 week ago

Thanks for the PR. We will be reviewing it shortly.

jovezhong commented 6 days ago

Hi @vishwamartur,

Thanks for the PR. I am checking with our engineering team to see who will be the best person to look into the implementation details. What I am expecting

I will arrange some blog/video around the ADBC/Arrow support when the PR is merged.

Hope it makes sense, and feel free to let us know your thoughts.

vishwamartur commented 6 days ago

Hi @jovezhong,

Thank you for the detailed feedback and suggestions!

To start, we’d like to focus on fully implementing and stabilizing the ADBC driver support in C++. Once the C++ implementation is complete and meets the required performance and functionality benchmarks (e.g., large result set handling, streaming SQL), we can then plan to extend support to other languages like Go, Java, Python, and R.

This phased approach will allow us to ensure a solid foundation before expanding to other ecosystems. Let me know if this sounds good, or if you have any immediate priorities that require parallel development in other languages.

Thanks! @vishwamartur

jovezhong commented 6 days ago

Sounds good. Let's make the C++ driver the first feature-complete ADBC driver, then expand to more languages. In priority order, from highest to lowest: C++ > Java > Python > Go. You don't need to work on an R adapter. Ideally we contribute the ADBC driver for Timeplus, similar to https://arrow.apache.org/adbc/current/driver/postgresql.html

zeroshade commented 6 days ago

Looking at this PR, it doesn't appear to have much to do with ADBC in anything but name. Does Timeplus already support Arrow FlightSQL? If so, then there's nothing that needs to be done, as all of the ADBC bindings would be able to use the FlightSQL driver to connect and query data from any one of multiple languages (Go, C++, C, Python, R, Rust, Java, etc.)

If Timeplus doesn't already support FlightSQL, then you need to implement the ADBC C interface to create a driver, ideally as a shared object library that can be separately distributed as a client rather than built into Timeplus directly. I can help with that if needed.
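To illustrate the first case: if a server did expose a Flight SQL endpoint, a Python client could use the existing Flight SQL ADBC driver without any Timeplus-specific code. This is a minimal sketch, assuming the `adbc-driver-flightsql` and `pyarrow` packages are installed. The function name `query_arrow` is hypothetical, and the `connect` parameter is only there so the function can be exercised without a live server.

```python
def query_arrow(uri, sql, connect=None):
    """Run `sql` against a Flight SQL endpoint and return the result as Arrow data."""
    if connect is None:
        # Imported lazily so this sketch can be loaded without the driver installed.
        import adbc_driver_flightsql.dbapi as flight_sql
        connect = flight_sql.connect

    conn = connect(uri)
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            # The ADBC DB-API layer can hand results back as an Arrow table
            # instead of a list of row tuples.
            return cur.fetch_arrow_table()
        finally:
            cur.close()
    finally:
        conn.close()
```

A call would look like `query_arrow("grpc://<host>:<port>", "SELECT 1")`, where the host and port are placeholders for whatever Flight SQL endpoint is available.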

jovezhong commented 6 days ago

Thanks, Matt, for the comment. Today the Timeplus Proton server doesn't have FlightSQL built in. I'll leave further discussion to you and @vishwamartur.

To be clear, we want ADBC support more than FlightSQL.

zeroshade commented 6 days ago

I just want to clarify: @vishwamartur, is the goal here to have an ADBC driver for connecting to Timeplus? Or for Timeplus to connect to other sources via ADBC? That will affect what is expected to be implemented here.

zeroshade commented 6 days ago

@jovezhong just to be clear, if Timeplus exposes a Flight SQL server for connectivity, you would get ADBC support for free via the Flight SQL ADBC (and ODBC/JDBC) driver that already exists.

That said, I believe you already are built on ClickHouse, so it shouldn't be too difficult to create an ADBC driver which can use the ClickHouse protocol for connecting and retrieving Arrow formatted data, right?

vishwamartur commented 6 days ago

Hi @zeroshade,

Thanks for the clarification! The goal is to create an ADBC driver for clients to connect to Timeplus. Leveraging the ClickHouse protocol to retrieve Arrow-formatted data makes sense, given our architecture.

If you have any specific suggestions for implementing the ADBC C interface or designing the driver as a shared library, I’d greatly appreciate it.

Looking forward to your thoughts!

Best,
Vishwa

zliang-min commented 6 days ago

@vishwamartur I might have missed something, but looking at the PR, I don't see how this lets someone create an ADBC driver to connect to Timeplus Proton. Could you help me understand how this works, please?

zliang-min commented 6 days ago

@jovezhong just to be clear, if Timeplus exposes a Flight SQL server for connectivity, you would get ADBC support for free via the Flight SQL ADBC (and ODBC/JDBC) driver that already exists.

That said, I believe you already are built on ClickHouse, so it shouldn't be too difficult to create an ADBC driver which can use the ClickHouse protocol for connecting and retrieving Arrow formatted data, right?

@zeroshade yes, Proton also has Arrow format support, as ClickHouse does, but there are gaps because the implementation is not up to date with the ClickHouse repo at the moment. This might or might not have an impact on implementing an ADBC driver (I don't know much about implementing an ADBC driver). I also don't know whether the ADBC interface already supports streaming. Since Proton is a streaming data engine, this is one thing to pay attention to when implementing a database driver for it.
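On the streaming point: ADBC hands query results back as a stream of Arrow record batches that the client pulls one at a time, so a client does not have to wait for a complete result set. Whether that fits Proton's unbounded queries would still need to be verified against a real driver. Below is a minimal sketch of bounded consumption using only the standard library. `take_batches` and `fake_streaming_query` are hypothetical names, and the generator merely stands in for whatever batch reader a driver would return.

```python
from itertools import islice


def take_batches(batches, max_batches):
    """Pull at most `max_batches` items from a possibly endless batch stream."""
    # islice stops pulling after max_batches items, so an unbounded source is safe.
    return list(islice(batches, max_batches))


def fake_streaming_query():
    """Stand-in for a streaming query: yields small batches forever."""
    n = 0
    while True:
        yield {"batch": n, "rows": [n * 2, n * 2 + 1]}
        n += 1


first_three = take_batches(fake_streaming_query(), 3)
print([b["batch"] for b in first_three])  # [0, 1, 2]
```

The point of the sketch is that a client of a streaming engine has to consume incrementally and decide for itself when to stop, because a "fetch everything" call would never return on an unbounded query.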

vishwamartur commented 5 days ago

[image]

@zeroshade, could you please suggest any changes?

@zliang-min, if I’m mistaken, I would appreciate your guidance and suggestions for improvements. I’ll do my best to implement them.

zliang-min commented 5 days ago

@vishwamartur to achieve the goal of being able to connect to timeplus proton via an ADBC driver, there are two options:

The second option allows maximum availability and makes it easier to integrate with the existing ecosystem. The first option is probably easier, but it has big limitations (it limits which languages can be used, and it's hard to make use of what is already there in the ecosystem).

Hopefully this makes sense.

zeroshade commented 5 days ago

The second option allows maximum availability and makes it easier to integrate with the existing ecosystem. The first option is probably easier, but it has big limitations (it limits which languages can be used, and it's hard to make use of what is already there in the ecosystem).

It actually doesn't limit the languages as much as you'd expect. For example, the current ADBC FlightSQL driver is implemented in Go and distributed as a C shared object that can be loaded by ADBC driver managers. If you implement the Go ADBC Interface, then it's a simple case to use the existing SDK to create a distributable driver that can be easily loaded by any ADBC driver manager.
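As a rough illustration of what "loaded by any ADBC driver manager" looks like from Python: the driver manager is given the path of a shared library plus the options to forward to it. This sketch assumes the `adbc-driver-manager` package. The library name `libadbc_driver_timeplus.so`, the `uri` option key, and the `timeplus://` scheme are all hypothetical, since no such driver exists in this thread.

```python
def driver_manager_args(library_path, uri):
    """Build keyword arguments for adbc_driver_manager.dbapi.connect()."""
    return {
        # Shared object that implements the ADBC C API (hypothetical file).
        "driver": library_path,
        # Options forwarded to the driver; the "uri" key is an assumption.
        "db_kwargs": {"uri": uri},
    }


def connect_via_driver_manager(library_path, uri):
    """Load the shared library through the driver manager and open a connection."""
    # Imported lazily so this sketch loads without the package installed.
    import adbc_driver_manager.dbapi as adbc
    return adbc.connect(**driver_manager_args(library_path, uri))


args = driver_manager_args(
    "/opt/drivers/libadbc_driver_timeplus.so", "timeplus://localhost"
)
```

The same shared library could be loaded by driver managers in other languages, which is why the implementation language of the driver itself matters less than it first appears.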

@zeroshade, could you please suggest any changes?

I would argue that ADBC Driver belongs in the same box as SDK, JDBC/ODBC and Data/BI Connectors. An ADBC driver is just another driver, similar in concept to a JDBC or ODBC driver (but columnar and Arrow-native instead of row-oriented).
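The row-oriented versus columnar distinction can be shown with plain Python data. This is only an illustration of the two layouts, not of how JDBC, ODBC, or Arrow store data internally; the sample values and the `rows_to_columns` helper are made up.

```python
# Row-oriented: one tuple per record, as a JDBC/ODBC-style cursor would return.
names = ["id", "label", "value"]
rows = [
    (1, "a", 0.5),
    (2, "b", 1.5),
    (3, "c", 2.5),
]


def rows_to_columns(names, rows):
    """Transpose row tuples into one list per column."""
    columns = zip(*rows) if rows else [()] * len(names)
    return {name: list(col) for name, col in zip(names, columns)}


# Columnar: one contiguous sequence per field, which is the shape Arrow uses.
columns = rows_to_columns(names, rows)
print(columns["value"])  # [0.5, 1.5, 2.5]
```

With the columnar layout, an operation over a single field touches one sequence instead of visiting every record.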

vishwamartur commented 2 hours ago

I’ve made the changes in this pull request. Could you please review them and share your suggestions? I’m happy to make any necessary updates.