pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0

Apache Arrow Flight SQL connector #21056

Open backkem opened 3 years ago

backkem commented 3 years ago

Feature Request

Describe the feature you'd like: Hi all, I'm wondering if there would be interest in supporting an Apache Arrow Flight connector. This transport can enable faster data retrieval and higher throughput by reducing (de)serialization and data copying. It also gives you strict typing on query results. Further down the line, it may also allow query processing to be distributed without relaying the query results through a coordinator node.

Describe alternatives you've considered: None

Teachability, Documentation, Adoption, Migration Strategy: Apache Arrow Flight is a well-documented protocol with implementations in many languages. Naturally, this would be an additive feature, alongside the existing MySQL/ODBC connector.

winkyao commented 3 years ago

@backkem Hi, do you mean that TiDB would support Apache Arrow Flight as a new client protocol to replace the MySQL protocol?

winkyao commented 3 years ago

And do you have any interest in developing this feature with us?

backkem commented 3 years ago

Hi @winkyao. Yes, but I would add it in addition to the MySQL protocol. The MySQL protocol has broad support in existing tooling, so I would not phase it out. However, if you're writing a new service and want increased performance, or any of the other benefits, you could switch over to the Apache Arrow Flight protocol. The Flight connector could also be more 'native' if the data is stored in the Apache Arrow format. I read that TiDB's internal data format may already closely resemble it.

I'd love to help you build this, but I'll have to find the time, especially if I were to land it on my own. It may also be a good idea to reach out to the Arrow community; for example, I'm not sure if they have a standard for querying yet.

winkyao commented 3 years ago

@backkem Thanks for your suggestion. We will try to reach out to the Arrow community and find a way to cooperate with them.

backkem commented 3 years ago

I reached out and found out there is work being done on a Flight SQL Proposal:

I'm sure some of the contributors here can provide valuable feedback on the design. Maybe we can experiment with an implementation as well.

winkyao commented 3 years ago

@zz-jason Could you please take a look at these designs?

zz-jason commented 3 years ago

@backkem Thank you for your suggestion!

In TiDB:

After reading the blog about Arrow Flight and the proposal about Arrow Flight SQL, maybe we could:

zz-jason commented 3 years ago

Judging from the Implementation Status in Apache Arrow, it seems we first need Flight RPC support for Go.

backkem commented 2 years ago

The first implementations of Flight SQL are shipping in Arrow 7.0.0: article. This doesn't include a Go port yet, though.

backkem commented 1 year ago

It looks like InfluxDB now has full support for Flight SQL: blog post.

backkem commented 9 months ago

Linking the Go Flight SQL package and server implementation example.

backkem commented 8 months ago

One major difference between the current MySQL connector and Arrow Flight is that the former is connection-based, while the latter uses a more stateless request/response design (gRPC).

Looking at the code, this may mean it makes more sense not to use the current session implementation and instead create a separate implementation that uses the Parser / Compiler / Executor directly. That being said, the RecordSet is already closely inspired by Apache Arrow.
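
To make the request/response point concrete, here is a minimal, stdlib-only sketch of the two-step flow that Flight SQL clients follow: a planning call (GetFlightInfo) returns a ticket, and a separate fetch call (DoGet) redeems it. The names `Ticket`, `getFlightInfo`, and `doGet` are hypothetical stand-ins, not the Arrow Go API, and encoding the query text directly in the ticket is an assumption made only to show that no connection-bound state has to survive between the two calls.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// Ticket is an opaque token the client hands back to fetch results.
// Assumption for illustration: it simply encodes the query text, so the
// server keeps no per-connection state between the two calls.
type Ticket string

// getFlightInfo mimics the planning call: it accepts a query and returns
// a ticket that any server node could later redeem.
func getFlightInfo(query string) Ticket {
	return Ticket(base64.StdEncoding.EncodeToString([]byte(query)))
}

// doGet mimics the fetch call: it decodes the ticket and "executes" the
// query. The stand-in executor only echoes the query; a real server would
// stream Arrow record batches instead.
func doGet(t Ticket) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(string(t))
	if err != nil {
		return "", fmt.Errorf("invalid ticket: %w", err)
	}
	return "result of: " + string(raw), nil
}

func main() {
	ticket := getFlightInfo("SELECT 1")
	out, err := doGet(ticket)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints "result of: SELECT 1"
}
```

Because each call carries everything the server needs, the two calls do not have to arrive on the same connection, which is the main contrast with a MySQL-style session.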

backkem commented 8 months ago

Looking into it more, both the Compiler and Executor have a significant dependency on the sessionctx. This would either have to be unraveled or an ephemeral session could be created.
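
The ephemeral-session option could look roughly like the sketch below: each request builds a short-lived session context, runs the query through it, and discards it. Everything here is hypothetical and stdlib-only; `sessionCtx`, `newSessionCtx`, `execute`, and `handleRequest` are illustrative names and do not correspond to TiDB's actual sessionctx package or its Compiler/Executor signatures.

```go
package main

import "fmt"

// sessionCtx is a hypothetical stand-in for the per-session state that the
// compiler and executor depend on. It is NOT TiDB's sessionctx.Context.
type sessionCtx struct {
	vars   map[string]string
	closed bool
}

// newSessionCtx creates a fresh context with default variables.
func newSessionCtx() *sessionCtx {
	return &sessionCtx{vars: map[string]string{"sql_mode": "default"}}
}

// Close marks the context as released.
func (s *sessionCtx) Close() { s.closed = true }

// execute stands in for the Parser -> Compiler -> Executor pipeline,
// which needs a live session context to run.
func execute(s *sessionCtx, query string) (string, error) {
	if s.closed {
		return "", fmt.Errorf("session closed")
	}
	return fmt.Sprintf("executed %q with sql_mode=%s", query, s.vars["sql_mode"]), nil
}

// handleRequest creates a session for exactly one request and discards it
// afterwards, so nothing carries over to the next request.
func handleRequest(query string) (string, error) {
	s := newSessionCtx()
	defer s.Close()
	return execute(s, query)
}

func main() {
	out, err := handleRequest("SELECT 1")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The trade-off is that session setup cost is paid on every request, whereas unraveling the dependency would avoid it at the price of a larger refactor.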

backkem commented 5 months ago

I created a very basic POC for this. You can find the code here: