Closed elbinpallimalilibm closed 3 months ago
Ah, so (1) this is for Presto consuming data from other sources, not Presto serving Arrow data over Flight, and (2) this is for generic Flight and not Flight SQL?
I ask because I believe services advertising Flight support generally do so via Flight SQL (examples: InfluxDB, Dremio, and the PostgreSQL adapter). Flight SQL is also an actual service definition, whereas Flight is more of a grab bag of patterns and suggestions for building a service.
Thanks for the proposal @elbinpallimalilibm, I think it sounds like a useful addition. You hit on the main benefits already: with Flight being a standard RPC protocol, this would open up Presto to connect with any data source that is reachable by a Flight endpoint. Performance-wise, there can definitely be benefits, such as when connecting to a PostgreSQL Flight endpoint where data is transferred in Arrow format. However, there are also cases where there might not be any benefit, e.g. when connecting to a JDBC-based data source, there will be overhead converting rows to columns.
Some questions:
- Will this include other SQL operations besides reading data, such as update, delete, etc?
- I think there need to be more details about the FlightDescriptor. I see you mention JSON; however, that puts a restriction on the Flight server to accept that format. There is also a lot that would need to be defined in order to use that command to implement the Presto connector interfaces. For example, how will it list tables?
- Have you considered using a FlightSQL client? The FlightSQL protocol defines a way for a client to interact with SQL-based data sources, so this provides a standard with commands to list tables, etc. already baked in. The downside is the Flight server would need to implement FlightSQL, not just core Flight.
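To make the FlightDescriptor question concrete, here is a minimal sketch of what a JSON command convention might look like. It uses plain Python with no Arrow library; in a real client the resulting bytes would be carried as the `cmd` of a command-type FlightDescriptor. The action names (`list_tables`, `read_table`) and the fields (`schema`, `table`, `columns`) are hypothetical and are not taken from the proposal or from Flight, which treats the command as opaque bytes.

```python
import json

# Hypothetical command vocabulary: Flight itself treats the command as
# opaque bytes, so client and server must agree on these names.
SUPPORTED_ACTIONS = {"list_tables", "read_table"}


def build_command(action, **params):
    """Client side: serialize a request into the bytes a descriptor would carry."""
    payload = {"action": action, **params}
    # sort_keys makes the encoding deterministic across calls
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def parse_command(cmd):
    """Server side: decode the bytes and reject actions it does not know."""
    payload = json.loads(cmd.decode("utf-8"))
    if payload.get("action") not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {payload.get('action')!r}")
    return payload


list_cmd = build_command("list_tables", schema="public")
read_cmd = build_command(
    "read_table", schema="public", table="orders", columns=["id", "total"]
)
```

Every server behind such a connector would have to understand this same vocabulary, which is the restriction raised above. FlightSQL avoids it by standardizing the metadata commands.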
Regarding FlightSQL:
My read on this proposal is that it's not really a connector, but a template to build a connector (similar to Presto's existing JDBC connector). Because Flight is not an actual service definition, as pointed out by @lidavidm, you need a template to fill in the blanks on how each particular Flight server chooses to expose its metadata.
While this limits the utility of connecting to existing systems, it still solves some problems.
1) Scheduling and execution remain essentially the same between FlightSQL and Flight. So we solve the problem of how to distribute the fetching of data among Presto workers.
2) There are services that expose their metadata through Flight, but without leveraging FlightSQL. Presto connectors often integrate with things that aren't SQL-like, such as service endpoints. This template allows developers to connect to these sources in a straightforward manner.
3) Because FlightSQL is a service definition built on top of Flight, we can build a more specialized FlightSQL connector on top of this template. So in a sense, this gets us part of the way there with FlightSQL.
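As a rough illustration of point 1: a Flight server answers a GetFlightInfo request with a list of endpoints, each carrying an opaque ticket and the locations that can serve it. The sketch below assumes a connector treats each endpoint as one split and hands splits out round-robin. The `Endpoint` class, the `assign_splits` helper, the host name, and the worker names are all hypothetical, and Presto's real scheduler is more involved than a round-robin loop.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Endpoint:
    """Stand-in for a Flight endpoint: an opaque ticket plus server locations."""
    ticket: bytes
    locations: tuple


def assign_splits(endpoints, workers):
    """Treat each endpoint as one split and hand splits out round-robin."""
    assignment = {worker: [] for worker in workers}
    for endpoint, worker in zip(endpoints, cycle(workers)):
        assignment[worker].append(endpoint)
    return assignment


# Five endpoints, as a server might return for one partitioned table.
endpoints = [
    Endpoint(f"part-{i}".encode(), ("grpc://flight-host:8815",))
    for i in range(5)
]
plan = assign_splits(endpoints, ["worker-1", "worker-2"])
```

Nothing in this step depends on whether the server speaks core Flight or FlightSQL, since both return endpoints and tickets in the same shape.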
So in short, I think there's value in starting to support Flight generically (as proposed here), with richer support for FlightSQL being added in a subsequent initiative. Thoughts?
Ok, that sounds reasonable to me. (...I'd like to get away from Flight itself being such a heavy wrapper over things, but that's a long ways away.)
@elharo
> High level meta comment: we'd like to move some of the existing connectors out of the main repo, and in general make builds more modular. This might make a good test project for that goal. Could this connector be implemented in a completely separate repository? If not, why not?
Just for your information, this is something we've discussed before and there's already alignment on pursuing this. I think we just need someone to spend the effort to move plugins out of the main repo and set up build pipelines for each. That being said, I don't think it's necessary or advisable to couple these two initiatives, as I think this effort deserves its own focus.
- This proposal is only for reading data. Insert, update, delete are not supported.
Just curious, is the limitation here that Presto's "connectors" are specifically for Presto connecting to other data sources rather than for consumers interacting with Presto?
Would it make sense to have a separate proposal to add Arrow FlightSQL as a way for users to interact with Presto itself and retrieve column-oriented data to avoid the column->row overhead? (Correct me if I'm wrong, but Presto is internally columnar for execution, right?)
@zeroshade as you mentioned, I think this proposal is for Presto itself to interact with Arrow-like systems. In the future, we could implement support for updates and deletes on top of this proposal.
A separate thing altogether is Arrow FlightSQL support to supplant, or replace, Presto's RESTful client protocol. This has been discussed before (see https://github.com/prestodb/presto/issues/19419), and I believe there's alignment to support that; it just needs someone to pick up this work and come up with a comprehensive design. That ought to be a separate RFC.
What this client-oriented RFC would buy us is that clients wouldn't have to go through the inconvenience of translating row-oriented JSON into Arrow buffers. However, I don't think there would be a major performance improvement from doing this, because Presto's client architecture is already bottlenecked by a single coordinator process, which limits parallelism and makes Presto not very suitable for very large reads. A larger effort might entail moving fetching from the coordinator to the workers, to give the client greater flexibility in the level of parallelism when fetching the data. This effort to increase read parallelism has long been planned. As part of a move toward Arrow Flight, this might entail exposing Tickets for each worker.
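A minimal sketch of what per-worker Tickets could mean for a client, assuming each ticket can be fetched independently of the others. The `fetch` function here is a placeholder that reads from an in-memory dict instead of calling a worker, and the ticket names and `fetch_all` helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the result partitions held by each worker,
# keyed by the ticket a client would present to that worker.
WORKER_RESULTS = {
    b"worker-1/part-0": [1, 2, 3],
    b"worker-2/part-1": [4, 5],
    b"worker-3/part-2": [6],
}


def fetch(ticket):
    """Placeholder for a per-ticket fetch against a single worker."""
    return WORKER_RESULTS[ticket]


def fetch_all(tickets, parallelism):
    """Fetch every ticket concurrently; the client chooses the parallelism."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map returns results in the same order as `tickets`
        parts = list(pool.map(fetch, tickets))
    return [value for part in parts for value in part]


values = fetch_all(sorted(WORKER_RESULTS), parallelism=3)
```

With a single coordinator, the equivalent of `parallelism` is effectively fixed at one, which is the bottleneck described above.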
I hope this context is helpful.
High level meta comment: we'd like to move some of the existing connectors out of the main repo, and in general make builds more modular. This might make a good test project for that goal. Could this connector be implemented in a completely separate repository? If not, why not?
Second meta comment, especially if this connector does need to belong to this repo: Presto has noticeable problems with connectors that were contributed and then abandoned by their initial implementers. How much support does this have, and from whom? How long can we count on that support?
IBM will continue to support and maintain the Arrow Flight connector.
Thanks for this RFC. It would be great if you could add more detail about the Velox/Prestissimo design as well.
Yes, we are working on the details for Prestissimo. Will add that here.
Added Prestissimo implementation details as well.
CC: @lidavidm would you or others be able to provide feedback on this approach from an Arrow perspective?