prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Apache Arrow Connector #12201

Open ArvinDevel opened 5 years ago

ArvinDevel commented 5 years ago

I want to expose column data through the Arrow format and then process it with analytics tools. Is there any plan to ingest column data from Apache Arrow? And furthermore, to use Arrow vectors as the in-memory columnar data representation, to avoid the overhead of converting between multiple formats? Any point of view is appreciated.

sopel39 commented 5 years ago

I'm curious: why can't you use another open format, like ORC or Parquet?

ArvinDevel commented 5 years ago

Because I need to transfer lots of column data (as one stream, say) to the user quickly. ORC and Parquet are file formats, so the user would have to extract the data from them, which adds overhead, and before transferring the data I would need to create a dedicated file, which is too much to bear. I chose Arrow because the data the user actually wants is all in Arrow's plain vector layout, and Arrow is designed as a neutral in-memory format that is becoming more popular.

sopel39 commented 5 years ago

So do you want to import data into Presto, or export it from Presto?

ArvinDevel commented 5 years ago

I want to import data into Presto, so the user can use Presto for analysis.

kokosing commented 5 years ago

And you don't want to stage the data in the file system in another format, right?

How is your data exposed, then? How can Presto read it?

ArvinDevel commented 5 years ago

In my scenario, it's better to provide the Arrow messages to the outside. After importing the Arrow data, the user has full freedom to choose whatever format is most suitable for the final result. And if Presto had an interface to expose Arrow data, that would be even more attractive.

kokosing commented 5 years ago

it's better to provide the Arrow messages to the outside.

What do you mean? Does Apache Arrow expose some service (REST, for example) from which Presto could fetch data?

sopel39 commented 5 years ago

Would support for Arrow format in Hive connector be sufficient?

Praveen2112 commented 5 years ago

Actually, to read data from Arrow we just need the memory address. Arrow stores the data off-heap, so that should suffice to read table data in Arrow's format.

ArvinDevel commented 5 years ago

@kokosing I don't think Apache Arrow exposes a REST-like service currently, but it can expose a columnar data stream. We can operate on the ArrowRecordBatch or Vector.
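
For illustration, here is a minimal sketch of consuming such a stream with the Arrow Java API; the InputStream source and the "id" column are hypothetical:

```java
import java.io.InputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

public class ArrowStreamExample
{
    // Walks every record batch in an Arrow IPC stream and reads one column.
    static void consume(InputStream input) throws Exception
    {
        try (BufferAllocator allocator = new RootAllocator();
                ArrowStreamReader reader = new ArrowStreamReader(input, allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) { // one ArrowRecordBatch per iteration
                BigIntVector ids = (BigIntVector) root.getVector("id"); // hypothetical column
                for (int i = 0; i < root.getRowCount(); i++) {
                    if (!ids.isNull(i)) {
                        long value = ids.get(i); // reads directly from Arrow's off-heap buffers
                    }
                }
            }
        }
    }
}
```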

ArvinDevel commented 5 years ago

Would support for Arrow format in Hive connector be sufficient?

I'm not familiar with the Hive connector, but I think ORC is the default file format in Hive. ORC is a columnar file format like Parquet, so I think this is different from what I have in mind.

kbkadu commented 5 years ago

I have been thinking about this for some time. Apache Arrow is an in-memory data format, so by itself it is hard to imagine how Presto could leverage it (thinking in a connector context). The only scenario I can imagine is one where a separate program (something like Spark) runs on the same cluster as Presto, generates data in Arrow memory buffers, and Presto directly accesses those buffers as a data source. Am I missing anything?

ArvinDevel commented 5 years ago

The only scenario I can imagine is one where a separate program (something like Spark) runs on the same cluster as Presto, generates data in Arrow memory buffers, and Presto directly accesses those buffers as a data source.

I think this scenario is beyond the connector context; it could be implemented with Apache Arrow's IPC mechanism.

As for an Arrow connector, it would be a connector that receives Arrow messages from another data store and converts them into Presto's internal data structures for computation, since Presto uses its own internal data representation. And I hope Presto can eventually use Arrow's vectors to accelerate computation with SIMD, vectorized execution, and so on. In the Arrow stream scenario, we can think of each Arrow vector/RecordBatch message as a mini-batch of rows organized in columnar format. So it's more like the Kafka connector, and we want to expose Arrow through data stores like Apache BookKeeper or DistributedLog.
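
As a rough sketch of that conversion, here is what copying one Arrow BigIntVector into a Presto Block could look like, assuming the presto-common SPI types (package names vary across Presto versions; older releases use com.facebook.presto.spi.block):

```java
import com.facebook.presto.common.block.Block;
import com.facebook.presto.common.block.BlockBuilder;
import org.apache.arrow.vector.BigIntVector;

import static com.facebook.presto.common.type.BigintType.BIGINT;

public class ArrowToBlock
{
    // Copies one Arrow BigIntVector into a Presto BIGINT Block, preserving nulls.
    static Block toBlock(BigIntVector vector)
    {
        BlockBuilder builder = BIGINT.createBlockBuilder(null, vector.getValueCount());
        for (int i = 0; i < vector.getValueCount(); i++) {
            if (vector.isNull(i)) {
                builder.appendNull();
            }
            else {
                BIGINT.writeLong(builder, vector.get(i));
            }
        }
        return builder.build();
    }
}
```

A connector's page source would assemble such Blocks into Pages. Note that a copy like this gives up Arrow's zero-copy benefit, which is exactly why deeper vectorized integration would be interesting.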

nugend commented 4 years ago

As Apache Arrow is coming up on a 1.0 release, and its IPC format will ostensibly stabilize with a canonical on-disk representation (this is my current understanding, though 1.0 is not out yet and this has not been 100% confirmed), could the viability of this issue be revisited?

The advantage of the Arrow format for storage, as I see it, is that it is directly mappable from disk and fairly performant to work with once mapped. This allows larger-than-memory datasets to be worked on with relative ease (presuming only a subset is actually touched in a given operation).
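
For illustration, a minimal sketch of reading such an on-disk Arrow IPC file lazily with the Arrow Java API (the path is hypothetical; ArrowFileReader loads one record batch at a time, so only the batches you touch are brought into memory, though a literal zero-copy mmap would need extra work):

```java
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.SeekableReadChannel;

public class ArrowFileExample
{
    // Iterates over the record batches of an Arrow IPC file without
    // loading the whole file into memory at once.
    public static void main(String[] args) throws Exception
    {
        try (RootAllocator allocator = new RootAllocator();
                FileChannel channel = FileChannel.open(
                        Paths.get("/data/table.arrow"), StandardOpenOption.READ); // hypothetical path
                ArrowFileReader reader = new ArrowFileReader(new SeekableReadChannel(channel), allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {
                System.out.println("batch with " + root.getRowCount() + " rows");
            }
        }
    }
}
```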

tooptoop4 commented 4 years ago

https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/

yoctottaops commented 4 years ago

Any new development?

heroldus commented 3 years ago

Arrow Flight is the way to go. It's an API built on Google's gRPC and Protobuf that allows parallel connections to the worker nodes. Snowflake and Dremio already support it. It would be great if Presto could do the same. We have Spark jobs reading data from Presto that would definitely benefit from such an implementation. A first version of a Spark Arrow Flight connector is already available.
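
For illustration, a minimal sketch of what fetching results over Arrow Flight could look like with the Arrow Flight Java client; the endpoint, port, and ticket contents are hypothetical, since Presto does not expose a Flight service today:

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.memory.RootAllocator;

public class FlightClientExample
{
    // Opens a Flight stream against a (hypothetical) Flight-enabled worker
    // and iterates over the Arrow record batches it returns.
    public static void main(String[] args) throws Exception
    {
        Location location = Location.forGrpcInsecure("presto-worker.example.com", 47470); // hypothetical endpoint
        try (RootAllocator allocator = new RootAllocator();
                FlightClient client = FlightClient.builder(allocator, location).build()) {
            Ticket ticket = new Ticket("query-handle".getBytes(StandardCharsets.UTF_8)); // hypothetical ticket
            try (FlightStream stream = client.getStream(ticket)) {
                while (stream.next()) {
                    System.out.println("batch with " + stream.getRoot().getRowCount() + " rows");
                }
            }
        }
    }
}
```

Parallelism comes from the Flight model itself: a coordinator can hand out one endpoint per worker, and clients fetch the streams concurrently.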

ckdarby commented 3 years ago

@sopel39 asked: Would support for Arrow format in Hive connector be sufficient?

Yes. My two cents: in my scenario that would make sense, since standard pipeline tools like Flink or Spark can write the Arrow IPC format directly to disk, and having Presto read those directly mappable files through the Hive connector would be a great improvement.

This is actually something we've been hoping to see from the Presto community, as more and more tools are adopting Apache Arrow.

tooptoop4 commented 3 years ago

@ckdarby https://github.com/Praveen2112/presto/tree/arrow_connector/presto-arrow-flight/src/main/java/io/prestosql/plugin/arrow

sopel39 commented 3 years ago

@ckdarby I don't work on this repo anymore. Please file an issue in https://github.com/prestosql/presto

WilliamWhispell commented 3 years ago

So forking and rebranding aside...

Does trinodb or prestodb support an Arrow Flight interface for querying data? Looking over the Trino docs, all I see is JDBC and the CLI. I have an Arrow Flight client and I'd like to retrieve query results in Arrow format.

findepi commented 3 years ago

Does trinodb or prestodb support an Arrow Flight interface for querying data?

@WilliamWhispell You will surely get an answer for prestodb here. You can ask questions about Trino on the Trino project Slack.

WilliamWhispell commented 3 years ago

Thanks @findepi

Does anyone know if prestodb supports an Arrow Flight interface for querying data?

shangxinli commented 2 years ago

Just want to bring back the discussion of the Arrow connector. More and more systems are starting to support Arrow, and Arrow-native storage systems like SkyhookDM are emerging. I would like to hear any concerns about adding support for Presto to use Arrow data sources.

heroldus commented 2 years ago

There is also a relatively new query API for AWS Lake Formation that would be interesting to support.

skairali commented 1 year ago

This is a use case I am interested in as well. It has huge value when it comes to industry data.

skairali commented 10 months ago

@tdcmeehan The issue you tagged is the other direction: Presto exposing an Arrow interface.

This issue is about allowing Presto to connect to Arrow.

If you are OK with it, I can come back with a PR for the connector.

tdcmeehan commented 10 months ago

Please, let's first discuss the design. You can book a time on my calendar. Thank you!