ArvinDevel opened this issue 5 years ago
I'm curious: why can't you use another open format, like ORC or Parquet?
Because I need to transfer a lot of columnar data (think of it as a stream) to users quickly. ORC and Parquet are file formats, so the user would have to extract the data from them, which adds overhead; and before transferring the data I would have to create a dedicated file, which is too much to bear. I chose Arrow because everything the user actually wants is already in the plain Arrow vector data, and Arrow was developed as a neutral memory format that is becoming more popular.
So do you want to import data into Presto, or export it from Presto?
I want to import data into Presto, so the user can use Presto for analysis.
And you don't want to stage the data in the file system in another format, right?
How, then, is your data exposed? How can Presto read it?
In my scenario, it's better to provide the Arrow message to the outside. After importing the Arrow data, the user has full freedom to choose whatever other format is more suitable for the final result. And if Presto had an interface to expose Arrow data, that would be even more attractive.
it's better to provide the Arrow message to the outside.
What do you mean? Does Apache Arrow expose some service (like REST?) from which Presto could fetch data?
Would support for Arrow format in Hive connector be sufficient?
Actually, for reading data from Arrow we just need the memory address; Arrow stores the data off-heap, so that should suffice to read table data in Arrow's format.
@kokosing I don't think Apache Arrow currently exposes a REST-like service, but it can expose a columnar data stream. We can operate on the ArrowRecordBatch or Vector.
Would support for Arrow format in Hive connector be sufficient?
I'm not familiar with the Hive connector, but I believe ORC is the default format in Hive. ORC is a columnar file format like Parquet, so I think this is a different thing.
I have been thinking about this for some time. Apache Arrow is an in-memory data format, so in itself it is hard to imagine how Presto can leverage it (thinking in a connector context). The only scenario I can imagine is one where you have a separate program (something like Spark) running on the same Presto cluster, generating data in Arrow memory buffers, with Presto directly accessing these as a data source. Am I missing anything?
The only scenario I can imagine is one where you have a separate program (something like Spark) running on the same Presto cluster, generating data in Arrow memory buffers, with Presto directly accessing these as a data source.
I think this scenario is beyond the connector context; it could be implemented using Apache Arrow's IPC mechanism.
As for an Arrow connector, it's more like a connector that receives Arrow messages from another data store and converts them into Presto's internal data structures for computation, since Presto uses its own internal data representation. I also hope Presto can eventually use Arrow's vectors to accelerate computation with SIMD, vectorized execution, and so on. In the Arrow stream scenario, we can think of one Arrow vector/RecordBatch message as a mini-batch of rows organized in columnar format. So it's more like the Kafka connector, and we want to expose Arrow through data stores like Apache BookKeeper or DistributedLog.
As Apache Arrow is coming up on a 1.0 release and its IPC format will ostensibly stabilize with a canonical on-disk representation (this is my current understanding, though 1.0 is not out yet and this has not been 100% confirmed), could the viability of this issue be revisited?
The advantage of the Arrow format for storage, as I see it, is that it is directly mappable from disk and fairly performant to work with once mapped. This allows larger-than-memory datasets to be worked on with relative ease (presuming a subset is actually worked with for a given operation).
Any new development?
Arrow Flight is the way to go. It's an API built on gRPC and Protocol Buffers that allows parallel connections to the worker nodes. Snowflake and Dremio already support it. It would be great if Presto could do the same. We have Spark jobs reading data from Presto that would definitely benefit from such an implementation. A first version of a Spark Arrow Flight connector is already available.
@sopel39 asked: Would support for Arrow format in Hive connector be sufficient?
Yes. My two cents: in my scenario that would make sense, since standard pipeline tools like Flink or Spark can write their IPC format directly to disk, and having Presto, through the Hive connector, able to read files that are directly mappable to memory would bring some great improvements.
This is actually something we've been hoping to see from the Presto community as there is more and more adoption of other tools moving forward with Apache Arrow.
@ckdarby I don't work on this repo anymore. Please file an issue in https://github.com/prestosql/presto
So forking and rebranding aside...
Does trinodb or prestodb support an Arrow Flight interface for querying data? Looking over the Trino docs, all I see is JDBC or the CLI. I have an Arrow Flight client and I'd like to query results in Arrow format.
Does trinodb or prestodb support an Arrow Flight interface for querying data?
@WilliamWhispell You will surely get an answer about prestodb here. You can ask questions about Trino on the Trino project Slack.
Thanks @findepi
Does anyone know if prestodb supports an Arrow Flight interface for querying data?
Just want to bring back the discussion about the Arrow connector. Yes, more and more systems are starting to support Arrow. There are even Arrow-native storage systems like SkyHookDM emerging. I'd like to hear any concerns about adding support for Presto to use an Arrow data source.
There is also a relatively new querying API for AWS LakeFormation which would be interesting to support.
This is a use case I am interested in as well. It has huge value when it comes to industry data.
@tdcmeehan So the issue tagged is the other direction: Presto exposing an Arrow interface.
This issue is to allow Presto to connect to Arrow.
If you're OK with it, I can come back with a PR for the connector.
Please, let's first discuss the design. You can book a time on my calendar. Thank you!
I want to expose the column data through the Arrow format and then process it with analytics tools. Is there any plan to ingest column data from Apache Arrow? And furthermore, to use Arrow vectors as the in-memory columnar data representation to avoid the overhead of converting between multiple formats? Any point of view is appreciated.