thanos-community / obslytics

Tools and Services allowing seamless usage of Observability data from Prometheus, Thanos, Cortex, M3DB, Loki and more!
Apache License 2.0

Add output support for Arrow Flight #8

Open bwplotka opened 3 years ago

ozanminez commented 3 years ago

I have recently looked into this topic a bit. These are my findings so far; I will continue to research.

According to the Apache Arrow project's Implementation Status page (https://arrow.apache.org/docs/status.html), there are native implementations for several languages, but the C++ implementation is the main, actively developed one (see the project's Jira: https://issues.apache.org/jira/projects/ARROW/issues/). The Python implementation is a wrapper with bindings to the C++ implementation, so it follows the changes made in C++.

The Go implementation is another natively written one. It was donated by the folks at InfluxData fairly recently (2 or 3 years ago). In my opinion, it does not seem to be actively developed. It provides only the primitive in-memory data types and in-memory object sharing between processes on the same machine, known as IPC (Inter-Process Communication). For file formats, the Go implementation can only read and write CSV; only the C++ implementation (and Python, since it wraps C++) can read and write Parquet.

So currently we have no native Go implementation for either the Parquet file format or Flight RPC.

There is an ongoing effort for a Parquet Go implementation, tracked in this Jira issue: https://issues.apache.org/jira/browse/ARROW-7905 with discussion here: https://discuss.ossdata.org/t/go-apache-arrow-parquet/87 It started in February 2020 and is still in progress; the branch has its last commit from 4 months ago and no pull request yet. This is the repo: https://github.com/nickpoorman/arrow/tree/ARROW-7905-go-parquet/go/parquet

There is also an effort for Flight RPC. It started in April 2020; there is one stale work-in-progress pull request and one ready pull request awaiting merge, last updated this week. https://issues.apache.org/jira/browse/ARROW-8601

There is also the option of binding to the C++ implementation from Go, and a ready-to-use project called CArrow exists for that. This is the repo: https://github.com/353solutions/carrow

So the choice is between waiting for the native Go implementations to mature, or calling C++ code from Go today via the CArrow project.

Additionally, one point worth noting for the future: as far as I understand, the Apache Arrow project focuses on memory management first. That is its main strength, and it is called "the physical layer" in the documentation: https://arrow.apache.org/docs/cpp/overview.html The other layers are built on top of that memory management model:

- The physical layer (memory management)
- The one-dimensional layer (data types, arrays, chunked arrays)
- The two-dimensional layer (schemas, tables, record batches)
- The compute layer (datums, kernels)
- The IO layer (streams)
- The Inter-Process Communication (IPC) layer (messaging format)
- The file formats layer (Parquet, CSV, ORC)
- The devices layer (CUDA)
- The filesystem layer (local file system, S3)
- Apache Arrow Flight (sharing data and messages over the network between server and client)

After a quick look at our current Obslytics code, we roughly have Input (Store API), Dataframe (in-memory object), and Output (writer) layers, in that order. The points below apply only to the Parquet file creation scenario; Apache Arrow Flight is a completely different scenario, since it focuses on sharing data over the network rather than creating files. Mapping our layers onto Arrow's, it would be best to convert to the Arrow memory format as early as possible and stay within Arrow's layers from then on. We could start at the Writer level in our code as an experiment, but I think that would not be as memory-efficient as possible. Converting at the Dataframe + Writer level would be better, but even then we would miss the Store API parts. In the far future (I know this seems impossible and would need a lot of effort), perhaps we could even discuss making the Store API use the Arrow memory format.

I am sharing this comment to give the current picture from my point of view. I will keep watching updates to the Arrow project in the Apache community and experiment with using and integrating Arrow and Arrow Flight.

bwplotka commented 3 years ago

Perfect, thanks for this.

It's currently hard or impossible to move an Arrow memory frame between processes natively. That's why Flight is helpful here.

> Apache Arrow Flight is a completely different scenario because it is focused on data sharing over network not File creation.

True, but think about different use cases. Someone could run a Python app with the Arrow library and point a gRPC client at a Flight endpoint. That call would go to Obslytics, which would convert the data from Prometheus/Thanos efficiently.

That would do the work for a first iteration and allow using other integrations like pandas, Spark, etc. from the Apache Arrow memory model, with the data constructed directly from the gRPC Flight stream, no? (:

ozanminez commented 3 years ago

Yes, I am aware of Arrow Flight's purpose and use case, but I focused on Parquet file creation (and thus the memory format) in my previous comment. So, for now, can we say that we will skip searching for a more efficient way of creating Parquet files with standard Apache Arrow, and instead focus on the Apache Arrow Flight server/client use case of sharing data remotely over the network?

ozanminez commented 3 years ago

Good news :)

The Arrow Flight Go implementation was merged just a couple of days ago. I will experiment with it soon.

https://issues.apache.org/jira/browse/ARROW-8601 https://github.com/apache/arrow/pull/8175