[Feature Request]: Converting the query result to Apache Arrow format in server side.

JinHai-CN commented 1 month ago

Is there an existing issue for the same feature request?

[X] I have checked the existing issues.

Is your feature request related to a problem?

No response

Describe the feature you'd like

Currently, query results are stored in memory in a columnar format. However, the client expects the results in Apache Arrow format. At the moment, the format conversion is executed on the Python client, but this worsens the performance, so we plan to convert the results to Apache Arrow format on the server side before sending them to the client.

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

niebayes commented 1 month ago

@JinHai-CN Hi, I did some investigation on the codebase, mainly focusing on the protocol layer and execution engine of Infinity. Regarding this issue, I have some questions:

Are we planning to use Arrow in-memory format to replace the current in-memory columnar format, or do we only need to support converting the in-memory columnar format to Arrow in-memory format while returning the query result?
Why does our client need to convert query results to Arrow format? Is it for interacting with Infinity using Python SDK in applications integrated with Arrow? I noticed that our examples and tests barely use the to_arrow method.
Why does converting to Arrow format on the client side degrade performance? Here is my understanding:
- The conversion from Pandas DataFrame or Polar DataFrame to Arrow format requires serialization and deserialization of metadata and data.
- Conversely, if we serialize query result to Thrift or HTTP response using Arrow IPC protocol on the server side, and then deserialize the result using Arrow IPC on the client side, we only pay the cost of serializing and deserializing the metadata since Arrow IPC protocol does not need to serialize the deserialize the data.
- In fact, serialization and deserialization overhead for metadata in Arrow IPC protocol can be eliminated in certain scenarios. I have implemented this optimization for GreptimeDB before.
- I am not sure if Thrift and HTTP protocols themselves perform additional serialization and deserialization. If we can support the Arrow Flight protocol, we do not need to worry about this overhead.
Is it mandatory for the server to convert query results to Arrow format? Is this conversion a default behavior, or is it optional? If optional, is it configured via server config, or controlled through options sent in client requests?
Do all protocols need to convert query results to Arrow? Does the Query Result in the Embedded API need conversion? Where is this conversion performed for the Embedded API?
What is the recommended way to implement such a conversion? How about adding a pluggable middleware in the server side?

JinHai-CN commented 1 month ago

@JinHai-CN Hi, I did some investigation on the codebase, mainly focusing on the protocol layer and execution engine of Infinity. Regarding this issue, I have some questions:

Are we planning to use Arrow in-memory format to replace the current in-memory columnar format, or do we only need to support converting the in-memory columnar format to Arrow in-memory format while returning the query result?

The plan has two steps:

Uses arrow format as the query result and transfer it to the client.
Replaces the in-memory columnar format with Arrow format.

Why does our client need to convert query results to Arrow format? Is it for interacting with Infinity using Python SDK in applications integrated with Arrow? I noticed that our examples and tests barely use the to_arrow method.

Yes, the examples are barely use to_arrow method. But arrow or data-frame are massively used in most production environments.

Why does converting to Arrow format on the client side degrade performance? Here is my understanding:

The conversion from Pandas DataFrame or Polar DataFrame to Arrow format requires serialization and deserialization of metadata and data.

Conversely, if we serialize query result to Thrift or HTTP response using Arrow IPC protocol on the server side, and then deserialize the result using Arrow IPC on the client side, we only pay the cost of serializing and deserializing the metadata since Arrow IPC protocol does not need to serialize the deserialize the data.

In fact, serialization and deserialization overhead for metadata in Arrow IPC protocol can be eliminated in certain scenarios. I have implemented this optimization for GreptimeDB before.

I am not sure if Thrift and HTTP protocols themselves perform additional serialization and deserialization. If we can support the Arrow Flight protocol, we do not need to worry about this overhead.

Yes, your understanding is correct. We plan to support Arrow flight protocol to eliminate the serialization and deserialization cost.

And the HTTP API didn't use thrift in the past, and won't use Arrow flight protocol in the future.

Is it mandatory for the server to convert query results to Arrow format? Is this conversion a default behavior, or is it optional? If optional, is it configured via server config, or controlled through options sent in client requests?

Mandatory.

Do all protocols need to convert query results to Arrow? Does the Query Result in the Embedded API need conversion? Where is this conversion performed for the Embedded API?

Excepts HTTP API, it shall works on all RPC protocols and Embedded API of python SDK.

What is the recommended way to implement such a conversion? How about adding a pluggable middleware in the server side?

Adds an Arrow conversion layer in server side.
Add Arrow flight protocol to infinity server and python client.
If all function works, benchmark result is good and all test cases are passed, apache thrift related code can be removed.

niebayes commented 1 month ago

@JinHai-CN Please assign me. I will start by creating a PoC to add support for converting to the Arrow format in the server side and for integrating the Arrow Flight protocol in both sides.

infiniflow / infinity