Open JinHai-CN opened 1 month ago
@JinHai-CN Hi, I did some investigation on the codebase, mainly focusing on the protocol layer and execution engine of Infinity. Regarding this issue, I have some questions:
to_arrow
method.@JinHai-CN Hi, I did some investigation on the codebase, mainly focusing on the protocol layer and execution engine of Infinity. Regarding this issue, I have some questions:
- Are we planning to use Arrow in-memory format to replace the current in-memory columnar format, or do we only need to support converting the in-memory columnar format to Arrow in-memory format while returning the query result?
The plan has two steps:
- Why does our client need to convert query results to Arrow format? Is it for interacting with Infinity using Python SDK in applications integrated with Arrow? I noticed that our examples and tests barely use the
to_arrow
method.
Yes, the examples are barely use to_arrow method. But arrow or data-frame are massively used in most production environments.
Why does converting to Arrow format on the client side degrade performance? Here is my understanding:
- The conversion from Pandas DataFrame or Polar DataFrame to Arrow format requires serialization and deserialization of metadata and data.
- Conversely, if we serialize query result to Thrift or HTTP response using Arrow IPC protocol on the server side, and then deserialize the result using Arrow IPC on the client side, we only pay the cost of serializing and deserializing the metadata since Arrow IPC protocol does not need to serialize the deserialize the data.
- In fact, serialization and deserialization overhead for metadata in Arrow IPC protocol can be eliminated in certain scenarios. I have implemented this optimization for GreptimeDB before.
- I am not sure if Thrift and HTTP protocols themselves perform additional serialization and deserialization. If we can support the Arrow Flight protocol, we do not need to worry about this overhead.
Yes, your understanding is correct. We plan to support Arrow flight protocol to eliminate the serialization and deserialization cost.
And the HTTP API didn't use thrift in the past, and won't use Arrow flight protocol in the future.
- Is it mandatory for the server to convert query results to Arrow format? Is this conversion a default behavior, or is it optional? If optional, is it configured via server config, or controlled through options sent in client requests?
Mandatory.
- Do all protocols need to convert query results to Arrow? Does the Query Result in the Embedded API need conversion? Where is this conversion performed for the Embedded API?
Excepts HTTP API, it shall works on all RPC protocols and Embedded API of python SDK.
- What is the recommended way to implement such a conversion? How about adding a pluggable middleware in the server side?
@JinHai-CN Please assign me. I will start by creating a PoC to add support for converting to the Arrow format in the server side and for integrating the Arrow Flight protocol in both sides.
Is there an existing issue for the same feature request?
Is your feature request related to a problem?
No response
Describe the feature you'd like
Currently, query results are stored in memory in a columnar format. However, the client expects the results in Apache Arrow format. At the moment, the format conversion is executed on the Python client, but this worsens the performance, so we plan to convert the results to Apache Arrow format on the server side before sending them to the client.
Describe implementation you've considered
No response
Documentation, adoption, use case
No response
Additional information
No response