Leverage Apache Arrow to boost performance and embrace warehouse/big data ecosystems

zz-jason commented 3 years ago

Feature Request

Is your feature request related to a problem? Please describe:

TiDB adopts a vectorized execution engine method to improve query performance. Data is organized in a columnar data structure called Chunk, which is a columnar layout like Apache Arrow.

The problems for us are: We have to maintain Chunk ourselves.

If we can leverage Arrow in TiDB, we can build the library and ecosystem together with the Arrow community.

Describe the feature you'd like:

Support Apache Arrow, there are several use cases as described in Arrow use case document:

Reading/writing columnar storage formats: enables efficient data federation.
Sharing memory locally: enables TiDB to efficiently work with data bigger than the available memory.
Moving data over the network: exchange data between TiDB and TiKV Coprocessor.
In-memory data structure for analytics: TiDB execution engine and TiKV Coprocessor.

It's strongly recommended to also support Arrow in TiFlash.

Describe alternatives you've considered:

N/A

Teachability, Documentation, Adoption, Migration Strategy:

Things need to consider after Apache Arrow is supported in TiDB:

When to update the Arrow dependency?
What tests do we need to safely update the Arrow dependency?

zz-jason commented 3 years ago

ilovesoup commented 3 years ago

We tested arrow on tiflash before and it burned quite a lot CPU and we abandoned it. If it is to connect to external eco system I suggest having extra interfaces to do it and only use arrow on the border of these interfaces. Binding internal representation to arrow or using it btw internal nodes do not sound a good idea to me. It will largely prevent you from further internal optimizations inside your engine (unless we plan no more in the future). If you really decide to go forward, maybe a benchmark or prototype is needed.

innerr commented 3 years ago

In my experience, Arrow is not fit for data transmission, it's too heavy on encoding (people can easily just focus on the bright side, the decoding).

Back in TiFlash @ilovesoup mentioned above, when transmiting about 900M/s data between TiFlash and TiSpark, Arrow encoding ate up 12 cores, in the TPCH benchmark.

xhochy commented 3 years ago

@ilovesoup @innerr can you enlighten me what you mean with encoding? I would accept that for a format like Parquet but in Arrow the data is quite "plain" in memory and the only thing that would roughly be some sort of encoding would be the validity bitmap. That is though extremely optimized at least in the C++ implementation.

pingcap / tidb

Leverage Apache Arrow to boost performance and embrace warehouse/big data ecosystems #21186

Feature Request