pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.87k stars 5.8k forks source link

Leverage Apache Arrow to boost performance and embrace warehouse/big data ecosystems #21186

Open zz-jason opened 3 years ago

zz-jason commented 3 years ago

Feature Request

Is your feature request related to a problem? Please describe:

TiDB adopts a vectorized execution engine method to improve query performance. Data is organized in a columnar data structure called Chunk, which is a columnar layout like Apache Arrow.

The problems for us are: We have to maintain Chunk ourselves.

If we can leverage Arrow in TiDB, we can build the library and ecosystem together with the Arrow community.

Describe the feature you'd like:

Support Apache Arrow, there are several use cases as described in Arrow use case document:

It's strongly recommended to also support Arrow in TiFlash.

Describe alternatives you've considered:

N/A

Teachability, Documentation, Adoption, Migration Strategy:

Things need to consider after Apache Arrow is supported in TiDB:

zz-jason commented 3 years ago

related to https://github.com/pingcap/tidb/issues/21056

ilovesoup commented 3 years ago

We tested arrow on tiflash before and it burned quite a lot CPU and we abandoned it. If it is to connect to external eco system I suggest having extra interfaces to do it and only use arrow on the border of these interfaces. Binding internal representation to arrow or using it btw internal nodes do not sound a good idea to me. It will largely prevent you from further internal optimizations inside your engine (unless we plan no more in the future). If you really decide to go forward, maybe a benchmark or prototype is needed.

innerr commented 3 years ago

In my experience, Arrow is not fit for data transmission, it's too heavy on encoding (people can easily just focus on the bright side, the decoding).

Back in TiFlash @ilovesoup mentioned above, when transmiting about 900M/s data between TiFlash and TiSpark, Arrow encoding ate up 12 cores, in the TPCH benchmark.

xhochy commented 3 years ago

@ilovesoup @innerr can you enlighten me what you mean with encoding? I would accept that for a format like Parquet but in Arrow the data is quite "plain" in memory and the only thing that would roughly be some sort of encoding would be the validity bitmap. That is though extremely optimized at least in the C++ implementation.