Open zz-jason opened 3 years ago
related to https://github.com/pingcap/tidb/issues/21056
We tested Arrow on TiFlash before; it burned quite a lot of CPU and we abandoned it. If the goal is to connect to an external ecosystem, I suggest adding extra interfaces for that and using Arrow only at the border of those interfaces. Binding the internal representation to Arrow, or using it between internal nodes, does not sound like a good idea to me. It will largely prevent further internal optimizations inside the engine (unless we plan no more of them in the future). If you really decide to go forward, a benchmark or prototype is probably needed first.
In my experience, Arrow is not a good fit for data transmission: it is too heavy on encoding (people tend to focus only on the bright side, the decoding).
In the TiFlash test that @ilovesoup mentioned above, when transmitting about 900M/s of data between TiFlash and TiSpark, Arrow encoding ate up 12 cores in the TPC-H benchmark.
@ilovesoup @innerr can you enlighten me on what you mean by encoding? I would accept that for a format like Parquet, but in Arrow the data is quite "plain" in memory, and the only thing that roughly counts as some sort of encoding is the validity bitmap. That, though, is extremely optimized, at least in the C++ implementation.
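To make the "plain in memory" point concrete, here is a minimal sketch of how a nullable 64-bit integer column can be laid out in the Arrow style: a validity bitmap (one bit per value, least-significant bit first) plus a contiguous buffer of little-endian values. The function names `build_int64_column` and `read_int64` are hypothetical and are not part of any Arrow library; the sketch also omits the buffer padding and alignment that real Arrow buffers use.

```python
import struct


def build_int64_column(values):
    """Build (validity_bitmap, data_buffer) from a list of ints or None.

    Hypothetical helper for illustration only, not an Arrow API.
    """
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot
    data = bytearray()
    for i, v in enumerate(values):
        if v is None:
            # A null slot still occupies 8 bytes; its content is unspecified,
            # so this sketch simply writes zero.
            data += struct.pack("<q", 0)
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid (LSB first)
            data += struct.pack("<q", v)
    return bytes(bitmap), bytes(data)


def read_int64(bitmap, data, i):
    """Return the value in slot i, or None if its validity bit is unset."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return struct.unpack_from("<q", data, i * 8)[0]
```

Under this layout, producing the column is mostly copying fixed-width values and setting bits, so any encoding cost would have to come from converting from a different internal representation rather than from the format itself.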
Feature Request
Is your feature request related to a problem? Please describe:
TiDB uses a vectorized execution engine to improve query performance. Data is organized in a columnar data structure called Chunk, which has a columnar layout similar to Apache Arrow. The problem for us is that we have to maintain Chunk ourselves. If we can leverage Arrow in TiDB, we can build the library and ecosystem together with the Arrow community.
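As a rough illustration of what "columnar layout" means here, the sketch below stores one list per column and lets an operator process whole columns per call instead of one row at a time. `ColumnarChunk` and `add_columns` are hypothetical names invented for this example; this is not TiDB's actual Chunk implementation, nor an Arrow API.

```python
class ColumnarChunk:
    """Hypothetical columnar batch: one Python list per column."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}

    def append_row(self, row):
        """Scatter one row's values into the per-column lists."""
        if len(row) != len(self.columns):
            raise ValueError("row width does not match column count")
        for col, value in zip(self.columns.values(), row):
            col.append(value)

    def num_rows(self):
        """Number of rows, taken from the length of the first column."""
        return len(next(iter(self.columns.values()), []))


def add_columns(chunk, left, right):
    """Vectorized-style operator: handles two whole columns in one call."""
    return [a + b for a, b in zip(chunk.columns[left], chunk.columns[right])]


chunk = ColumnarChunk(["a", "b"])
chunk.append_row((1, 10))
chunk.append_row((2, 20))
total = add_columns(chunk, "a", "b")
```

A real engine would back each column with a typed, contiguous buffer rather than a Python list; the point of the sketch is only the column-at-a-time access pattern that both Chunk and Arrow are built around.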
Describe the feature you'd like:
Support Apache Arrow. There are several use cases, as described in the Arrow use case document:
It's strongly recommended to also support Arrow in TiFlash.
Describe alternatives you've considered:
N/A
Teachability, Documentation, Adoption, Migration Strategy:
Things to consider after Apache Arrow is supported in TiDB: