How about sampling just one file? Sample the first file and calculate a compression ratio, and use this compression ratio to estimate the remaining files.
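The suggestion above could be sketched roughly as follows. This is a minimal illustration, not Lightning's actual code: `fileMeta`, `sampleFn`, and `estimateRealSizes` are hypothetical names, and the sampler is assumed to return the estimated uncompressed size of a single file.

```go
package main

import "fmt"

// fileMeta is a hypothetical stand-in for a source-file descriptor.
type fileMeta struct {
	Path     string
	FileSize int64 // on-disk (compressed) size in bytes
}

// sampleFn is a hypothetical sampler that returns the estimated
// uncompressed size of one file. This is the slow step.
type sampleFn func(f fileMeta) (int64, error)

// estimateRealSizes samples only the first non-empty file, derives a
// compression ratio from it, and applies that ratio to every file.
func estimateRealSizes(files []fileMeta, sample sampleFn) ([]int64, error) {
	sizes := make([]int64, len(files))
	ratio := 1.0 // fallback when there is no non-empty file to sample
	for _, f := range files {
		if f.FileSize <= 0 {
			continue
		}
		realSize, err := sample(f)
		if err != nil {
			return nil, err
		}
		ratio = float64(realSize) / float64(f.FileSize)
		break // one sample is enough for the estimate
	}
	for i, f := range files {
		sizes[i] = int64(float64(f.FileSize) * ratio)
	}
	return sizes, nil
}

func main() {
	files := []fileMeta{
		{Path: "t.0001.parquet", FileSize: 100},
		{Path: "t.0002.parquet", FileSize: 200},
	}
	// Fake sampler for the demo: pretends every file expands 4x.
	sizes, err := estimateRealSizes(files, func(f fileMeta) (int64, error) {
		return f.FileSize * 4, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes)
}
```

The trade-off is accuracy: the estimate is only as good as the assumption that all files compress about as well as the sampled one. That is likely acceptable here, since the size is used only for progress display and as a reference when splitting engines.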
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Since https://github.com/pingcap/tidb/pull/46984 (included in >= v7.5.x and 7.1.3+), Lightning always samples every parquet file, one by one in serial, before it starts importing. This is very slow and might take hours if the user has a large number of parquet files, and the time spent sampling the files might even be longer than the real import work.
We only need this size to display progress more accurately and to use as a reference when splitting engines, so slowing the import down this much is unacceptable.
2. What did you expect to see? (Required)
The import starts quickly.
3. What did you see instead (Required)
It might take hours before any import work starts.
4. What is your TiDB version? (Required)