lightning might stuck for hours when importing parquet files from cloud storage

D3Hunter commented 3 days ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

after https://github.com/pingcap/tidb/pull/46984, including >= v7.5.x, 7.1.3+, we will sample parquet all the time in serial, it's very slow and might takes hours if user have large mount of parquet files before lightning start doing import, and the time takes to sample the files might even longer than real import work.

we only need this size for displaying progress more accurately and use it as a reference when splitting engine, but slowing import this much is un-acceptable.

2. What did you expect to see? (Required)

start import fast

3. What did you see instead (Required)

it might takes hours before start doing any work

4. What is your TiDB version? (Required)

D3Hunter commented 3 days ago

it's a enhance type of bug, it need to pick back

zeminzhou commented 1 day ago

How about sampling just one file? Sample the first file and calculate a compression ratio, and use this compression ratio to estimate the remaining files.

D3Hunter commented 1 day ago

lgtm

pingcap / tidb