trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

system.sync_partition_metadata is extremely slow for tables with a large number of partitions #6098

Open pPanda-beta opened 3 years ago

pPanda-beta commented 3 years ago

Where 'list' operations are expensive, as they are on some cloud storage systems, this sequential approach becomes a bottleneck. The following lines of code discover partitions sequentially and recursively:

https://github.com/prestosql/presto/blob/88d5d90aa147e1c170eb3e3d0fa3ab74c5a59d67/presto-hive/src/main/java/io/prestosql/plugin/hive/procedure/SyncPartitionMetadataProcedure.java#L168-L171

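This can easily be parallelized using any of the following:

  1. Multithreading
  2. or Java NIO (or any kind of non-blocking I/O)
  3. or fibers (Project Loom virtual threads, which are not part of Java 10 and only became final in Java 21)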

As a basic first refactoring, we could start by using parallelStream() instead of stream().
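
To illustrate the parallelStream() idea, here is a minimal sketch, not the actual procedure code. It assumes a local directory tree read through java.nio.file as a stand-in for the Hadoop FileSystem that the Hive connector uses, and the names PartitionLister, listPartitions and listDirectories are hypothetical. Each level lists its child directories, then recurses into the children in parallel instead of one after another.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only; not part of Trino.
final class PartitionLister {
    private PartitionLister() {}

    /**
     * Returns the directories exactly {@code depth} levels below {@code root}.
     * Sibling directories at each level are explored in parallel.
     */
    static List<Path> listPartitions(Path root, int depth) {
        if (depth == 0) {
            return List.of(root);
        }
        return listDirectories(root)
                .parallelStream() // the only change from a sequential stream()
                .flatMap(child -> listPartitions(child, depth - 1).stream())
                .collect(Collectors.toList());
    }

    // One 'list' call: the expensive operation on cloud storage.
    private static List<Path> listDirectories(Path directory) {
        try (Stream<Path> entries = Files.list(directory)) {
            return entries
                    .filter(Files::isDirectory)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One caveat with this approach: parallel streams run on the common ForkJoinPool, whose size defaults to roughly the number of CPU cores, so the speed-up for blocking I/O is bounded by that. A dedicated executor with more threads may suit slow 'list' calls better.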

pPanda-beta commented 3 years ago

/keep_alive

pPanda-beta commented 3 years ago

/keep_fresh

pPanda-beta commented 3 years ago

/keep_fresh