system.sync_partition_metadata is extremely slow for tables with large no of partitions

pPanda-beta commented 3 years ago

On cloud storage where 'list' operations are costly, this sequential approach becomes a bottleneck. The following lines of code discover partitions sequentially and recursively.

https://github.com/prestosql/presto/blob/88d5d90aa147e1c170eb3e3d0fa3ab74c5a59d67/presto-hive/src/main/java/io/prestosql/plugin/hive/procedure/SyncPartitionMetadataProcedure.java#L168-L171

This can be easily parallelized using any of the following:

Multithreading
Java NIO (or any kind of non-blocking I/O)
Fibers (Project Loom virtual threads)

Suggestion: as a basic first refactoring, we could start with parallelStream() instead of stream().
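
To make the parallelStream() idea concrete, here is a minimal sketch, not the actual procedure code. It assumes a local directory tree read through java.nio.file as a stand-in for the Hadoop FileSystem listing. The class name ParallelPartitionLister and the method listPartitions are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration: recursive partition discovery where sibling
// directories are listed in parallel instead of one after another.
class ParallelPartitionLister {

    // Returns every directory exactly `depth` levels below `current`.
    // `depth` plays the role of the number of remaining partition columns.
    static List<Path> listPartitions(Path current, int depth) {
        if (depth == 0) {
            return List.of(current);
        }
        List<Path> children;
        // Files.list stands in for the costly remote 'list' call.
        try (Stream<Path> entries = Files.list(current)) {
            children = entries.filter(Files::isDirectory).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // parallelStream() lets the recursive listings of siblings overlap.
        return children.parallelStream()
                .flatMap(child -> listPartitions(child, depth - 1).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Build a small year=/month= layout in a temporary directory.
        Path root = Files.createTempDirectory("table");
        for (String year : List.of("year=2019", "year=2020")) {
            for (String month : List.of("month=01", "month=02", "month=03")) {
                Files.createDirectories(root.resolve(year).resolve(month));
            }
        }
        List<Path> partitions = listPartitions(root, 2);
        System.out.println(partitions.size() + " partitions found");
    }
}
```

One caveat with this approach: parallelStream() runs on the common ForkJoinPool, which is sized for CPU-bound work, so for blocking 'list' calls a dedicated executor with a configurable thread count may be a better fit.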

pPanda-beta commented 3 years ago

/keep_alive

pPanda-beta commented 3 years ago

/keep_fresh

pPanda-beta commented 3 years ago

/keep_fresh

trinodb / trino

system.sync_partition_metadata is extremely slow for tables with large no of partitions #6098