prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Add support for Glue catalog in Iceberg connector #20423

Open tdcmeehan opened 1 year ago

tdcmeehan commented 1 year ago

Iceberg deployments may use multiple catalogs. We should add support for additional catalogs so users don't need to migrate their catalog to begin using Presto on Iceberg.

This issue tracks the implementation of the currently missing Glue catalog.

agrawalreetika commented 1 year ago

@tdcmeehan Wanted to confirm: is this about supporting Glue as the metastore for Iceberg tables? If so, that's already supported: https://prestodb.io/docs/current/connector/iceberg.html#glue-catalog

jasonf20 commented 11 months ago

@agrawalreetika As far as I can tell, that uses the HiveTableOperations backed by Glue, instead of Iceberg's GlueTableOperations.

@tdcmeehan I think the priority of this might be higher, since I have tables that take very long to get past the planning stage (~10 minutes), while in Trino, using the native GlueCatalog, I don't have this issue at all. I think this is related to using the HiveTableOperations.

jasonf20 commented 11 months ago

Actually, the slowness seems to come from not caching the tables used during the query. Trino caches the TableMetadata object and reuses it throughout the query. For tables with a large TableMetadata this cache is really important: it changes the planning phase from minutes to seconds.
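
To illustrate the idea, here is a minimal sketch of a per-query metadata cache. It is not Trino's or Presto's actual code: the class `QueryScopedMetadataCache`, the loader function, and the table names are hypothetical, and a plain string stands in for the real TableMetadata object. The point is only that the expensive load happens once per table per query, and later lookups reuse the result.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical per-query cache; one instance would be created for each query.
final class QueryScopedMetadataCache<M> {
    private final Map<String, M> cache = new ConcurrentHashMap<>();
    private final Function<String, M> loader;

    QueryScopedMetadataCache(Function<String, M> loader) {
        this.loader = loader;
    }

    // Runs the loader on the first request for a table; later requests reuse the cached value.
    M get(String tableName) {
        return cache.computeIfAbsent(tableName, loader);
    }
}

public class MetadataCacheDemo {
    public static void main(String[] args) {
        AtomicInteger remoteCalls = new AtomicInteger();

        // The loader stands in for an expensive catalog call plus metadata file read.
        QueryScopedMetadataCache<String> cache = new QueryScopedMetadataCache<>(table -> {
            remoteCalls.incrementAndGet();
            return "metadata-for-" + table;
        });

        // Planning typically asks for the same table's metadata many times.
        for (int i = 0; i < 5; i++) {
            cache.get("db.events");
        }
        cache.get("db.users");

        System.out.println("remote calls: " + remoteCalls.get()); // prints "remote calls: 2"
    }
}
```

Scoping the cache to a single query keeps the sketch simple: entries are dropped with the query, so there is no cross-query invalidation to reason about.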

tdcmeehan commented 11 months ago

CC: @agrawalreetika, who has been looking into table metadata caching and, I believe, has a prototype and observed a similar speedup.

agrawalreetika commented 11 months ago

@tdcmeehan @jasonf20 Yes, I have had similar findings: repeated metadata calls cause higher planning time and, eventually, higher query execution time. This especially slows down query execution on Iceberg native catalogs. I have a prototype that could help reduce these metadata calls per query, which would drastically reduce query execution time for both Hive and native Iceberg catalogs. I will create a PR for this and list all the details soon.