trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Hive query with Glue may fail with: Cannot make progress retrieving partitions. Unable to retrieve partitions #12516

Open findepi opened 2 years ago

findepi commented 2 years ago

Observed on CI on version 380

Error:  Tests run: 115, Failures: 1, Errors: 0, Skipped: 32, Time elapsed: 1,194.406 s <<< FAILURE! - in io.trino.plugin.hive.metastore.glue.TestHiveGlueMetastore
Error:  io.trino.plugin.hive.metastore.glue.TestHiveGlueMetastore.testPartitionStatisticsSampling  Time elapsed: 16.935 s  <<< FAILURE!
io.trino.spi.TrinoException: Cannot make progress retrieving partitions. Unable to retrieve partitions: [{Values: [2016-01-01]}]
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.batchGetPartition(GlueHiveMetastore.java:890)
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.updatePartitionStatisticsBatch(GlueHiveMetastore.java:363)
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.lambda$updatePartitionStatistics$8(GlueHiveMetastore.java:352)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.updatePartitionStatistics(GlueHiveMetastore.java:351)
    at io.trino.plugin.hive.metastore.HiveMetastore.updatePartitionStatistics(HiveMetastore.java:55)
    at io.trino.plugin.hive.HiveMetastoreClosure.updatePartitionStatistics(HiveMetastoreClosure.java:115)
    at io.trino.plugin.hive.AbstractTestHive.testPartitionStatisticsSampling(AbstractTestHive.java:3428)
    at io.trino.plugin.hive.AbstractTestHive.testPartitionStatisticsSampling(AbstractTestHive.java:3417)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

The responsible line is here: https://github.com/trinodb/trino/blob/96a8f775f8763941deb9f3d2fb999d3a88015113/plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/glue/GlueHiveMetastore.java#L887-L890

Follows #10696

findepi commented 2 years ago

The throwing code is a safety check that prevents an infinite loop of sending the same request over and over. We assumed batchGetPartitionAsync would always process at least one partition.

@pettyjamesm was the assumption wrong, or is it a Glue service's fault?
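
For readers following along, the guard described above can be sketched roughly as follows. This is a simplified illustration, not the actual GlueHiveMetastore code: `BatchFetchSketch`, `BatchResult`, and `fetchAll` are hypothetical names, partition keys are plain strings, the Glue client is replaced by a `Function`, and `IllegalStateException` stands in for `TrinoException`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Trino code base.
final class BatchFetchSketch
{
    private BatchFetchSketch() {}

    // Simplified stand-in for a batch response: what was returned and what was left unprocessed.
    static final class BatchResult
    {
        final List<String> fetched;
        final List<String> unprocessed;

        BatchResult(List<String> fetched, List<String> unprocessed)
        {
            this.fetched = fetched;
            this.unprocessed = unprocessed;
        }
    }

    // Re-sends the unprocessed keys until none remain, failing fast when a round makes no progress.
    static List<String> fetchAll(List<String> keys, Function<List<String>, BatchResult> service)
    {
        List<String> results = new ArrayList<>();
        List<String> pending = new ArrayList<>(keys);
        while (!pending.isEmpty()) {
            BatchResult response = service.apply(List.copyOf(pending));
            results.addAll(response.fetched);
            // Guard: if every requested key came back unprocessed, the next request
            // would be identical to this one, so looping again could never terminate.
            if (response.unprocessed.size() >= pending.size()) {
                throw new IllegalStateException(
                        "Cannot make progress retrieving partitions. Unable to retrieve partitions: " + response.unprocessed);
            }
            pending = new ArrayList<>(response.unprocessed);
        }
        return results;
    }

    public static void main(String[] args)
    {
        // A service that processes one key per call makes progress on every round.
        Function<List<String>, BatchResult> oneAtATime =
                batch -> new BatchResult(batch.subList(0, 1), batch.subList(1, batch.size()));
        System.out.println(fetchAll(List.of("2016-01-01", "2016-01-02"), oneAtATime));

        // A service that never processes anything trips the guard instead of looping forever.
        Function<List<String>, BatchResult> stuck = batch -> new BatchResult(List.of(), batch);
        try {
            fetchAll(List.of("2016-01-01"), stuck);
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The second case in `main` mirrors the failure reported here: a single requested partition comes back as unprocessed, so the guard fires on the first round.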

pettyjamesm commented 2 years ago

I'm not sure about that, actually. I know that the unprocessed keys list can be non-empty based on undocumented characteristics of the response, but my understanding is that the mechanism exists to prevent overly large response payloads. Since this is a CI run, I doubt that the specific partition in question was so large that a single partition wouldn't be returned, so I wonder whether this is actually a case of "partition does not exist", with the Glue API choosing to return requested partitions that do not exist as "unprocessed" instead of throwing an EntityNotFound exception like you might expect for a single get partition request.

findepi commented 2 years ago

> so I wonder whether this is actually a case of "partition does not exist"

The CI failed at TestHiveGlueMetastore>AbstractTestHive.testPartitionStatisticsSampling:3417. The table has two partitions, and I don't think the behavior can vary from run to run.

I think we should assume the partition existed, unless there is some other bug and the table wasn't created correctly.

@pettyjamesm can this be rate-limiting related?

findepi commented 2 years ago

It seems we're seeing this internally on CI (so on more than just one occasion).

@pettyjamesm is it a Glue bug? When will it be fixed?

pettyjamesm commented 2 years ago

I have some more context now about how this might happen after talking with someone on the Glue team. That said, it would be great if we could nail down how exactly this was being triggered before trying to put together a fix to address this. Can you provide, for any occurrence of this issue, the following data points so that we can see what additional details we might be able to get before I attempt a code change?

findepi commented 2 years ago

Shared the info offline.

findepi commented 2 years ago

cc @ppalucha

findepi commented 2 years ago

Also encountered by Rodrigo as discussed here https://trinodb.slack.com/archives/CGB0QHWSW/p1662650934614339

io.trino.spi.TrinoException: Cannot make progress retrieving partitions. Unable to retrieve partitions: [{Values: [2019-02-21, 19]}.......

hashhar commented 1 year ago

Reported again at https://trinodb.slack.com/archives/CGB0QHWSW/p1686219201862739

coderbhupendra commented 1 year ago

On random days we get a HIVE_METASTORE_ERROR on different Delta Lake tables. The same query runs fine on the next run, and the tables' underlying partitions are not being updated while the query is running.

io.trino.spi.TrinoException: Cannot make progress retrieving partitions. Unable to retrieve partitions: [{Values: [2023, 05, 07, 16]}, {Values: [2023, 05, 09, 05]}, {Values: [2023, 05, 10, 10]}, {Values: [2023, 05, 07, 07]}, {Values: [2023, 05, 10, 09]}, {Values: [2023, 05, 07, 13]}, {Values: [2023, 05, 09, 18]}, {Values: [2023, 05, 07, 17]}, {Values: [2023, 05, 07, 10]}, {Values: [2023, 05, 09, 22]}, {Values: [2023, 05, 08, 13]}, {Values: [2023, 05, 07, 15]}, {Values: [2023, 05, 08, 11]}, {Values: [2023, 05, 09, 15]}, {Values: [2023, 05, 10, 23]}, {Values: [2023, 05, 09, 12]}, {Values: [2023, 05, 10, 05]}, {Values: [2023, 05, 08, 15]}, {Values: [2023, 05, 07, 12]}, {Values: [2023, 05, 10, 04]}, {Values: [2023, 05, 09, 17]}, {Values: [2023, 05, 08, 10]}, {Values: [2023, 05, 10, 03]}, {Values: [2023, 05, 10, 21]}, {Values: [2023, 05, 08, 04]}, {Values: [2023, 05, 07, 06]}, {Values: [2023, 05, 08, 05]}, {Values: [2023, 05, 09, 01]}, {Values: [2023, 05, 09, 11]}, {Values: [2023, 05, 11, 06]}, {Values: [2023, 05, 09, 03]}, {Values: [2023, 05, 11, 09]}, {Values: [2023, 05, 07, 09]}, {Values: [2023, 05, 10, 11]}, {Values: [2023, 05, 08, 08]}, {Values: [2023, 05, 10, 13]}, {Values: [2023, 05, 08, 18]}, {Values: [2023, 05, 09, 16]}, {Values: [2023, 05, 10, 16]}, {Values: [2023, 05, 07, 14]}, {Values: [2023, 05, 11, 00]}, {Values: [2023, 05, 09, 06]}, {Values: [2023, 05, 09, 20]}, {Values: [2023, 05, 10, 12]}, {Values: [2023, 05, 08, 12]}, {Values: [2023, 05, 09, 14]}, {Values: [2023, 05, 09, 00]}, {Values: [2023, 05, 08, 07]}, {Values: [2023, 05, 08, 21]}, {Values: [2023, 05, 09, 04]}, {Values: [2023, 05, 08, 02]}, {Values: [2023, 05, 08, 01]}, {Values: [2023, 05, 08, 23]}, {Values: [2023, 05, 09, 23]}, {Values: [2023, 05, 10, 19]}, {Values: [2023, 05, 07, 19]}, {Values: [2023, 05, 08, 20]}, {Values: [2023, 05, 10, 17]}, {Values: [2023, 05, 10, 20]}, {Values: [2023, 05, 10, 06]}, {Values: [2023, 05, 07, 18]}, {Values: [2023, 05, 09, 07]}, {Values: [2023, 05, 10, 07]}, 
{Values: [2023, 05, 07, 21]}, {Values: [2023, 05, 08, 17]}, {Values: [2023, 05, 10, 01]}, {Values: [2023, 05, 10, 15]}, {Values: [2023, 05, 10, 22]}, {Values: [2023, 05, 08, 16]}, {Values: [2023, 05, 09, 09]}, {Values: [2023, 05, 07, 08]}, {Values: [2023, 05, 09, 08]}, {Values: [2023, 05, 09, 21]}, {Values: [2023, 05, 07, 23]}, {Values: [2023, 05, 10, 18]}, {Values: [2023, 05, 11, 05]}, {Values: [2023, 05, 07, 22]}, {Values: [2023, 05, 08, 00]}, {Values: [2023, 05, 08, 06]}, {Values: [2023, 05, 11, 01]}, {Values: [2023, 05, 07, 11]}, {Values: [2023, 05, 08, 03]}, {Values: [2023, 05, 09, 02]}, {Values: [2023, 05, 11, 07]}, {Values: [2023, 05, 07, 20]}, {Values: [2023, 05, 10, 00]}, {Values: [2023, 05, 11, 08]}, {Values: [2023, 05, 08, 14]}, {Values: [2023, 05, 11, 02]}, {Values: [2023, 05, 08, 09]}, {Values: [2023, 05, 10, 02]}, {Values: [2023, 05, 09, 13]}, {Values: [2023, 05, 11, 03]}, {Values: [2023, 05, 08, 22]}, {Values: [2023, 05, 10, 08]}, {Values: [2023, 05, 08, 19]}, {Values: [2023, 05, 09, 10]}, {Values: [2023, 05, 10, 14]}, {Values: [2023, 05, 09, 19]}, {Values: [2023, 05, 11, 04]}]
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.batchGetPartition(GlueHiveMetastore.java:939)
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.getPartitionsByNamesInternal(GlueHiveMetastore.java:893)
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.lambda$getPartitionsByNames$28(GlueHiveMetastore.java:883)
    at io.trino.plugin.hive.aws.AwsApiCallStats.call(AwsApiCallStats.java:37)
    at io.trino.plugin.hive.metastore.glue.GlueHiveMetastore.getPartitionsByNames(GlueHiveMetastore.java:883)
    at io.trino.plugin.hive.metastore.ForwardingHiveMetastore.getPartitionsByNames(ForwardingHiveMetastore.java:247)
    at io.trino.plugin.hive.aws.athena.PartitionProjectionMetastoreDecorator$PartitionProjectionMetastore.getPartitionsByNames(PartitionProjectionMetastoreDecorator.java:89)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore.loadPartitionsByNames(CachingHiveMetastore.java:732)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore$1.loadAll(CachingHiveMetastore.java:1076)
    at io.trino.collect.cache.EvictableCache$TokenCacheLoader.loadAll(EvictableCache.java:463)
    at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4073)
    at com.google.common.cache.LocalCache.getAll(LocalCache.java:4036)
    at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:4964)
    at io.trino.collect.cache.EvictableCache.getAll(EvictableCache.java:202)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore.getAll(CachingHiveMetastore.java:254)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore.getPartitionsByNames(CachingHiveMetastore.java:696)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore.loadPartitionsByNames(CachingHiveMetastore.java:732)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore$1.loadAll(CachingHiveMetastore.java:1076)
    at io.trino.collect.cache.EvictableCache$TokenCacheLoader.loadAll(EvictableCache.java:463)
    at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4073)
    at com.google.common.cache.LocalCache.getAll(LocalCache.java:4036)
    at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:4964)
    at io.trino.collect.cache.EvictableCache.getAll(EvictableCache.java:202)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore.getAll(CachingHiveMetastore.java:254)
    at io.trino.plugin.hive.metastore.cache.CachingHiveMetastore.getPartitionsByNames(CachingHiveMetastore.java:696)
    at io.trino.plugin.hive.HiveMetastoreClosure.lambda$getPartitionsByNames$4(HiveMetastoreClosure.java:238)
    at java.base/java.util.Optional.map(Optional.java:260)
    at io.trino.plugin.hive.HiveMetastoreClosure.getPartitionsByNames(HiveMetastoreClosure.java:238)
    at io.trino.plugin.hive.metastore.SemiTransactionalHiveMetastore.getPartitionsByNames(SemiTransactionalHiveMetastore.java:1002)
    at io.trino.plugin.hive.HiveSplitManager.lambda$getPartitionMetadata$6(HiveSplitManager.java:533)
    at com.google.common.collect.Iterators$6.transform(Iterators.java:829)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:52)
    at java.base/java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1856)
    at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:292)
    at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
    at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:169)
    at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:298)
    at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
    at io.trino.plugin.hive.ConcurrentLazyQueue.isEmpty(ConcurrentLazyQueue.java:34)
    at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:380)
    at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:297)
    at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
    at io.trino.$gen.Trino_403_amzn_0____20230606_135704_2.run(Unknown Source)
    at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)