TestHivePartitionsTable.testShowPartitionsFromHiveTableWithTooManyPartitions
failed in https://github.com/prestosql/presto/runs/1205417196
2020-10-04T13:17:25.2762652Z tests | 2020-10-04 19:02:25 INFO: FAILURE / io.prestosql.tests.hive.TestHivePartitionsTable.testShowPartitionsFromHiveTableWithTooManyPartitions (Groups: hive_partitioning) took 4.1 seconds
2020-10-04T13:17:25.2814457Z tests | 2020-10-04 19:02:25 SEVERE: Failure cause:
2020-10-04T13:17:25.2841985Z tests | io.prestosql.tempto.query.QueryExecutionException: java.sql.SQLException: Query failed (#20201004_131721_00842_pt8xg): Error committing write to Hive
2020-10-04T13:17:25.2844345Z tests | at io.prestosql.tempto.query.JdbcQueryExecutor.execute(JdbcQueryExecutor.java:114)
2020-10-04T13:17:25.2846441Z tests | at io.prestosql.tempto.query.JdbcQueryExecutor.executeQuery(JdbcQueryExecutor.java:82)
2020-10-04T13:17:25.2848427Z tests | at io.prestosql.tempto.query.QueryExecutor.query(QueryExecutor.java:57)
2020-10-04T13:17:25.2850467Z tests | at io.prestosql.tests.hive.TestHivePartitionsTable.lambda$createPartitions$4(TestHivePartitionsTable.java:171)
2020-10-04T13:17:25.2852496Z tests | at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
2020-10-04T13:17:25.2853830Z tests | at java.base/java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:593)
2020-10-04T13:17:25.2855806Z tests | at io.prestosql.tests.hive.TestHivePartitionsTable.createPartitions(TestHivePartitionsTable.java:168)
2020-10-04T13:17:25.2860649Z tests | at io.prestosql.tests.hive.TestHivePartitionsTable.testShowPartitionsFromHiveTableWithTooManyPartitions(TestHivePartitionsTable.java:135)
2020-10-04T13:17:25.2864788Z tests | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2020-10-04T13:17:25.2867051Z tests | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2020-10-04T13:17:25.2869979Z tests | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-10-04T13:17:25.2871800Z tests | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2020-10-04T13:17:25.2873552Z tests | at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
2020-10-04T13:17:25.2875345Z tests | at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
2020-10-04T13:17:25.2876760Z tests | at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
2020-10-04T13:17:25.2878192Z tests | at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
2020-10-04T13:17:25.2879962Z tests | at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
2020-10-04T13:17:25.2881716Z tests | at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
2020-10-04T13:17:25.2883415Z tests | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2020-10-04T13:17:25.2885438Z tests | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2020-10-04T13:17:25.2886677Z tests | at java.base/java.lang.Thread.run(Thread.java:834)
2020-10-04T13:17:25.2887530Z tests | Caused by: java.sql.SQLException: Query failed (#20201004_131721_00842_pt8xg): Error committing write to Hive
2020-10-04T13:17:25.2889358Z tests | at io.prestosql.jdbc.AbstractPrestoResultSet.resultsException(AbstractPrestoResultSet.java:1761)
2020-10-04T13:17:25.2891644Z tests | at io.prestosql.jdbc.PrestoResultSet$ResultsPageIterator.computeNext(PrestoResultSet.java:228)
2020-10-04T13:17:25.2893538Z tests | at io.prestosql.jdbc.PrestoResultSet$ResultsPageIterator.computeNext(PrestoResultSet.java:188)
2020-10-04T13:17:25.2895798Z tests | at io.prestosql.jdbc.$internal.guava.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
2020-10-04T13:17:25.2898171Z tests | at io.prestosql.jdbc.$internal.guava.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
2020-10-04T13:17:25.2899986Z tests | at java.base/java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
2020-10-04T13:17:25.2901951Z tests | at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
2020-10-04T13:17:25.2904351Z tests | at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
2020-10-04T13:17:25.2906729Z tests | at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
2020-10-04T13:17:25.2908940Z tests | at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
2020-10-04T13:17:25.2910514Z tests | at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
2020-10-04T13:17:25.2911849Z tests | at io.prestosql.jdbc.PrestoResultSet$AsyncIterator.lambda$new$0(PrestoResultSet.java:134)
2020-10-04T13:17:25.2993976Z tests | at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
2020-10-04T13:17:25.2994977Z tests | ... 3 more
2020-10-04T13:17:25.2995653Z tests | Caused by: io.prestosql.spi.PrestoException: Error committing write to Hive
2020-10-04T13:17:25.2997022Z tests | at io.prestosql.plugin.hive.orc.OrcFileWriter.commit(OrcFileWriter.java:155)
2020-10-04T13:17:25.2998544Z tests | at io.prestosql.plugin.hive.HiveWriter.commit(HiveWriter.java:86)
2020-10-04T13:17:25.3000214Z tests | at io.prestosql.plugin.hive.HivePageSink.doFinish(HivePageSink.java:190)
2020-10-04T13:17:25.3002499Z tests | at io.prestosql.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
2020-10-04T13:17:25.3004672Z tests | at io.prestosql.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:96)
2020-10-04T13:17:25.3006219Z tests | at io.prestosql.plugin.hive.HivePageSink.finish(HivePageSink.java:181)
2020-10-04T13:17:25.3009043Z tests | at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSink.finish(ClassLoaderSafeConnectorPageSink.java:77)
2020-10-04T13:17:25.3011991Z tests | at io.prestosql.operator.TableWriterOperator.finish(TableWriterOperator.java:208)
2020-10-04T13:17:25.3013613Z tests | at io.prestosql.operator.Driver.processInternal(Driver.java:397)
2020-10-04T13:17:25.3014873Z tests | at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
2020-10-04T13:17:25.3016054Z tests | at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
2020-10-04T13:17:25.3017240Z tests | at io.prestosql.operator.Driver.processFor(Driver.java:276)
2020-10-04T13:17:25.3018776Z tests | at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
2020-10-04T13:17:25.3021073Z tests | at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
2020-10-04T13:17:25.3023252Z tests | at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
2020-10-04T13:17:25.3024393Z tests | at io.prestosql.$gen.Presto_84cbc9e____20201004_131106_2.run(Unknown Source)
2020-10-04T13:17:25.3025680Z tests | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2020-10-04T13:17:25.3027471Z tests | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2020-10-04T13:17:25.3028619Z tests | at java.base/java.lang.Thread.run(Thread.java:834)
2020-10-04T13:17:25.3031162Z tests | Caused by: org.apache.hadoop.ipc.RemoteException: File /tmp/presto-hive/ddbad1e4-d67c-49bf-9b39-cd862cfe5cf2/part_col=3/20201004_131721_00842_pt8xg_a2bc03d2-4b6c-49a0-a481-1d61d15e62d5 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
2020-10-04T13:17:25.3034568Z tests | at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1719)
2020-10-04T13:17:25.3037791Z tests | at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3508)
2020-10-04T13:17:25.3040562Z tests | at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
2020-10-04T13:17:25.3044980Z tests | at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
2020-10-04T13:17:25.3050723Z tests | at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
2020-10-04T13:17:25.3055612Z tests | at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
2020-10-04T13:17:25.3058809Z tests | at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
2020-10-04T13:17:25.3060116Z tests | at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
2020-10-04T13:17:25.3060951Z tests | at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
2020-10-04T13:17:25.3061918Z tests | at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
2020-10-04T13:17:25.3063033Z tests | at java.security.AccessController.doPrivileged(Native Method)
2020-10-04T13:17:25.3064328Z tests | at javax.security.auth.Subject.doAs(Subject.java:422)
2020-10-04T13:17:25.3065937Z tests | at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
2020-10-04T13:17:25.3067545Z tests | at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
2020-10-04T13:17:25.3068170Z tests |
2020-10-04T13:17:25.3069066Z tests | at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
2020-10-04T13:17:25.3070300Z tests | at org.apache.hadoop.ipc.Client.call(Client.java:1457)
2020-10-04T13:17:25.3071324Z tests | at org.apache.hadoop.ipc.Client.call(Client.java:1367)
2020-10-04T13:17:25.3072679Z tests | at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
2020-10-04T13:17:25.3095596Z tests | at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
2020-10-04T13:17:25.3096863Z tests | at com.sun.proxy.$Proxy241.addBlock(Unknown Source)
2020-10-04T13:17:25.3099220Z tests | at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:513)
2020-10-04T13:17:25.3102226Z tests | at jdk.internal.reflect.GeneratedMethodAccessor411.invoke(Unknown Source)
2020-10-04T13:17:25.3104494Z tests | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-10-04T13:17:25.3106385Z tests | at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2020-10-04T13:17:25.3108210Z tests | at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
2020-10-04T13:17:25.3110631Z tests | at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
2020-10-04T13:17:25.3112734Z tests | at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
2020-10-04T13:17:25.3114910Z tests | at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
2020-10-04T13:17:25.3117136Z tests | at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
2020-10-04T13:17:25.3118629Z tests | at com.sun.proxy.$Proxy242.addBlock(Unknown Source)
2020-10-04T13:17:25.3119792Z tests | at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1081)
2020-10-04T13:17:25.3121653Z tests | at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1865)
2020-10-04T13:17:25.3123766Z tests | at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1668)
2020-10-04T13:17:25.3125482Z tests | at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
2020-10-04T13:17:25.3126222Z tests |
I marked two example failures with @Flaky
-- https://github.com/prestosql/presto/pull/5811 and https://github.com/prestosql/presto/pull/5829.
However, since the problem seems only loosely related to the actual tests, we may need a more generic approach... unless, of course, we learn how to fix the underlying problem.
@MiguelWeezardo would it be possible to add a retry to onHive(), so that these are automatically retried without the need for @Flaky on so many tests? (Many are still not covered yet.)
would it be possible to add a retry to onHive()
That should not be enabled by default. There are plenty of queries that should not be retried, or that are simply not retryable.
How about:
onHive()
        .retryWhen(ex -> ex.getMessage().contains("Hive is flaky again..."))
        .execute("select 1 from \"$dummy\"");
?
Another option is wrapping query executors like:

// Imports assumed for this sketch (Guava, Failsafe, airlift logging, tempto),
// hosted in a QueryExecutors utility class:
import com.google.common.base.Throwables;
import io.airlift.log.Logger;
import io.prestosql.tempto.query.QueryExecutionException;
import io.prestosql.tempto.query.QueryExecutor;
import io.prestosql.tempto.query.QueryResult;
import net.jodah.failsafe.Failsafe;
import net.jodah.failsafe.RetryPolicy;

import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.sql.Connection;
import java.time.temporal.ChronoUnit;

private static final Logger log = Logger.get(QueryExecutors.class);

/**
 * This method should only be used with stateless JDBC connections, i.e. Presto.
 * Otherwise the connection will surely be broken.
 */
public static QueryExecutor retryOnNetworkIssues(QueryExecutor queryExecutor)
{
    RetryPolicy<QueryResult> queryRetryPolicy = new RetryPolicy<QueryResult>()
            .withBackoff(1, 60, ChronoUnit.SECONDS)
            .withMaxRetries(10)
            .handleIf(QueryExecutors::isNetworkIssue)
            .onRetry(event -> log.warn("Query failed on attempt %s, will retry. Exception: %s",
                    event.getAttemptCount(), event.getLastFailure().getMessage()));
    return new QueryExecutor()
    {
        @Override
        public QueryResult executeQuery(String sql, QueryParam... params)
                throws QueryExecutionException
        {
            // Re-run the query under the retry policy; safe only for stateless executors.
            return Failsafe.with(queryRetryPolicy)
                    .get(() -> queryExecutor.executeQuery(sql, params));
        }

        @Override
        public Connection getConnection()
        {
            return queryExecutor.getConnection();
        }

        @Override
        public void close()
        {
            queryExecutor.close();
        }
    };
}

private static boolean isNetworkIssue(Throwable throwable)
{
    // Treat socket-level failures anywhere in the causal chain as transient.
    return Throwables.getCausalChain(throwable).stream()
            .anyMatch(exception -> exception instanceof SocketException
                    || exception instanceof SocketTimeoutException);
}
That should not be enabled by default. There are plenty of queries that should not be retried, or that are simply not retryable.
@kokosing I meant to only retry those that fail with https://github.com/trinodb/trino/blob/cea58426e0a8f4cfe330bc805f88d12ab2d3fc4e/testing/trino-product-tests/src/main/java/io/trino/tests/hive/HiveProductTest.java#L26-L28
Plus, we can match on the query being run, e.g. it's not safe to retry INSERT statements.
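A minimal sketch of that combination (the pattern only approximates the HiveProductTest constants linked above, and shouldRetry is a hypothetical helper):

import com.google.common.base.Throwables;

import java.util.regex.Pattern;

// Illustrative only: retry solely when the failure matches the known flaky Hive
// error, and never retry statements that are unsafe to re-run, such as INSERT.
private static final Pattern ERROR_COMMITTING_WRITE_TO_HIVE = Pattern.compile(
        "Error committing write to Hive(?s:.*)could only be replicated to 0 nodes");

private static boolean shouldRetry(String sql, Throwable failure)
{
    boolean isInsert = sql.trim().regionMatches(true, 0, "INSERT", 0, 6);
    return !isInsert
            && ERROR_COMMITTING_WRITE_TO_HIVE.matcher(Throwables.getStackTraceAsString(failure)).find();
}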
How about:
onHive().retryWhen(ex -> ex.getMessage().contains("Hive is flaky again...")).execute("select 1 from \"$dummy\"");
If I were to change each test's code, I'd just continue with the current approach:
@Flaky(issue = ERROR_COMMITTING_WRITE_TO_HIVE_ISSUE, match = ERROR_COMMITTING_WRITE_TO_HIVE_MATCH)
Should be fixed by https://github.com/trinodb/trino/pull/6601
Hopefully solved by https://github.com/trinodb/trino/pull/6601, thanks @MiguelWeezardo.
If that turns out not to be sufficient, we may need a more focused approach in those cases, so let me close the issue.
Well, I saw what looks like this in one of my PRs:
https://github.com/trinodb/trino/runs/1911881714?check_suite_focus=true#step:6:12515
Well, I saw what looks like this in one of my PRs:
https://github.com/trinodb/trino/runs/1911881714?check_suite_focus=true#step:6:12515
The Failsafe code is on the failure stack, but it looks like no retries were attempted. This might mean the regular expression failed to match the exception message.
Are there any issues with the message regex?
Are there any issues with the message regex?
The regex is multiline, and focuses on messages from two different exceptions in the causal chain, which may be why it didn't match.
HiveProductTest#isErrorCommittingToHive matches it against the getMessage() string of each exception in the causal chain. I assumed getMessage() includes the messages of its causes, just as we see them in the logs, but this was probably wrong.
I guess this code should generate the entire recursive stack trace as a string and match against that. Throwables#getStackTraceAsString might be useful here.
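For instance, a minimal sketch of that idea, assuming Guava is available (the helper name is hypothetical):

import com.google.common.base.Throwables;

import java.util.regex.Pattern;

// Render the whole causal chain (messages plus stack frames) into one string,
// so a multiline pattern can span several exceptions, then search within it.
private static boolean stackTraceMatches(Throwable throwable, Pattern pattern)
{
    return pattern.matcher(Throwables.getStackTraceAsString(throwable)).find();
}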
The regex is multiline, and focuses on messages from two different exceptions in the causal chain, ...
That's why it has (?s:.*) and not just .*.
The regex is multiline, and focuses on messages from two different exceptions in the causal chain, ...
That's why it has (?s:.*) and not just .*.
I thought it was a simple non-capturing group, and failed to notice the s flag until now.
I copied the matching from FlakyTestRetryAnalizer in #6944.
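To make the (?s:.*) behavior concrete, a small standalone demo (the strings are illustrative):

import java.util.regex.Pattern;

public class DotallDemo
{
    public static void main(String[] args)
    {
        String stackTrace = "Error committing write to Hive\nCaused by: ... could only be replicated to 0 nodes ...";
        // Plain .* stops at line breaks, so it cannot bridge two exception messages.
        System.out.println(Pattern.compile("Hive.*replicated").matcher(stackTrace).find()); // false
        // (?s:...) enables DOTALL inside the group only, letting .* cross newlines.
        System.out.println(Pattern.compile("Hive(?s:.*)replicated").matcher(stackTrace).find()); // true
    }
}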
Happens also to Iceberg+Spark tests when trying to write to HDFS: https://github.com/trinodb/trino/runs/5280873825?check_suite_focus=true
We have retries in place and no further work is planned here, so I am closing the issue.
Please reopen if you see this again. Be sure to provide the failure link AND stack traces. GitHub Actions logs are not kept forever.
https://github.com/prestosql/presto/pull/4846/checks?check_run_id=1017285394