Open gaurav8297 opened 1 year ago
let's merge this with https://github.com/trinodb/trino/issues/14516 i don't think there is a reason to treat them separate for now
I think it's actually a product bug. Some rows are missing when writer count is increased with MERGE.
https://github.com/trinodb/trino/actions/runs/4115290901/jobs/7108629638
TestHiveMerge > testMergeMultipleOperationsUnbucketedPartitioned [groups: hive_transactional]
Could not find rows:
[joe_32, 5000, 85000, jen_32, 32 Poe Ct]
[joe_33, 5000, 85000, jen_33, 33 Poe Ct]
[joe_34, 5000, 85000, jen_34, 34 Poe Ct]
[joe_35, 5000, 85000, jen_35, 35 Poe Ct]
[joe_36, 5000, 85000, jen_36, 36 Poe Ct]
[joe_37, 5000, 85000, jen_37, 37 Poe Ct]
[joe_38, 5000, 85000, jen_38, 38 Poe Ct]
[joe_39, 5000, 85000, jen_39, 39 Poe Ct]
[joe_40, 5000, 85000, jen_40, 40 Poe Ct]
[joe_41, 5000, 85000, jen_41, 41 Poe Ct]
[joe_42, 5000, 85000, jen_42, 42 Poe Ct]
[joe_43, 5000, 85000, jen_43, 43 Poe Ct]
[joe_44, 5000, 85000, jen_44, 44 Poe Ct]
[joe_45, 5000, 85000, jen_45, 45 Poe Ct]
[joe_46, 5000, 85000, jen_46, 46 Poe Ct]
[joe_47, 5000, 85000, jen_47, 47 Poe Ct]
actual rows:
[jack_46, 4000, 74000, jan_46, 46 Poe Ct]
[joe_10, 5000, 85000, jen_10, 10 Poe Ct]
[joe_36, 5000, 60000, jen_36, 36 Poe Ct]
[joe_21, 3000, 60000, jill_21, 21 Eop Ct]
[joe_44, 5000, 60000, jen_44, 44 Poe Ct]
[jack_47, 4000, 74000, jan_47, 47 Poe Ct]
[jack_39, 4000, 74000, jan_39, 39 Poe Ct]
[joe_46, 5000, 60000, jen_46, 46 Poe Ct]
[jack_42, 4000, 74000, jan_42, 42 Poe Ct]
[joe_33, 5000, 60000, jen_33, 33 Poe Ct]
[joe_29, 3000, 60000, jill_29, 29 Eop Ct]
[joe_41, 5000, 60000, jen_41, 41 Poe Ct]
[joe_15, 5000, 85000, jen_15, 15 Poe Ct]
[jack_32, 4000, 74000, jan_32, 32 Poe Ct]
[joe_16, 3000, 60000, jill_16, 16 Eop Ct]
[joe_27, 3000, 60000, jill_27, 27 Eop Ct]
[joe_22, 3000, 60000, jill_22, 22 Eop Ct]
[joe_30, 3000, 60000, jill_30, 30 Eop Ct]
[joe_45, 5000, 60000, jen_45, 45 Poe Ct]
[joe_42, 5000, 60000, jen_42, 42 Poe Ct]
[joe_18, 3000, 60000, jill_18, 18 Eop Ct]
[joe_25, 3000, 60000, jill_25, 25 Eop Ct]
[joe_13, 5000, 85000, jen_13, 13 Poe Ct]
[joe_4, 5000, 85000, jen_4, 4 Poe Ct]
[joe_19, 3000, 60000, jill_19, 19 Eop Ct]
[joe_14, 5000, 85000, jen_14, 14 Poe Ct]
[joe_34, 5000, 60000, jen_34, 34 Poe Ct]
[joe_40, 5000, 60000, jen_40, 40 Poe Ct]
[jack_45, 4000, 74000, jan_45, 45 Poe Ct]
[jack_35, 4000, 74000, jan_35, 35 Poe Ct]
[joe_24, 3000, 60000, jill_24, 24 Eop Ct]
[joe_20, 3000, 60000, jill_20, 20 Eop Ct]
[jack_38, 4000, 74000, jan_38, 38 Poe Ct]
[joe_47, 5000, 60000, jen_47, 47 Poe Ct]
[jack_43, 4000, 74000, jan_43, 43 Poe Ct]
[jack_44, 4000, 74000, jan_44, 44 Poe Ct]
[joe_43, 5000, 60000, jen_43, 43 Poe Ct]
[joe_39, 5000, 60000, jen_39, 39 Poe Ct]
[joe_11, 5000, 85000, jen_11, 11 Poe Ct]
[joe_1, 5000, 85000, jen_1, 1 Poe Ct]
[joe_23, 3000, 60000, jill_23, 23 Eop Ct]
[jack_34, 4000, 74000, jan_34, 34 Poe Ct]
[joe_35, 5000, 60000, jen_35, 35 Poe Ct]
[joe_38, 5000, 60000, jen_38, 38 Poe Ct]
[joe_8, 5000, 85000, jen_8, 8 Poe Ct]
[joe_6, 5000, 85000, jen_6, 6 Poe Ct]
[jack_41, 4000, 74000, jan_41, 41 Poe Ct]
[jack_40, 4000, 74000, jan_40, 40 Poe Ct]
[joe_2, 5000, 85000, jen_2, 2 Poe Ct]
[joe_31, 3000, 60000, jill_31, 31 Eop Ct]
[joe_5, 5000, 85000, jen_5, 5 Poe Ct]
[joe_37, 5000, 60000, jen_37, 37 Poe Ct]
[joe_26, 3000, 60000, jill_26, 26 Eop Ct]
[jack_36, 4000, 74000, jan_36, 36 Poe Ct]
[joe_12, 5000, 85000, jen_12, 12 Poe Ct]
[joe_7, 5000, 85000, jen_7, 7 Poe Ct]
[joe_32, 5000, 60000, jen_32, 32 Poe Ct]
[joe_28, 3000, 60000, jill_28, 28 Eop Ct]
[jack_37, 4000, 74000, jan_37, 37 Poe Ct]
[joe_3, 5000, 85000, jen_3, 3 Poe Ct]
[joe_9, 5000, 85000, jen_9, 9 Poe Ct]
[jack_33, 4000, 74000, jan_33, 33 Poe Ct]
[joe_17, 3000, 60000, jill_17, 17 Eop Ct]
cc @djsstarburst
Same error, but different values
https://github.com/trinodb/trino/actions/runs/4143381507/jobs/7169852819
2023-02-10T18:47:18.0284529Z tests | 2023-02-11 00:32:18 INFO: FAILURE / io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned (Groups: hive_transactional) took 10.4 seconds
2023-02-10T18:47:18.0333685Z tests | 2023-02-11 00:32:18 SEVERE: Failure cause:
2023-02-10T18:47:18.0334056Z tests | java.lang.AssertionError: Could not find rows:
2023-02-10T18:47:18.0334416Z tests | [joe_16, 3000, 83000, jill_16, 16 Eop Ct]
2023-02-10T18:47:18.0334694Z tests | [joe_17, 3000, 83000, jill_17, 17 Eop Ct]
2023-02-10T18:47:18.0334984Z tests | [joe_18, 3000, 83000, jill_18, 18 Eop Ct]
2023-02-10T18:47:18.0335250Z tests | [joe_19, 3000, 83000, jill_19, 19 Eop Ct]
2023-02-10T18:47:18.0335514Z tests | [joe_20, 3000, 83000, jill_20, 20 Eop Ct]
2023-02-10T18:47:18.0335771Z tests | [joe_21, 3000, 83000, jill_21, 21 Eop Ct]
2023-02-10T18:47:18.0336019Z tests | [joe_22, 3000, 83000, jill_22, 22 Eop Ct]
2023-02-10T18:47:18.0336268Z tests | [joe_23, 3000, 83000, jill_23, 23 Eop Ct]
2023-02-10T18:47:18.0336528Z tests | [joe_24, 3000, 83000, jill_24, 24 Eop Ct]
2023-02-10T18:47:18.0336777Z tests | [joe_25, 3000, 83000, jill_25, 25 Eop Ct]
2023-02-10T18:47:18.0337027Z tests | [joe_26, 3000, 83000, jill_26, 26 Eop Ct]
2023-02-10T18:47:18.0337278Z tests | [joe_28, 3000, 83000, jill_28, 28 Eop Ct]
2023-02-10T18:47:18.0337526Z tests | [joe_29, 3000, 83000, jill_29, 29 Eop Ct]
2023-02-10T18:47:18.0337948Z tests | [joe_30, 3000, 83000, jill_30, 30 Eop Ct]
2023-02-10T18:47:18.0338210Z tests | [joe_31, 3000, 83000, jill_31, 31 Eop Ct]
2023-02-10T18:47:18.0338440Z tests |
2023-02-10T18:47:18.0338665Z tests | actual rows:
2023-02-10T18:47:18.0338940Z tests | [joe_1, 1000, 91000, jan_1, 1 Poe Ct]
2023-02-10T18:47:18.0339208Z tests | [joe_15, 1000, 91000, jan_15, 15 Poe Ct]
2023-02-10T18:47:18.0339482Z tests | [joe_27, 3000, 83000, jill_27, 27 Eop Ct]
2023-02-10T18:47:18.0339746Z tests | [joe_14, 1000, 91000, jan_14, 14 Poe Ct]
2023-02-10T18:47:18.0340023Z tests | [joe_10, 1000, 91000, jan_10, 10 Poe Ct]
2023-02-10T18:47:18.0340294Z tests | [joe_4, 1000, 91000, jan_4, 4 Poe Ct]
2023-02-10T18:47:18.0340562Z tests | [joe_27, 3000, 83000, jill_27, 27 Eop Ct]
2023-02-10T18:47:18.0340823Z tests | [joe_8, 1000, 91000, jan_8, 8 Poe Ct]
2023-02-10T18:47:18.0341092Z tests | [joe_6, 1000, 91000, jan_6, 6 Poe Ct]
2023-02-10T18:47:18.0341348Z tests | [joe_5, 1000, 91000, jan_5, 5 Poe Ct]
2023-02-10T18:47:18.0341594Z tests | [joe_7, 1000, 91000, jan_7, 7 Poe Ct]
2023-02-10T18:47:18.0341837Z tests | [joe_3, 1000, 91000, jan_3, 3 Poe Ct]
2023-02-10T18:47:18.0342093Z tests | [joe_11, 1000, 91000, jan_11, 11 Poe Ct]
2023-02-10T18:47:18.0342352Z tests | [joe_12, 1000, 91000, jan_12, 12 Poe Ct]
2023-02-10T18:47:18.0342602Z tests | [joe_13, 1000, 91000, jan_13, 13 Poe Ct]
2023-02-10T18:47:18.0342858Z tests | [joe_2, 1000, 91000, jan_2, 2 Poe Ct]
2023-02-10T18:47:18.0343123Z tests | [joe_9, 1000, 91000, jan_9, 9 Poe Ct]
2023-02-10T18:47:18.0343634Z tests | at io.trino.tests.product.hive.TestHiveMerge.verifyOnTrinoAndHiveFromQueryResults(TestHiveMerge.java:253)
2023-02-10T18:47:18.0344341Z tests | at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsInternal(TestHiveMerge.java:218)
2023-02-10T18:47:18.0345040Z tests | at io.trino.tests.product.hive.TestHiveMerge.lambda$testMergeMultipleOperationsUnbucketedPartitioned$9(TestHiveMerge.java:184)
2023-02-10T18:47:18.0345672Z tests | at io.trino.tests.product.hive.TestHiveMerge.withTemporaryTable(TestHiveMerge.java:764)
2023-02-10T18:47:18.0346363Z tests | at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned(TestHiveMerge.java:182)
2023-02-10T18:47:18.0347097Z tests | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2023-02-10T18:47:18.0347658Z tests | at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
2023-02-10T18:47:18.0348274Z tests | at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2023-02-10T18:47:18.0348789Z tests | at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2023-02-10T18:47:18.0349285Z tests | at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
2023-02-10T18:47:18.0349819Z tests | at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:54)
2023-02-10T18:47:18.0350315Z tests | at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
2023-02-10T18:47:18.0350797Z tests | at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
2023-02-10T18:47:18.0351251Z tests | at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2023-02-10T18:47:18.0351740Z tests | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
2023-02-10T18:47:18.0352273Z tests | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
2023-02-10T18:47:18.0352753Z tests | at java.base/java.lang.Thread.run(Thread.java:833)
cc @djsstarburst @electrum this looks like correctness
https://github.com/trinodb/trino/pull/16674/checks?check_run_id=12227282972
java.lang.AssertionError: Could not find rows:
[joe_16, 3000, 83000, jill_16, 16 Eop Ct]
[joe_17, 3000, 83000, jill_17, 17 Eop Ct]
[joe_18, 3000, 83000, jill_18, 18 Eop Ct]
[joe_19, 3000, 83000, jill_19, 19 Eop Ct]
[joe_20, 3000, 83000, jill_20, 20 Eop Ct]
[joe_21, 3000, 83000, jill_21, 21 Eop Ct]
[joe_22, 3000, 83000, jill_22, 22 Eop Ct]
[joe_23, 3000, 83000, jill_23, 23 Eop Ct]
[joe_24, 3000, 83000, jill_24, 24 Eop Ct]
[joe_25, 3000, 83000, jill_25, 25 Eop Ct]
[joe_26, 3000, 83000, jill_26, 26 Eop Ct]
[joe_27, 3000, 83000, jill_27, 27 Eop Ct]
[joe_28, 3000, 83000, jill_28, 28 Eop Ct]
[joe_29, 3000, 83000, jill_29, 29 Eop Ct]
[joe_31, 3000, 83000, jill_31, 31 Eop Ct]
actual rows:
[joe_14, 1000, 91000, jan_14, 14 Poe Ct]
[joe_8, 1000, 91000, jan_8, 8 Poe Ct]
[joe_10, 1000, 91000, jan_10, 10 Poe Ct]
[joe_6, 1000, 91000, jan_6, 6 Poe Ct]
[joe_9, 1000, 91000, jan_9, 9 Poe Ct]
[joe_11, 1000, 91000, jan_11, 11 Poe Ct]
[joe_3, 1000, 91000, jan_3, 3 Poe Ct]
[joe_1, 1000, 91000, jan_1, 1 Poe Ct]
[joe_7, 1000, 91000, jan_7, 7 Poe Ct]
[joe_30, 3000, 83000, jill_30, 30 Eop Ct]
[joe_2, 1000, 91000, jan_2, 2 Poe Ct]
[joe_15, 1000, 91000, jan_15, 15 Poe Ct]
[joe_4, 1000, 91000, jan_4, 4 Poe Ct]
[joe_5, 1000, 91000, jan_5, 5 Poe Ct]
[joe_13, 1000, 91000, jan_13, 13 Poe Ct]
[joe_12, 1000, 91000, jan_12, 12 Poe Ct]
at io.trino.tests.product.hive.TestHiveMerge.verifyOnTrinoAndHiveFromQueryResults(TestHiveMerge.java:253)
at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsInternal(TestHiveMerge.java:218)
at io.trino.tests.product.hive.TestHiveMerge.lambda$testMergeMultipleOperationsUnbucketedPartitioned$9(TestHiveMerge.java:184)
at io.trino.tests.product.hive.TestHiveMerge.withTemporaryTable(TestHiveMerge.java:764)
at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned(TestHiveMerge.java:182)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:54)
at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
I've just run TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned
25 times on my laptop using the usual product test runner. The test succeeded all 25 times. How frequent are the CI failures? Is there some reason it would fail in CI but not on my laptop?
How frequent are the CI failures?
According to https://nineinchnick.github.io/trino-cicd/reports/flaky/#flaky-tests-in-last-30-days: TestHiveMerge
failed once in past 30 days.
https://github.com/trinodb/trino/pull/16796/checks?check_run_id=12391653065
2023-03-30 14:13:58 INFO: FAILURE / io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned (Groups: hive_transactional) took 33.3 seconds
tests | 2023-03-30 14:13:58 SEVERE: Failure cause:
tests | java.lang.AssertionError: Could not find rows:
tests | [joe_32, 5000, 85000, jen_32, 32 Poe Ct]
tests | [joe_33, 5000, 85000, jen_33, 33 Poe Ct]
tests | [joe_34, 5000, 85000, jen_34, 34 Poe Ct]
tests | [joe_35, 5000, 85000, jen_35, 35 Poe Ct]
tests | [joe_36, 5000, 85000, jen_36, 36 Poe Ct]
tests | [joe_37, 5000, 85000, jen_37, 37 Poe Ct]
tests | [joe_38, 5000, 85000, jen_38, 38 Poe Ct]
tests | [joe_39, 5000, 85000, jen_39, 39 Poe Ct]
tests | [joe_40, 5000, 85000, jen_40, 40 Poe Ct]
tests | [joe_41, 5000, 85000, jen_41, 41 Poe Ct]
tests | [joe_42, 5000, 85000, jen_42, 42 Poe Ct]
tests | [joe_43, 5000, 85000, jen_43, 43 Poe Ct]
tests | [joe_44, 5000, 85000, jen_44, 44 Poe Ct]
tests | [joe_45, 5000, 85000, jen_45, 45 Poe Ct]
tests | [joe_46, 5000, 85000, jen_46, 46 Poe Ct]
tests | [joe_47, 5000, 85000, jen_47, 47 Poe Ct]
In all cases I can find in which TestHiveMerge. testMergeMultipleOperationsUnbucketedPartitioned
failed, the failure was that HiveConnector.finishMerge
(indirectly) calls HDFS to perform the transaction writes, and HDFS could not write a file at commit time. Often the file is the _orc_acid_version
file for the bucket, but there are other examples of write failures for bucket files. It ends up as a "correctness" issue because the MERGE operation is not aborted by the exception thrown when trying to write the file. I see where the exceptions are being thrown, and there is no intervening exception catcher, so I don't understand why the MERGE operation succeeded.
In all cases HDFS class org.apache.hadoop.hdfs.server.blockmanagemen.BlockManager.chooseTarget4NewBlock
is throwing the exception, saying that no target nodes were found to handle the write request:
if (blockType == BlockType.CONTIGUOUS && targets.length < minReplication) {
throw new IOException(String.format(errorMessage, src,
targets.length, minReplication, "minReplication nodes",
getDatanodeManager().getNetworkTopology().getNumOfLeaves(),
(excludedNodes == null? "no": excludedNodes.size())));
}
io.trino.spi.TrinoException: Exception writing _orc_acid_version file for delta directory: hdfs://hadoop-master:9000/user/hive/warehouse/test_merge_multiple_true_none_6us5hiuuif/address=10 Poe Ct/customer=joe_10/delete_delta_0000004_0000004_0000
at io.trino.plugin.hive.HiveMetadata.createOrcAcidVersionFile(HiveMetadata.java:2251)
at io.trino.plugin.hive.HiveMetadata.lambda$finishMerge$58(HiveMetadata.java:1956)
at java.base/java.util.Optional.ifPresent(Optional.java:178)
...
Caused by: org.apache.hadoop.ipc.RemoteException: File /user/hive/warehouse/test_merge_multiple_true_none_6us5hiuuif/address=10 Poe Ct/customer=joe_10/delete_delta_0000004_0000004_0000/_orc_acid_version could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2121)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
...
It seems clear that the exception should be bubble up to the top and cause the transaction to fail instead of causing a correctness problem.
https://github.com/trinodb/trino/actions/runs/5419044626/jobs/9852081055?pr=15821
tests | 2023-06-30 11:16:12 INFO: FAILURE / io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned (Groups: hive_transactional) took 32.2 seconds
TestHiveMerge > testMergeMultipleOperationsUnbucketedPartitioned [groups: hive_transactional]
tests are failing when we are trying to increase the default value of task_writer_count to 32 for partitioned writes.PR: https://github.com/trinodb/trino/pull/14553
CI: https://github.com/trinodb/trino/pull/14553/checks?check_run_id=8882217060
cc @electrum @djsstarburst