trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
9.88k stars 2.86k forks source link

SQL Merge (on partitioned table) in Hive drops rows when writer count is increased from 1 #14637

Open gaurav8297 opened 1 year ago

gaurav8297 commented 1 year ago

TestHiveMerge > testMergeMultipleOperationsUnbucketedPartitioned [groups: hive_transactional] tests are failing when we are trying to increase the default value of task_writer_count to 32 for partitioned writes.

PR: https://github.com/trinodb/trino/pull/14553

CI: https://github.com/trinodb/trino/pull/14553/checks?check_run_id=8882217060

java.lang.AssertionError: Could not find rows:
[joe_32, 5000, 85000, jen_32, 32 Poe Ct]
[joe_33, 5000, 85000, jen_33, 33 Poe Ct]
[joe_34, 5000, 85000, jen_34, 34 Poe Ct]
[joe_35, 5000, 85000, jen_35, 35 Poe Ct]
[joe_36, 5000, 85000, jen_36, 36 Poe Ct]
[joe_37, 5000, 85000, jen_37, 37 Poe Ct]
[joe_38, 5000, 85000, jen_38, 38 Poe Ct]
[joe_39, 5000, 85000, jen_39, 39 Poe Ct]
[joe_40, 5000, 85000, jen_40, 40 Poe Ct]
[joe_41, 5000, 85000, jen_41, 41 Poe Ct]
[joe_42, 5000, 85000, jen_42, 42 Poe Ct]
[joe_43, 5000, 85000, jen_43, 43 Poe Ct]
[joe_44, 5000, 85000, jen_44, 44 Poe Ct]
[joe_45, 5000, 85000, jen_45, 45 Poe Ct]
[joe_46, 5000, 85000, jen_46, 46 Poe Ct]
[joe_47, 5000, 85000, jen_47, 47 Poe Ct]

actual rows:
[joe_5, 5000, 85000, jen_5, 5 Poe Ct]
[joe_4, 5000, 85000, jen_4, 4 Poe Ct]
[joe_43, 5000, 60000, jen_43, 43 Poe Ct]
[joe_39, 5000, 60000, jen_39, 39 Poe Ct]
[joe_42, 5000, 60000, jen_42, 42 Poe Ct]
[joe_26, 3000, 60000, jill_26, 26 Eop Ct]
[joe_32, 5000, 60000, jen_32, 32 Poe Ct]
[joe_9, 5000, 85000, jen_9, 9 Poe Ct]
[joe_3, 5000, 85000, jen_3, 3 Poe Ct]
[joe_19, 3000, 60000, jill_19, 19 Eop Ct]
[jack_32, 4000, 74000, jan_32, 32 Poe Ct]
[jack_41, 4000, 74000, jan_41, 41 Poe Ct]
[joe_21, 3000, 60000, jill_21, 21 Eop Ct]
[joe_44, 5000, 60000, jen_44, 44 Poe Ct]
[joe_31, 3000, 60000, jill_31, 31 Eop Ct]
[jack_37, 4000, 74000, jan_37, 37 Poe Ct]
[joe_27, 3000, 60000, jill_27, 27 Eop Ct]
[joe_47, 5000, 60000, jen_47, 47 Poe Ct]
[jack_40, 4000, 74000, jan_40, 40 Poe Ct]
[joe_37, 5000, 60000, jen_37, 37 Poe Ct]
[joe_12, 5000, 85000, jen_12, 12 Poe Ct]
[joe_13, 5000, 85000, jen_13, 13 Poe Ct]
[joe_29, 3000, 60000, jill_29, 29 Eop Ct]
[joe_22, 3000, 60000, jill_22, 22 Eop Ct]
[jack_35, 4000, 74000, jan_35, 35 Poe Ct]
[jack_33, 4000, 74000, jan_33, 33 Poe Ct]
[joe_15, 5000, 85000, jen_15, 15 Poe Ct]
[joe_8, 5000, 85000, jen_8, 8 Poe Ct]
[joe_38, 5000, 60000, jen_38, 38 Poe Ct]
[joe_14, 5000, 85000, jen_14, 14 Poe Ct]
[joe_17, 3000, 60000, jill_17, 17 Eop Ct]
[jack_36, 4000, 74000, jan_36, 36 Poe Ct]
[joe_35, 5000, 60000, jen_35, 35 Poe Ct]
[jack_43, 4000, 74000, jan_43, 43 Poe Ct]
[joe_2, 5000, 85000, jen_2, 2 Poe Ct]
[joe_36, 5000, 60000, jen_36, 36 Poe Ct]
[jack_39, 4000, 74000, jan_39, 39 Poe Ct]
[joe_18, 3000, 60000, jill_18, 18 Eop Ct]
[joe_46, 5000, 60000, jen_46, 46 Poe Ct]
[joe_25, 3000, 60000, jill_25, 25 Eop Ct]
[joe_20, 3000, 60000, jill_20, 20 Eop Ct]
[jack_38, 4000, 74000, jan_38, 38 Poe Ct]
[joe_24, 3000, 60000, jill_24, 24 Eop Ct]
[jack_42, 4000, 74000, jan_42, 42 Poe Ct]
[joe_16, 3000, 60000, jill_16, 16 Eop Ct]
[jack_44, 4000, 74000, jan_44, 44 Poe Ct]
[joe_1, 5000, 85000, jen_1, 1 Poe Ct]
[joe_6, 5000, 85000, jen_6, 6 Poe Ct]
[jack_45, 4000, 74000, jan_45, 45 Poe Ct]
[joe_45, 5000, 60000, jen_45, 45 Poe Ct]
[jack_46, 4000, 74000, jan_46, 46 Poe Ct]
[joe_11, 5000, 85000, jen_11, 11 Poe Ct]
[jack_47, 4000, 74000, jan_47, 47 Poe Ct]
[joe_34, 5000, 60000, jen_34, 34 Poe Ct]
[joe_40, 5000, 60000, jen_40, 40 Poe Ct]
[joe_7, 5000, 85000, jen_7, 7 Poe Ct]
[joe_23, 3000, 60000, jill_23, 23 Eop Ct]
[joe_30, 3000, 60000, jill_30, 30 Eop Ct]
[jack_34, 4000, 74000, jan_34, 34 Poe Ct]
[joe_10, 5000, 85000, jen_10, 10 Poe Ct]
[joe_33, 5000, 60000, jen_33, 33 Poe Ct]
[joe_41, 5000, 60000, jen_41, 41 Poe Ct]
[joe_28, 3000, 60000, jill_28, 28 Eop Ct]
    at io.trino.tests.product.hive.TestHiveMerge.verifyOnTrinoAndHiveFromQueryResults(TestHiveMerge.java:253)
    at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsInternal(TestHiveMerge.java:247)
    at io.trino.tests.product.hive.TestHiveMerge.lambda$testMergeMultipleOperationsUnbucketedPartitioned$9(TestHiveMerge.java:184)
    at io.trino.tests.product.hive.TestHiveMerge.withTemporaryTable(TestHiveMerge.java:730)
    at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned(TestHiveMerge.java:182)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
    at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:54)
    at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

cc @electrum @djsstarburst

findepi commented 1 year ago

let's merge this with https://github.com/trinodb/trino/issues/14516 i don't think there is a reason to treat them separate for now

sopel39 commented 1 year ago

I think it's actually a product bug. Some rows are missing when writer count is increased with MERGE.

findepi commented 1 year ago

https://github.com/trinodb/trino/actions/runs/4115290901/jobs/7108629638

TestHiveMerge > testMergeMultipleOperationsUnbucketedPartitioned [groups: hive_transactional]
Could not find rows:
[joe_32, 5000, 85000, jen_32, 32 Poe Ct]
[joe_33, 5000, 85000, jen_33, 33 Poe Ct]
[joe_34, 5000, 85000, jen_34, 34 Poe Ct]
[joe_35, 5000, 85000, jen_35, 35 Poe Ct]
[joe_36, 5000, 85000, jen_36, 36 Poe Ct]
[joe_37, 5000, 85000, jen_37, 37 Poe Ct]
[joe_38, 5000, 85000, jen_38, 38 Poe Ct]
[joe_39, 5000, 85000, jen_39, 39 Poe Ct]
[joe_40, 5000, 85000, jen_40, 40 Poe Ct]
[joe_41, 5000, 85000, jen_41, 41 Poe Ct]
[joe_42, 5000, 85000, jen_42, 42 Poe Ct]
[joe_43, 5000, 85000, jen_43, 43 Poe Ct]
[joe_44, 5000, 85000, jen_44, 44 Poe Ct]
[joe_45, 5000, 85000, jen_45, 45 Poe Ct]
[joe_46, 5000, 85000, jen_46, 46 Poe Ct]
[joe_47, 5000, 85000, jen_47, 47 Poe Ct]

actual rows:
[jack_46, 4000, 74000, jan_46, 46 Poe Ct]
[joe_10, 5000, 85000, jen_10, 10 Poe Ct]
[joe_36, 5000, 60000, jen_36, 36 Poe Ct]
[joe_21, 3000, 60000, jill_21, 21 Eop Ct]
[joe_44, 5000, 60000, jen_44, 44 Poe Ct]
[jack_47, 4000, 74000, jan_47, 47 Poe Ct]
[jack_39, 4000, 74000, jan_39, 39 Poe Ct]
[joe_46, 5000, 60000, jen_46, 46 Poe Ct]
[jack_42, 4000, 74000, jan_42, 42 Poe Ct]
[joe_33, 5000, 60000, jen_33, 33 Poe Ct]
[joe_29, 3000, 60000, jill_29, 29 Eop Ct]
[joe_41, 5000, 60000, jen_41, 41 Poe Ct]
[joe_15, 5000, 85000, jen_15, 15 Poe Ct]
[jack_32, 4000, 74000, jan_32, 32 Poe Ct]
[joe_16, 3000, 60000, jill_16, 16 Eop Ct]
[joe_27, 3000, 60000, jill_27, 27 Eop Ct]
[joe_22, 3000, 60000, jill_22, 22 Eop Ct]
[joe_30, 3000, 60000, jill_30, 30 Eop Ct]
[joe_45, 5000, 60000, jen_45, 45 Poe Ct]
[joe_42, 5000, 60000, jen_42, 42 Poe Ct]
[joe_18, 3000, 60000, jill_18, 18 Eop Ct]
[joe_25, 3000, 60000, jill_25, 25 Eop Ct]
[joe_13, 5000, 85000, jen_13, 13 Poe Ct]
[joe_4, 5000, 85000, jen_4, 4 Poe Ct]
[joe_19, 3000, 60000, jill_19, 19 Eop Ct]
[joe_14, 5000, 85000, jen_14, 14 Poe Ct]
[joe_34, 5000, 60000, jen_34, 34 Poe Ct]
[joe_40, 5000, 60000, jen_40, 40 Poe Ct]
[jack_45, 4000, 74000, jan_45, 45 Poe Ct]
[jack_35, 4000, 74000, jan_35, 35 Poe Ct]
[joe_24, 3000, 60000, jill_24, 24 Eop Ct]
[joe_20, 3000, 60000, jill_20, 20 Eop Ct]
[jack_38, 4000, 74000, jan_38, 38 Poe Ct]
[joe_47, 5000, 60000, jen_47, 47 Poe Ct]
[jack_43, 4000, 74000, jan_43, 43 Poe Ct]
[jack_44, 4000, 74000, jan_44, 44 Poe Ct]
[joe_43, 5000, 60000, jen_43, 43 Poe Ct]
[joe_39, 5000, 60000, jen_39, 39 Poe Ct]
[joe_11, 5000, 85000, jen_11, 11 Poe Ct]
[joe_1, 5000, 85000, jen_1, 1 Poe Ct]
[joe_23, 3000, 60000, jill_23, 23 Eop Ct]
[jack_34, 4000, 74000, jan_34, 34 Poe Ct]
[joe_35, 5000, 60000, jen_35, 35 Poe Ct]
[joe_38, 5000, 60000, jen_38, 38 Poe Ct]
[joe_8, 5000, 85000, jen_8, 8 Poe Ct]
[joe_6, 5000, 85000, jen_6, 6 Poe Ct]
[jack_41, 4000, 74000, jan_41, 41 Poe Ct]
[jack_40, 4000, 74000, jan_40, 40 Poe Ct]
[joe_2, 5000, 85000, jen_2, 2 Poe Ct]
[joe_31, 3000, 60000, jill_31, 31 Eop Ct]
[joe_5, 5000, 85000, jen_5, 5 Poe Ct]
[joe_37, 5000, 60000, jen_37, 37 Poe Ct]
[joe_26, 3000, 60000, jill_26, 26 Eop Ct]
[jack_36, 4000, 74000, jan_36, 36 Poe Ct]
[joe_12, 5000, 85000, jen_12, 12 Poe Ct]
[joe_7, 5000, 85000, jen_7, 7 Poe Ct]
[joe_32, 5000, 60000, jen_32, 32 Poe Ct]
[joe_28, 3000, 60000, jill_28, 28 Eop Ct]
[jack_37, 4000, 74000, jan_37, 37 Poe Ct]
[joe_3, 5000, 85000, jen_3, 3 Poe Ct]
[joe_9, 5000, 85000, jen_9, 9 Poe Ct]
[jack_33, 4000, 74000, jan_33, 33 Poe Ct]
[joe_17, 3000, 60000, jill_17, 17 Eop Ct]

cc @djsstarburst

krvikash commented 1 year ago

Same error, but different values

https://github.com/trinodb/trino/actions/runs/4143381507/jobs/7169852819

2023-02-10T18:47:18.0284529Z tests               | 2023-02-11 00:32:18 INFO: FAILURE     /    io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned (Groups: hive_transactional) took 10.4 seconds
2023-02-10T18:47:18.0333685Z tests               | 2023-02-11 00:32:18 SEVERE: Failure cause:
2023-02-10T18:47:18.0334056Z tests               | java.lang.AssertionError: Could not find rows:
2023-02-10T18:47:18.0334416Z tests               | [joe_16, 3000, 83000, jill_16, 16 Eop Ct]
2023-02-10T18:47:18.0334694Z tests               | [joe_17, 3000, 83000, jill_17, 17 Eop Ct]
2023-02-10T18:47:18.0334984Z tests               | [joe_18, 3000, 83000, jill_18, 18 Eop Ct]
2023-02-10T18:47:18.0335250Z tests               | [joe_19, 3000, 83000, jill_19, 19 Eop Ct]
2023-02-10T18:47:18.0335514Z tests               | [joe_20, 3000, 83000, jill_20, 20 Eop Ct]
2023-02-10T18:47:18.0335771Z tests               | [joe_21, 3000, 83000, jill_21, 21 Eop Ct]
2023-02-10T18:47:18.0336019Z tests               | [joe_22, 3000, 83000, jill_22, 22 Eop Ct]
2023-02-10T18:47:18.0336268Z tests               | [joe_23, 3000, 83000, jill_23, 23 Eop Ct]
2023-02-10T18:47:18.0336528Z tests               | [joe_24, 3000, 83000, jill_24, 24 Eop Ct]
2023-02-10T18:47:18.0336777Z tests               | [joe_25, 3000, 83000, jill_25, 25 Eop Ct]
2023-02-10T18:47:18.0337027Z tests               | [joe_26, 3000, 83000, jill_26, 26 Eop Ct]
2023-02-10T18:47:18.0337278Z tests               | [joe_28, 3000, 83000, jill_28, 28 Eop Ct]
2023-02-10T18:47:18.0337526Z tests               | [joe_29, 3000, 83000, jill_29, 29 Eop Ct]
2023-02-10T18:47:18.0337948Z tests               | [joe_30, 3000, 83000, jill_30, 30 Eop Ct]
2023-02-10T18:47:18.0338210Z tests               | [joe_31, 3000, 83000, jill_31, 31 Eop Ct]
2023-02-10T18:47:18.0338440Z tests               | 
2023-02-10T18:47:18.0338665Z tests               | actual rows:
2023-02-10T18:47:18.0338940Z tests               | [joe_1, 1000, 91000, jan_1, 1 Poe Ct]
2023-02-10T18:47:18.0339208Z tests               | [joe_15, 1000, 91000, jan_15, 15 Poe Ct]
2023-02-10T18:47:18.0339482Z tests               | [joe_27, 3000, 83000, jill_27, 27 Eop Ct]
2023-02-10T18:47:18.0339746Z tests               | [joe_14, 1000, 91000, jan_14, 14 Poe Ct]
2023-02-10T18:47:18.0340023Z tests               | [joe_10, 1000, 91000, jan_10, 10 Poe Ct]
2023-02-10T18:47:18.0340294Z tests               | [joe_4, 1000, 91000, jan_4, 4 Poe Ct]
2023-02-10T18:47:18.0340562Z tests               | [joe_27, 3000, 83000, jill_27, 27 Eop Ct]
2023-02-10T18:47:18.0340823Z tests               | [joe_8, 1000, 91000, jan_8, 8 Poe Ct]
2023-02-10T18:47:18.0341092Z tests               | [joe_6, 1000, 91000, jan_6, 6 Poe Ct]
2023-02-10T18:47:18.0341348Z tests               | [joe_5, 1000, 91000, jan_5, 5 Poe Ct]
2023-02-10T18:47:18.0341594Z tests               | [joe_7, 1000, 91000, jan_7, 7 Poe Ct]
2023-02-10T18:47:18.0341837Z tests               | [joe_3, 1000, 91000, jan_3, 3 Poe Ct]
2023-02-10T18:47:18.0342093Z tests               | [joe_11, 1000, 91000, jan_11, 11 Poe Ct]
2023-02-10T18:47:18.0342352Z tests               | [joe_12, 1000, 91000, jan_12, 12 Poe Ct]
2023-02-10T18:47:18.0342602Z tests               | [joe_13, 1000, 91000, jan_13, 13 Poe Ct]
2023-02-10T18:47:18.0342858Z tests               | [joe_2, 1000, 91000, jan_2, 2 Poe Ct]
2023-02-10T18:47:18.0343123Z tests               | [joe_9, 1000, 91000, jan_9, 9 Poe Ct]
2023-02-10T18:47:18.0343634Z tests               |  at io.trino.tests.product.hive.TestHiveMerge.verifyOnTrinoAndHiveFromQueryResults(TestHiveMerge.java:253)
2023-02-10T18:47:18.0344341Z tests               |  at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsInternal(TestHiveMerge.java:218)
2023-02-10T18:47:18.0345040Z tests               |  at io.trino.tests.product.hive.TestHiveMerge.lambda$testMergeMultipleOperationsUnbucketedPartitioned$9(TestHiveMerge.java:184)
2023-02-10T18:47:18.0345672Z tests               |  at io.trino.tests.product.hive.TestHiveMerge.withTemporaryTable(TestHiveMerge.java:764)
2023-02-10T18:47:18.0346363Z tests               |  at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned(TestHiveMerge.java:182)
2023-02-10T18:47:18.0347097Z tests               |  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2023-02-10T18:47:18.0347658Z tests               |  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
2023-02-10T18:47:18.0348274Z tests               |  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2023-02-10T18:47:18.0348789Z tests               |  at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2023-02-10T18:47:18.0349285Z tests               |  at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
2023-02-10T18:47:18.0349819Z tests               |  at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:54)
2023-02-10T18:47:18.0350315Z tests               |  at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
2023-02-10T18:47:18.0350797Z tests               |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
2023-02-10T18:47:18.0351251Z tests               |  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2023-02-10T18:47:18.0351740Z tests               |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
2023-02-10T18:47:18.0352273Z tests               |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
2023-02-10T18:47:18.0352753Z tests               |  at java.base/java.lang.Thread.run(Thread.java:833)
raunaqmorarka commented 1 year ago

Encountered in https://github.com/trinodb/trino/actions/runs/4217735931/jobs/7321893836

findepi commented 1 year ago

cc @djsstarburst @electrum this looks like correctness

krvikash commented 1 year ago

https://github.com/trinodb/trino/pull/16674/checks?check_run_id=12227282972

java.lang.AssertionError: Could not find rows:
[joe_16, 3000, 83000, jill_16, 16 Eop Ct]
[joe_17, 3000, 83000, jill_17, 17 Eop Ct]
[joe_18, 3000, 83000, jill_18, 18 Eop Ct]
[joe_19, 3000, 83000, jill_19, 19 Eop Ct]
[joe_20, 3000, 83000, jill_20, 20 Eop Ct]
[joe_21, 3000, 83000, jill_21, 21 Eop Ct]
[joe_22, 3000, 83000, jill_22, 22 Eop Ct]
[joe_23, 3000, 83000, jill_23, 23 Eop Ct]
[joe_24, 3000, 83000, jill_24, 24 Eop Ct]
[joe_25, 3000, 83000, jill_25, 25 Eop Ct]
[joe_26, 3000, 83000, jill_26, 26 Eop Ct]
[joe_27, 3000, 83000, jill_27, 27 Eop Ct]
[joe_28, 3000, 83000, jill_28, 28 Eop Ct]
[joe_29, 3000, 83000, jill_29, 29 Eop Ct]
[joe_31, 3000, 83000, jill_31, 31 Eop Ct]

actual rows:
[joe_14, 1000, 91000, jan_14, 14 Poe Ct]
[joe_8, 1000, 91000, jan_8, 8 Poe Ct]
[joe_10, 1000, 91000, jan_10, 10 Poe Ct]
[joe_6, 1000, 91000, jan_6, 6 Poe Ct]
[joe_9, 1000, 91000, jan_9, 9 Poe Ct]
[joe_11, 1000, 91000, jan_11, 11 Poe Ct]
[joe_3, 1000, 91000, jan_3, 3 Poe Ct]
[joe_1, 1000, 91000, jan_1, 1 Poe Ct]
[joe_7, 1000, 91000, jan_7, 7 Poe Ct]
[joe_30, 3000, 83000, jill_30, 30 Eop Ct]
[joe_2, 1000, 91000, jan_2, 2 Poe Ct]
[joe_15, 1000, 91000, jan_15, 15 Poe Ct]
[joe_4, 1000, 91000, jan_4, 4 Poe Ct]
[joe_5, 1000, 91000, jan_5, 5 Poe Ct]
[joe_13, 1000, 91000, jan_13, 13 Poe Ct]
[joe_12, 1000, 91000, jan_12, 12 Poe Ct]
    at io.trino.tests.product.hive.TestHiveMerge.verifyOnTrinoAndHiveFromQueryResults(TestHiveMerge.java:253)
    at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsInternal(TestHiveMerge.java:218)
    at io.trino.tests.product.hive.TestHiveMerge.lambda$testMergeMultipleOperationsUnbucketedPartitioned$9(TestHiveMerge.java:184)
    at io.trino.tests.product.hive.TestHiveMerge.withTemporaryTable(TestHiveMerge.java:764)
    at io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned(TestHiveMerge.java:182)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
    at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:54)
    at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
djsstarburst commented 1 year ago

I've just run TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned 25 times on my laptop using the usual product test runner. The test succeeded all 25 times. How frequent are the CI failures? Is there some reason it would fail in CI but not on my laptop?

findepi commented 1 year ago

How frequent are the CI failures?

According to https://nineinchnick.github.io/trino-cicd/reports/flaky/#flaky-tests-in-last-30-days: TestHiveMerge failed once in past 30 days.

findinpath commented 1 year ago

https://github.com/trinodb/trino/pull/16796/checks?check_run_id=12391653065

2023-03-30 14:13:58 INFO: FAILURE     /    io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned (Groups: hive_transactional) took 33.3 seconds
tests               | 2023-03-30 14:13:58 SEVERE: Failure cause:
tests               | java.lang.AssertionError: Could not find rows:
tests               | [joe_32, 5000, 85000, jen_32, 32 Poe Ct]
tests               | [joe_33, 5000, 85000, jen_33, 33 Poe Ct]
tests               | [joe_34, 5000, 85000, jen_34, 34 Poe Ct]
tests               | [joe_35, 5000, 85000, jen_35, 35 Poe Ct]
tests               | [joe_36, 5000, 85000, jen_36, 36 Poe Ct]
tests               | [joe_37, 5000, 85000, jen_37, 37 Poe Ct]
tests               | [joe_38, 5000, 85000, jen_38, 38 Poe Ct]
tests               | [joe_39, 5000, 85000, jen_39, 39 Poe Ct]
tests               | [joe_40, 5000, 85000, jen_40, 40 Poe Ct]
tests               | [joe_41, 5000, 85000, jen_41, 41 Poe Ct]
tests               | [joe_42, 5000, 85000, jen_42, 42 Poe Ct]
tests               | [joe_43, 5000, 85000, jen_43, 43 Poe Ct]
tests               | [joe_44, 5000, 85000, jen_44, 44 Poe Ct]
tests               | [joe_45, 5000, 85000, jen_45, 45 Poe Ct]
tests               | [joe_46, 5000, 85000, jen_46, 46 Poe Ct]
tests               | [joe_47, 5000, 85000, jen_47, 47 Poe Ct]
djsstarburst commented 1 year ago

In all cases I can find in which TestHiveMerge. testMergeMultipleOperationsUnbucketedPartitioned failed, the failure was that HiveConnector.finishMerge (indirectly) calls HDFS to perform the transaction writes, and HDFS could not write a file at commit time. Often the file is the _orc_acid_version file for the bucket, but there are other examples of write failures for bucket files. It ends up as a "correctness" issue because the MERGE operation is not aborted by the exception thrown when trying to write the file. I see where the exceptions are being thrown, and there is no intervening exception catcher, so I don't understand why the MERGE operation succeeded.

In all cases HDFS class org.apache.hadoop.hdfs.server.blockmanagemen.BlockManager.chooseTarget4NewBlock is throwing the exception, saying that no target nodes were found to handle the write request:

    if (blockType == BlockType.CONTIGUOUS && targets.length < minReplication) {
      throw new IOException(String.format(errorMessage, src,
          targets.length, minReplication, "minReplication nodes",
          getDatanodeManager().getNetworkTopology().getNumOfLeaves(),
          (excludedNodes == null? "no": excludedNodes.size())));
    }
io.trino.spi.TrinoException: Exception writing _orc_acid_version file for delta directory: hdfs://hadoop-master:9000/user/hive/warehouse/test_merge_multiple_true_none_6us5hiuuif/address=10 Poe Ct/customer=joe_10/delete_delta_0000004_0000004_0000
    at io.trino.plugin.hive.HiveMetadata.createOrcAcidVersionFile(HiveMetadata.java:2251)
    at io.trino.plugin.hive.HiveMetadata.lambda$finishMerge$58(HiveMetadata.java:1956)
    at java.base/java.util.Optional.ifPresent(Optional.java:178)
...
Caused by: org.apache.hadoop.ipc.RemoteException: File /user/hive/warehouse/test_merge_multiple_true_none_6us5hiuuif/address=10 Poe Ct/customer=joe_10/delete_delta_0000004_0000004_0000/_orc_acid_version could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2121)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:286)
...

It seems clear that the exception should be bubble up to the top and cause the transaction to fail instead of causing a correctness problem.

ebyhr commented 1 year ago

https://github.com/trinodb/trino/actions/runs/5419044626/jobs/9852081055?pr=15821

tests               | 2023-06-30 11:16:12 INFO: FAILURE     /    io.trino.tests.product.hive.TestHiveMerge.testMergeMultipleOperationsUnbucketedPartitioned (Groups: hive_transactional) took 32.2 seconds