prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Flaky test: native TaskManagerTest.buildSpillDirectoryFailure #23118

Status: Open. Opened by ZacBlanco 3 months ago.


Your Environment

CircleCI

Expected Behavior

The test passes.

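Current Behavior

The test fails intermittently in CI. The aborted task test.0.0.0.0 is reported as a zombie task with one extra reference, so the TaskManager could not clean it up. waitForAllTasksToBeDeleted then gives up after waiting 3,000,000 us (3 s), with 88 tasks created but only 87 deleted: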

9: [ RUN      ] TaskManagerTest.buildSpillDirectoryFailure
9: E20240702 14:56:35.577245  6366 Exceptions.h:67] Line: /root/project/presto-native-execution/velox/velox/exec/Task.cpp:1869, Function:terminate, Expression:  Aborted for external error, Source: RUNTIME, ErrorCode: INVALID_STATE
9: E20240702 14:56:35.577641  6366 TaskManager.cpp:282] There are 1 zombie Task that satisfy cleanup conditions but could not be cleaned up, because the Task are referenced by more than 1 owners. RUNNING[0] FINISHED[0] CANCELED[0] ABORTED[1] FAILED[0]  Sample task IDs (shows only 20 IDs): 
9: E20240702 14:56:35.577682  6366 TaskManager.cpp:292] Zombie Task [1/1]: Extra Refs: 1, test.0.0.0.0
9: E20240702 14:56:38.767441  6366 Exceptions.h:67] Line: /root/project/presto-native-execution/velox/velox/exec/tests/utils/QueryAssertions.cpp:1458, Function:waitForAllTasksToBeDeleted, Expression: numDeletedTasks == numCreatedTasks (87 vs. 88) 88 tasks have been created while only 87 have been deleted after waiting for 3000000 us, Source: RUNTIME, ErrorCode: INVALID_STATE
9: unknown file: Failure
9: C++ exception with description "Exception: VeloxRuntimeError
9: Error Source: RUNTIME
9: Error Code: INVALID_STATE
9: Reason: (87 vs. 88) 88 tasks have been created while only 87 have been deleted after waiting for 3000000 us
9: Retriable: False
9: Expression: numDeletedTasks == numCreatedTasks
9: Function: waitForAllTasksToBeDeleted
9: File: /root/project/presto-native-execution/velox/velox/exec/tests/utils/QueryAssertions.cpp
9: Line: 1458
9: Stack trace: frames # 0 through # 20 are empty in the captured output
9: " thrown in the test body.
9: [  FAILED  ] TaskManagerTest.buildSpillDirectoryFailure (3297 ms)
9: [----------] 20 tests from TaskManagerTest (14259 ms total)
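The log points to a plausible mechanism, though it does not prove a root cause: the aborted task could not be removed because some owner other than the TaskManager still held a reference to it. As a result, the created and deleted task counts never matched before the 3 s wait expired. The sketch below models that interaction under simplifying assumptions. FakeTask, TaskMap, cleanupTasks, and this version of waitForAllTasksToBeDeleted are hypothetical stand-ins, not the real Velox/Presto classes or helpers. "More than one owner" is approximated with std::shared_ptr::use_count().

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <memory>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical counters and task type for illustration only; these are NOT
// the Velox/Presto Task or TaskManager implementations.
static std::atomic<int> numCreatedTasks{0};
static std::atomic<int> numDeletedTasks{0};

struct FakeTask {
  explicit FakeTask(std::string taskId) : id(std::move(taskId)) { ++numCreatedTasks; }
  ~FakeTask() { ++numDeletedTasks; }
  std::string id;
};

// Hypothetical registry: task ID -> owning pointer held by the "manager".
using TaskMap = std::unordered_map<std::string, std::shared_ptr<FakeTask>>;

// Erases every task whose only owner is the registry. A task that another
// owner still references survives as a "zombie"; returns how many survived.
int cleanupTasks(TaskMap& tasks) {
  int zombies = 0;
  for (auto it = tasks.begin(); it != tasks.end();) {
    const long refs = it->second.use_count();
    if (refs == 1) {
      it = tasks.erase(it);  // last reference dropped, so ~FakeTask runs
    } else {
      ++zombies;
      std::fprintf(stderr, "Zombie task %s, extra refs: %ld\n",
                   it->first.c_str(), refs - 1);
      ++it;
    }
  }
  return zombies;
}

// Polls until every created task has been destroyed or the timeout expires,
// modeled loosely on the numDeletedTasks == numCreatedTasks check in the log.
bool waitForAllTasksToBeDeleted(std::chrono::microseconds timeout) {
  assert(timeout.count() >= 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (numDeletedTasks.load() != numCreatedTasks.load()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // some task is still alive, e.g. a zombie
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return true;
}
```

To use it, register two tasks and keep an extra std::shared_ptr copy of one of them to stand in for the lingering owner. Cleanup then reports one zombie, and the wait times out with created and deleted counts that differ by one. After that copy is reset, a second cleanup removes the task and the wait succeeds.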

Steps to Reproduce

Unknown. The failure was observed in CI: https://app.circleci.com/pipelines/github/prestodb/presto/17482/workflows/5b249155-53d3-4a15-a85e-e52ca3a62f05/jobs/68947