Closed elharo closed 7 months ago
A Velox problem?
[ERROR] com.facebook.presto.nativeworker.TestPrestoNativeGeneralQueriesThrift.testSystemTables Time elapsed: 0.877 s <<< FAILURE!
java.lang.AssertionError: Execution of 'actual' query failed: SELECT * FROM (SELECT DISTINCT regionkey FROM tmp_presto_8d2fe12547ca472093a6cdf415f47510) t INNER JOIN (SELECT regionkey FROM "tmp_presto_8d2fe12547ca472093a6cdf415f47510$partitions") p ON t.regionkey = p.regionkey
at org.testng.Assert.fail(Assert.java:98)
at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:178)
at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:106)
at com.facebook.presto.tests.AbstractTestQueryFramework.assertQuery(AbstractTestQueryFramework.java:154)
at com.facebook.presto.nativeworker.AbstractTestNativeGeneralQueries.testSystemTables(AbstractTestNativeGeneralQueries.java:1448)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: buffer != nullptr getData received after its buffer is deleted. Destination: 3, sequence: 0
at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:733)
at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:701)
at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
... 18 more
Caused by: VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: 3, sequence: 0
at Unknown.# 0 _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
at Unknown.# 1
It looks like a real problem in Velox / Prestissimo
I managed to capture a following sequence of events:
getResult
request. Since the upstream task doesn't exist yet the request is getting parked and put on hold: https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/TaskManager.cpp#L875getResult
requests: https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/TaskManager.cpp#L524abort
for it's buffers: https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/TaskManager.cpp#L328getDataForResultRequests
fails on getData
Full stack trace:
VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: 15, sequence: 0
at Unknown.# 0 facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)(Unknown Source)
at Unknown.# 1 void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxRuntimeError, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(facebook::velox::detail::VeloxCheckFailArgs const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)(Unknown Source)
at Unknown.# 2 facebook::velox::exec::OutputBuffer::getData(int, unsigned long, long, std::function<void (std::vector<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::allocator<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > > >, long, std::vector<long, std::allocator<long> >)>, std::function<bool ()>)(Unknown Source)
at Unknown.# 3 facebook::velox::exec::OutputBufferManager::getData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, unsigned long, long, std::function<void (std::vector<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::allocator<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > > >, long, std::vector<long, std::allocator<long> >)>, std::function<bool ()>)(Unknown Source)
at Unknown.# 4 facebook::presto::(anonymous namespace)::getData(std::shared_ptr<facebook::presto::PromiseHolder<std::unique_ptr<facebook::presto::Result, std::default_delete<facebook::presto::Result> > > >, std::weak_ptr<facebook::presto::http::CallbackRequestHandlerState>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, long, facebook::presto::protocol::DataSize, facebook::velox::exec::OutputBufferManager&)(Unknown Source)
at Unknown.# 5 facebook::presto::TaskManager::getDataForResultRequests(std::unordered_map<long, std::shared_ptr<facebook::presto::ResultRequest>, std::hash<long>, std::equal_to<long>, std::allocator<std::pair<long const, std::shared_ptr<facebook::presto::ResultRequest> > > > const&)(Unknown Source)
at Unknown.# 6 facebook::presto::TaskManager::createOrUpdateTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::velox::core::PlanFragment const&, std::vector<facebook::presto::protocol::TaskSource, std::allocator<facebook::presto::protocol::TaskSource> > const&, facebook::presto::protocol::OutputBuffers const&, std::shared_ptr<facebook::velox::core::QueryCtx>, long)(Unknown Source)
at Unknown.# 7 facebook::presto::TaskManager::createOrUpdateTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::presto::protocol::TaskUpdateRequest const&, facebook::velox::core::PlanFragment const&, std::shared_ptr<facebook::velox::core::QueryCtx>, long)(Unknown Source)
at Unknown.# 8 std::_Function_handler<std::unique_ptr<facebook::presto::protocol::TaskInfo, std::default_delete<facebook::presto::protocol::TaskInfo> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long), facebook::presto::TaskResource::createOrUpdateTask(proxygen::HTTPMessage*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)::$_18>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long&&)(Unknown Source)
at Unknown.# 9 _ZN5folly6detail8function14FunctionTraitsIFvRNS_7futures6detail8CoreBaseEONS_8Executor9KeepAliveIS7_EEPNS_17exception_wrapperEEE7callBigIZNS4_4CoreINS_4UnitEE11setCallbackIZNS4_10FutureBaseISH_E18thenImplementationIZNOS_6FutureISH_E9thenValueIZNS_3viaIZZN8facebook6presto12TaskResource22createOrUpdateTaskImplEPN8proxygen11HTTPMessageERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS13_EERKSt8functionIFSt10unique_ptrINSS_8protocol8TaskInfoESt14default_deleteIS1B_EERKS13_S1G_lEEENK3$_2clESW_RKSX_IS19_INS_5IOBufES1C_IS1M_EESaIS1O_EEPNSU_15ResponseHandlerESt10shared_ptrINSS_4http27CallbackRequestHandlerStateEEEUlvE_EENSN_INS_20isFutureOrSemiFutureIDTclclsr3stdE7declvalIT_EEEEE5InnerEEES9_OS21_EUlS26_E_EENSN_INS4_19valueCallableResultISH_S21_E10value_typeEEES26_EUlSA_ONS_3TryISH_EEE_NS4_25tryExecutorCallableResultISH_S2F_vEEEENSt9enable_ifIXntsrNT0_13ReturnsFutureE5valueENS2J_6ReturnEE4typeES26_S2J_NS4_18InlineContinuationEEUlSA_S2E_E_EEvS26_OS1V_INS_14RequestContextEES2N_EUlS6_SA_SC_E_EEvS6_SA_SC_RNS1_4DataE(Unknown Source)
at Unknown.# 10 void folly::detail::function::FunctionTraits<void ()>::callSmall<folly::Executor::KeepAlive<folly::Executor>::add<folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State)::$_0>(folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State)::$_0&&) &&::{lambda()#1}>(folly::detail::function::Data&)(Unknown Source)
at Unknown.# 11 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&)(Unknown Source)
at Unknown.# 12 folly::CPUThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)(Unknown Source)
at Unknown.# 13 void folly::detail::function::FunctionTraits<void ()>::callSmall<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&)(Unknown Source)
at Unknown.# 14 execute_native_thread_routine(Unknown Source)
at Unknown.# 15 start_thread(Unknown Source)
at Unknown.# 16 __clone3(Unknown Source)
In Presto the buffer is not deleted on abort
and getData
requests for an aborted buffer are allowed. Same can be done in Velox (https://github.com/facebookincubator/velox/pull/9207). It looks like some of the other operations are already allowed to come out of order (for example an acknowledge after a delete https://github.com/facebookincubator/velox/blob/main/velox/exec/OutputBuffer.cpp#L636).
An alternative would be to try to abort pending requests, but it might be more difficult due to inherent "raciness" of these operations. Also it is unclear if there are other sequences of requests possible that can lead to a query failure.
@mbasmanova @xiaoxmeng Thoughts?
In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed.
I see. It would be nice to update documentation in https://prestodb.io/docs/current/develop/worker-protocol.html to clarify that.
Let me check https://github.com/facebookincubator/velox/pull/9207
I'm seeing this issue in production from time to time as well: 20240324_170232_00988_y57ss
In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed.
I see. It would be nice to update documentation in https://prestodb.io/docs/current/develop/worker-protocol.html to clarify that.
Noted, thanks!
This is a known issue that happens ~10 times a day with failure message VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: <N>, sequence: 0
where N is a integer. Thanks @arhimondr for debugging this complex issue.
TestPrestoNativeGeneralQueriesThrift>AbstractTestNativeGeneralQueries.testSystemTables:1413->AbstractTestQueryFramework.assertQuery:154 Execution of 'actual' query failed: SELECT * FROM (SELECT DISTINCT regionkey FROM tmp_presto_30a13f6f35074dde851d6931c829c9fc) t INNER JOIN (SELECT regionkey FROM "tmp_presto_30a13f6f35074dde851d6931c829c9fc$partitions") p ON t.regionkey = p.regionkey
We've seen a lot of failures in TestPrestoNativeGeneralQueriesThrift. It's possible the whole class (or the superclass?) is flaky rather than individual test methods.