prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

AbstractTestNativeGeneralQueries.testSystemTables is flaky #22129

Closed elharo closed 7 months ago

elharo commented 8 months ago

TestPrestoNativeGeneralQueriesThrift>AbstractTestNativeGeneralQueries.testSystemTables:1413->AbstractTestQueryFramework.assertQuery:154 Execution of 'actual' query failed: SELECT * FROM (SELECT DISTINCT regionkey FROM tmp_presto_30a13f6f35074dde851d6931c829c9fc) t INNER JOIN (SELECT regionkey FROM "tmp_presto_30a13f6f35074dde851d6931c829c9fc$partitions") p ON t.regionkey = p.regionkey

We've seen a lot of failures in TestPrestoNativeGeneralQueriesThrift. It's possible the whole class (or the superclass?) is flaky rather than individual test methods.

elharo commented 8 months ago

A Velox problem?

[ERROR] com.facebook.presto.nativeworker.TestPrestoNativeGeneralQueriesThrift.testSystemTables  Time elapsed: 0.877 s  <<< FAILURE!
java.lang.AssertionError: Execution of 'actual' query failed: SELECT * FROM    (SELECT DISTINCT regionkey FROM tmp_presto_8d2fe12547ca472093a6cdf415f47510) t INNER JOIN    (SELECT regionkey FROM "tmp_presto_8d2fe12547ca472093a6cdf415f47510$partitions") p ON t.regionkey = p.regionkey
    at org.testng.Assert.fail(Assert.java:98)
    at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:178)
    at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:106)
    at com.facebook.presto.tests.AbstractTestQueryFramework.assertQuery(AbstractTestQueryFramework.java:154)
    at com.facebook.presto.nativeworker.AbstractTestNativeGeneralQueries.testSystemTables(AbstractTestNativeGeneralQueries.java:1448)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135)
    at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673)
    at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220)
    at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
    at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945)
    at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193)
    at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
    at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: buffer != nullptr getData received after its buffer is deleted. Destination: 3, sequence: 0
    at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
    at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:733)
    at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:701)
    at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
    ... 18 more
Caused by: VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: 3, sequence: 0
    at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
    at Unknown.# 1 
arhimondr commented 8 months ago

It looks like a real problem in Velox / Prestissimo

I managed to capture a following sequence of events:

Full stack trace:

VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: 15, sequence: 0
    at Unknown.# 0  facebook::velox::VeloxException::VeloxException(char const*, unsigned long, char const*, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> >, bool, facebook::velox::VeloxException::Type, std::basic_string_view<char, std::char_traits<char> >)(Unknown Source)
    at Unknown.# 1  void facebook::velox::detail::veloxCheckFail<facebook::velox::VeloxRuntimeError, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(facebook::velox::detail::VeloxCheckFailArgs const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)(Unknown Source)
    at Unknown.# 2  facebook::velox::exec::OutputBuffer::getData(int, unsigned long, long, std::function<void (std::vector<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::allocator<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > > >, long, std::vector<long, std::allocator<long> >)>, std::function<bool ()>)(Unknown Source)
    at Unknown.# 3  facebook::velox::exec::OutputBufferManager::getData(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, unsigned long, long, std::function<void (std::vector<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::allocator<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > > >, long, std::vector<long, std::allocator<long> >)>, std::function<bool ()>)(Unknown Source)
    at Unknown.# 4  facebook::presto::(anonymous namespace)::getData(std::shared_ptr<facebook::presto::PromiseHolder<std::unique_ptr<facebook::presto::Result, std::default_delete<facebook::presto::Result> > > >, std::weak_ptr<facebook::presto::http::CallbackRequestHandlerState>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, long, facebook::presto::protocol::DataSize, facebook::velox::exec::OutputBufferManager&)(Unknown Source)
    at Unknown.# 5  facebook::presto::TaskManager::getDataForResultRequests(std::unordered_map<long, std::shared_ptr<facebook::presto::ResultRequest>, std::hash<long>, std::equal_to<long>, std::allocator<std::pair<long const, std::shared_ptr<facebook::presto::ResultRequest> > > > const&)(Unknown Source)
    at Unknown.# 6  facebook::presto::TaskManager::createOrUpdateTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::velox::core::PlanFragment const&, std::vector<facebook::presto::protocol::TaskSource, std::allocator<facebook::presto::protocol::TaskSource> > const&, facebook::presto::protocol::OutputBuffers const&, std::shared_ptr<facebook::velox::core::QueryCtx>, long)(Unknown Source)
    at Unknown.# 7  facebook::presto::TaskManager::createOrUpdateTask(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::presto::protocol::TaskUpdateRequest const&, facebook::velox::core::PlanFragment const&, std::shared_ptr<facebook::velox::core::QueryCtx>, long)(Unknown Source)
    at Unknown.# 8  std::_Function_handler<std::unique_ptr<facebook::presto::protocol::TaskInfo, std::default_delete<facebook::presto::protocol::TaskInfo> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long), facebook::presto::TaskResource::createOrUpdateTask(proxygen::HTTPMessage*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)::$_18>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long&&)(Unknown Source)
    at Unknown.# 9  _ZN5folly6detail8function14FunctionTraitsIFvRNS_7futures6detail8CoreBaseEONS_8Executor9KeepAliveIS7_EEPNS_17exception_wrapperEEE7callBigIZNS4_4CoreINS_4UnitEE11setCallbackIZNS4_10FutureBaseISH_E18thenImplementationIZNOS_6FutureISH_E9thenValueIZNS_3viaIZZN8facebook6presto12TaskResource22createOrUpdateTaskImplEPN8proxygen11HTTPMessageERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS13_EERKSt8functionIFSt10unique_ptrINSS_8protocol8TaskInfoESt14default_deleteIS1B_EERKS13_S1G_lEEENK3$_2clESW_RKSX_IS19_INS_5IOBufES1C_IS1M_EESaIS1O_EEPNSU_15ResponseHandlerESt10shared_ptrINSS_4http27CallbackRequestHandlerStateEEEUlvE_EENSN_INS_20isFutureOrSemiFutureIDTclclsr3stdE7declvalIT_EEEEE5InnerEEES9_OS21_EUlS26_E_EENSN_INS4_19valueCallableResultISH_S21_E10value_typeEEES26_EUlSA_ONS_3TryISH_EEE_NS4_25tryExecutorCallableResultISH_S2F_vEEEENSt9enable_ifIXntsrNT0_13ReturnsFutureE5valueENS2J_6ReturnEE4typeES26_S2J_NS4_18InlineContinuationEEUlSA_S2E_E_EEvS26_OS1V_INS_14RequestContextEES2N_EUlS6_SA_SC_E_EEvS6_SA_SC_RNS1_4DataE(Unknown Source)
    at Unknown.# 10 void folly::detail::function::FunctionTraits<void ()>::callSmall<folly::Executor::KeepAlive<folly::Executor>::add<folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State)::$_0>(folly::futures::detail::CoreBase::doCallback(folly::Executor::KeepAlive<folly::Executor>&&, folly::futures::detail::State)::$_0&&) &&::{lambda()#1}>(folly::detail::function::Data&)(Unknown Source)
    at Unknown.# 11 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&)(Unknown Source)
    at Unknown.# 12 folly::CPUThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)(Unknown Source)
    at Unknown.# 13 void folly::detail::function::FunctionTraits<void ()>::callSmall<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&)(Unknown Source)
    at Unknown.# 14 execute_native_thread_routine(Unknown Source)
    at Unknown.# 15 start_thread(Unknown Source)
    at Unknown.# 16 __clone3(Unknown Source)
arhimondr commented 8 months ago

In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed. Same can be done in Velox (https://github.com/facebookincubator/velox/pull/9207). It looks like some of the other operations are already allowed to come out of order (for example an acknowledge after a delete https://github.com/facebookincubator/velox/blob/main/velox/exec/OutputBuffer.cpp#L636).

An alternative would be to try to abort pending requests, but it might be more difficult due to inherent "raciness" of these operations. Also it is unclear if there are other sequences of requests possible that can lead to a query failure.

@mbasmanova @xiaoxmeng Thoughts?

mbasmanova commented 7 months ago

In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed.

I see. It would be nice to update documentation in https://prestodb.io/docs/current/develop/worker-protocol.html to clarify that.

Let me check https://github.com/facebookincubator/velox/pull/9207

mbasmanova commented 7 months ago

I'm seeing this issue in production from time to time as well: 20240324_170232_00988_y57ss

steveburnett commented 7 months ago

In Presto the buffer is not deleted on abort and getData requests for an aborted buffer are allowed.

I see. It would be nice to update documentation in https://prestodb.io/docs/current/develop/worker-protocol.html to clarify that.

Noted, thanks!

amitkdutta commented 7 months ago

This is a known issue that happens ~10 times a day with failure message VeloxRuntimeError: buffer != nullptr getData received after its buffer is deleted. Destination: <N>, sequence: 0 where N is a integer. Thanks @arhimondr for debugging this complex issue.