Closed yingsu00 closed 1 year ago
@gopukrishnasIBM is looking into it, as he runs the IBM TPC-DS workload at higher SF with help from Linsong.
This could be similar to previous issues we've seen where a TaskManager failure escalates into other failures in error-handling code paths.
@majetideepak : Any ideas ?
Operator::isBlocked failed for [operator: Exchange, plan node ID: 1465]: Failed to fetched data from 10.78.43.252:8585 /v1/task/20230629_203031_00426_dikw7.6.0.16.0/results/8/0 - Exhausted retries: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
These are error messages you see in Prestissimo when a worker tries to fetch data from another Prestissimo worker that likely crashed. We have to see the logs on the node(10.78.43.252) where Prestissimo worker crashed to determine the root cause.
More details would help as well: Do the queries immediately fail, or do they fail after executing for some time? Is it an OOM that killed the Prestissimo process? You can check the kernel logs on that node. Can we reproduce this failure consistently with some minimal query?
@majetideepak It's reproducible on any cluster with 16 or 32 nodes and running TPCDS 10TB. Q55 was the simplest query I found to reproduce the error.
@czentgr @majetideepak @yzhang1991 @MuralidharVVK1 @ashokku2022 Linsong shared 3 coordinator logs and 65 worker node logs with me, which can be found at https://s3-oss-jenkins.ahana.dev/oss-prestodb/. After examining these, I have some questions and observations. Two of the three coordinators were idle, while one was actively working.
Among the 65 worker logs, 25 have only a few lines indicating a connection failure. I'm curious why there are no subsequent retry attempts.
2023-08-14T20:31:05.739Z I0814 20:31:05.739145 9 PrestoServer.cpp:341] Starting with node memory 230GB
2023-08-14T20:31:05.756Z I0814 20:31:05.756304 9 PrestoServer.cpp:456] STARTUP: Registering catalog oss_glue using connector hive-hadoop2, cache enabled: 1, cache size: 0
2023-08-14T20:31:05.756Z I0814 20:31:05.756423 9 PrestoServer.cpp:169] STARTUP: Starting server at :::8585 (10.78.139.79)
2023-08-14T20:31:05.758Z I0814 20:31:05.758229 20 Announcer.cpp:125] Discovery service changed to 172.20.196.247:8585
2023-08-14T20:31:05.760Z I0814 20:31:05.760804 20 Announcer.cpp:148] Announcement succeeded: 202
2023-08-14T20:31:05.761Z I0814 20:31:05.761909 9 PrestoServer.cpp:275] STARTUP: Starting all periodic tasks...
2023-08-14T20:31:05.762Z I0814 20:31:05.762070 9 HttpServer.cpp:125] STARTUP: proxygen::HTTPServer::start()
2023-08-14T20:31:35.766Z I0814 20:31:35.766618 20 Announcer.cpp:148] Announcement succeeded: 202
2023-08-14T20:32:05.773Z W0814 20:32:05.773288 20 Announcer.cpp:154] Announcement failed: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
The first error that I saw on the worker side is an out-of-memory error:
2023-08-14T21:40:11.965Z W0814 21:40:11.965792 643 MmapAllocator.cpp:429] [MEM] Exceeded memory allocator limit when adding 512 new pages for total allocation of 106496 pages, the memory allocator capacity is 60293120
2023-08-14T21:40:11.965Z W0814 21:40:11.965801 745 MmapAllocator.cpp:79] [MEM] Exceeding memory allocator limit when allocate 64 pages with capacity of 60293120
2023-08-14T21:40:12.972Z W0814 21:40:12.971920 672 MmapAllocator.cpp:79] [MEM] Exceeding memory allocator limit when allocate 128 pages with capacity of 60293120
2023-08-14T21:40:13.021Z W0814 21:40:13.021198 684 MmapAllocator.cpp:429] [MEM] Exceeded memory allocator limit when adding 512 new pages for total allocation of 125952 pages, the memory allocator capacity is 60293120
2023-08-14T21:40:13.977Z W0814 21:40:13.977497 708 MmapAllocator.cpp:79] [MEM] Exceeding memory allocator limit when allocate 128 pages with capacity of 60293120
2023-08-14T21:40:14.022Z W0814 21:40:14.022696 749 MmapAllocator.cpp:429] [MEM] Exceeded memory allocator limit when adding 512 new pages for total allocation of 129024 pages, the memory allocator capacity is 60293120
2023-08-14T21:40:14.092Z W0814 21:40:14.092751 816 MmapAllocator.cpp:88] [MEM] Exceeded memory allocator limit when allocate 64 pages with capacity of 60293120
and on the coordinator side:
2023-08-14T21:47:29.507Z 2023-08-14T21:47:29.505Z ERROR remote-task-callback-1642 com.facebook.presto.execution.StageExecutionStateMachine Stage execution 20230814_214520_00015_geneu.26.0 failed
2023-08-14T21:47:29.507Z VeloxRuntimeError: Exceeded memory pool cap of 163.84GB with max 163.84GB when requesting 1.00MB, memory manager cap is 230.00GB, requestor 'op.4742.4.1.LocalPartition' with current usage 1.88MB
2023-08-14T21:47:29.507Z 20230814_214520_00015_geneu usage 163.84GB peak 163.84GB
2023-08-14T21:47:29.507Z task.20230814_214520_00015_geneu.16.0.10.0 usage 198.00MB peak 708.00MB
2023-08-14T21:47:29.507Z node.root usage 167.00MB peak 676.00MB
2023-08-14T21:47:29.507Z op.root.0.13.PartitionedOutput usage 14.41MB peak 39.95MB
Because of this, I suspect the issue might be memory-related.
Another issue I found in the logs is a "dequeueLocked" failure. On the worker side:
2023-08-14T22:02:41.957Z I0814 22:02:41.957504 1451 Task.cpp:1647] Terminating task 20230814_220048_00037_geneu.16.0.37.0 with state Failed after running for 112513 ms.
2023-08-14T22:02:41.957Z E0814 22:02:41.957602 1436 Exceptions.h:69] Line: ../../velox/velox/exec/ExchangeQueue.cpp:100, Function:dequeueLocked, Expression: Failed to fetch data from 10.78.174.48:8585 /v1/task/20230814_220048_00037_geneu.17.0.17.0/results/37/0 - Exhausted retries: ingress timeout, streamID=1, Source: RUNTIME, ErrorCode: INVALID_STATE
2023-08-14T22:02:41.957Z E0814 22:02:41.957715 1446 Exceptions.h:69] Line: ../../velox/velox/exec/ExchangeQueue.cpp:100, Function:dequeueLocked, Expression: Failed to fetch data from 10.78.174.48:8585 /v1/task/20230814_220048_00037_geneu.17.0.17.0/results/37/0 - Exhausted retries: ingress timeout, streamID=1, Source: RUNTIME, ErrorCode: INVALID_STATE
2023-08-14T22:02:41.957Z E0814 22:02:41.957813 1457 Exceptions.h:69] Line: ../../velox/velox/exec/ExchangeQueue.cpp:100, Function:dequeueLocked, Expression: Failed to fetch data from 10.78.174.48:8585 /v1/task/20230814_220048_00037_geneu.17.0.17.0/results/37/0 - Exhausted retries: ingress timeout, streamID=1, Source: RUNTIME, ErrorCode: INVALID_STATE
coordinator side:
2023-08-14T22:11:24.515Z VeloxRuntimeError: Failed to fetch data from 10.78.183.226:8585 /v1/task/20230814_220920_00041_geneu.15.0.23.0/results/29/0 - Exhausted retries: ingress timeout, streamID=1
2023-08-14T22:11:24.515Z at Unknown.# 0 _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 1 _ZN8facebook5velox14VeloxExceptionC2EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 2 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 3 _ZN8facebook5velox4exec13ExchangeQueue13dequeueLockedEPbPN5folly10SemiFutureINS4_4UnitEEE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 4 _ZN8facebook5velox4exec14ExchangeClient4nextEPbPN5folly10SemiFutureINS4_4UnitEEE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 5 _ZN8facebook5velox4exec8Exchange9isBlockedEPN5folly10SemiFutureINS3_4UnitEEE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 6 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 7 _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 8 _ZN5folly6detail8function14FunctionTraitsIFvvEE9callSmallIZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS9_EEUlvE_EEvRNS1_4DataE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 9 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 10 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 11 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 12 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 13 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 14 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 15 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 16 _ZN5folly6detail8function14FunctionTraitsIFvvEE9callSmallISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS7_6ThreadEEEPS7_SA_EEEEvRNS1_4DataE(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 17 0x0000000000000000(Unknown Source)
2023-08-14T22:11:24.515Z at Unknown.# 18 start_thread(Unknown Source)
I also came across SIGSEGV errors in different worker logs. One example is a worker with node.ip=10.78.161.169:
2023-08-14T22:09:21.375Z I0814 22:09:21.375032 29 TaskManager.cpp:457] Starting task 20230814_220920_00041_geneu.27.0.3.0 with 16 max drivers.
2023-08-14T22:09:21.388Z I0814 22:09:21.388250 26 TaskManager.cpp:503] No more splits for 20230814_220920_00041_geneu.26.0.0.0 for node 3234
2023-08-14T22:09:21.395Z I0814 22:09:21.395074 29 TaskManager.cpp:503] No more splits for 20230814_220920_00041_geneu.27.0.3.0 for node 50
2023-08-14T22:09:21.395Z I0814 22:09:21.395102 29 TaskManager.cpp:503] No more splits for 20230814_220920_00041_geneu.27.0.3.0 for node 3233
2023-08-14T22:10:56.663Z *** Aborted at 1692051056 (Unix time, try 'date -d @1692051056') ***
2023-08-14T22:10:56.663Z *** Signal 11 (SIGSEGV) (0x7d284f35ac00) received by PID 9 (pthread TID 0x7d47d17fb700) (linux TID 235) (code: address not mapped to object), stack trace: ***
2023-08-14T22:10:56.663Z (error retrieving stack trace)
2023-08-14T22:19:28.355Z I0814 22:19:28.355716 20 Announcer.cpp:161] Announcement succeeded: HTTP 202
coordinator:
2023-08-14T22:14:54.253Z 2023-08-14T22:14:54.253Z WARN TaskInfoFetcher-20230814_220920_00041_geneu.27.0.3.0-4757 com.facebook.presto.server.RequestErrorTracker Error getting info for task 20230814_220920_00041_geneu.27.0.3.0: Server refused connection: http://10.78.161.169:8585/v1/task/20230814_220920_00041_geneu.27.0.3.0?summarize: http://10.78.161.169:8585/v1/task/20230814_220920_00041_geneu.27.0.3.0
In the plan,
Fragment 2 [SOURCE] >
Output layout: [i_brand, i_brand_id, sum_20] >
Output partitioning: HASH [i_brand, i_brand_id] >
Stage Execution Strategy: UNGROUPED_EXECUTION >
- Aggregate(PARTIAL)[i_brand, i_brand_id] => [i_brand:varchar(50), i_brand_id:integer, sum_20:varbinary] >
sum_20 := "presto.default.sum"((ss_ext_sales_price)) (1:81) >
- InnerJoin[("ss_item_sk" = "i_item_sk")] => [ss_ext_sales_price:decimal(7,2), i_brand_id:integer, i_brand:varchar(50)] >
Estimates: {rows: 46351051 (3.50GB), cpu: 15300513743569.34, memory: 5458621.00, network: 5458621.00} >
Distribution: REPLICATED >
- InnerJoin[("ss_sold_date_sk" = "d_date_sk")] => [ss_item_sk:bigint, ss_ext_sales_price:decimal(7,2)] >
Estimates: {rows: 4576801864 (490.19GB), cpu: 15218156819337.52, memory: 8449.70, network: 8449.70} >
Distribution: REPLICATED >
- TableScan[TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf100000_parquet_varchar, tableName=store_sales, analyzePartitionValues=Optional.empty}', layout='Optional[tpcds_sf100000_parquet_varchar.store_sales{}]'}, grouped = false] => [ss_sold_date_sk:bigint, ss_item_sk:bigint, ss_ext_sales_price:decimal(7,2)>
Estimates: {rows: 288002140051 (30.12TB), cpu: 7568710790041.00, memory: 0.00, network: 0.00} >
LAYOUT: tpcds_sf100000_parquet_varchar.store_sales{} >
ss_ext_sales_price := ss_ext_sales_price:decimal(7,2):15:REGULAR (1:137) >
ss_sold_date_sk := ss_sold_date_sk:bigint:0:REGULAR (1:137) >
ss_item_sk := ss_item_sk:bigint:2:REGULAR (1:137) >
- LocalExchange[HASH] (d_date_sk) => [d_date_sk:bigint] >
Estimates: {rows: 30 (3.40kB), cpu: 545.14, memory: 0.00, network: 8449.70} >
- RemoteSource[3] => [d_date_sk:bigint] >
- LocalExchange[HASH] (i_item_sk) => [i_item_sk:bigint, i_brand_id:integer, i_brand:varchar(50)] >
Estimates: {rows: 5008 (396.65kB), cpu: 351623.96, memory: 0.00, network: 5450171.30} >
- RemoteSource[4] => [i_item_sk:bigint, i_brand_id:integer, i_brand:varchar(50)]
How can we distinguish between the probe table and the build table? Could using the larger table as the build table potentially lead to a memory overflow error?
The build sides for the inner joins in stage 2 of query 55 are pretty small. The scans from stage 3 and stage 4 are fed as the HashBuild side for the inner joins. The stage 2 slowdown is mostly coming from the TableScan of store_sales and the HashProbe operator of the inner join "ss_item_sk" = "i_item_sk".
I ran the query twice and I am not able to repro the issue:
presto:tpcds_sf100000_parquet_varchar> SELECT "i_brand_id" "brand_id" , "i_brand" "brand" , "sum"("ss_ext_sales_price") "ext_price" FROM date_dim , store_sales , item WHERE ("d_date_sk" = "ss_sold_date_sk") AND ("ss_item_sk" = "i_item_sk") AND ("i_manager_id" = 28) AND ("d_moy" = 11) AND ("d_year" = 1999) GROUP BY "i_brand", "i_brand_id" ORDER BY "ext_price" DESC, "i_brand_id" ASC LIMIT 100;
brand_id | brand | ext_price
----------+----------------------+---------------
2003001 | exportiimporto #1 | 5964127572.62
5004001 | edu packscholar #1 | 5576632210.47
5002001 | importoscholar #1 | 5567732557.60
2002001 | importoimporto #1 | 5449225119.18
4004001 | edu packedu pack #1 | 5312633690.87
4001001 | amalgedu pack #1 | 5279164166.79
3004001 | edu packexporti #1 | 5261083386.89
3002001 | importoexporti #1 | 5209729864.53
2004001 | edu packimporto #1 | 5180197512.20
1001001 | amalgamalg #1 | 5143336367.63
3001001 | amalgexporti #1 | 5084697742.14
4002001 | importoedu pack #1 | 5044387871.94
3003001 | exportiexporti #1 | 4814667457.27
5001001 | amalgscholar #1 | 4719865362.20
4003001 | exportiedu pack #1 | 4707612519.32
1003001 | exportiamalg #1 | 4627904623.74
5003001 | exportischolar #1 | 4575805261.31
2001001 | amalgimporto #1 | 4486235392.55
1002001 | importoamalg #1 | 4180718016.42
1004001 | edu packamalg #1 | 4101025071.39
3003002 | exportiexporti #2 | 3192322918.02
4001002 | amalgedu pack #2 | 3006853728.39
5001002 | amalgscholar #2 | 2852154932.45
5002002 | importoscholar #2 | 2833324097.64
5004002 | edu packscholar #2 | 2796275436.75
2001002 | amalgimporto #2 | 2762719006.14
Does this cluster have 64 workers? It's different from the Benchto cluster, which had 16 nodes, so it's possible the issue doesn't reproduce here. Can you run other queries?
Scale Factor: sf100000
I enabled the glog signal handler in Prestissimo: https://github.com/prestodb/presto/commit/2978dc802ab9e53d674fce0cb942f53eb7a5e14a
Cluster: https://large-0-284-d7f5be4-b40n.oss-prestodb.cp.ahana.cloud/ui/
This helped generate a stack trace:
2023-08-30T14:11:53.899-07:00 I0830 21:11:53.899216 20 TaskManager.cpp:662] cleanOldTasks: Cleaned 1 old task(s) in 0 ms
2023-08-30T14:11:57.540-07:00 *** Aborted at 1693429917 (unix time) try "date -d @1693429917" if you are using GNU date ***
2023-08-30T14:11:57.542-07:00 PC: @ 0x0 (unknown)
2023-08-30T14:11:57.542-07:00 *** SIGSEGV (@0x7d4758940a80) received by PID 9 (TID 0x7d4f975fe700) from PID 1486097024; stack trace: ***
2023-08-30T14:11:57.543-07:00 @ 0x7f55adbc448e google::(anonymous namespace)::FailureSignalHandler()
2023-08-30T14:11:57.544-07:00 @ 0x7f55aa267cf0 (unknown)
2023-08-30T14:11:57.549-07:00 @ 0x3cd708d facebook::velox::exec::HashTable<>::insertForGroupBy()
2023-08-30T14:11:57.552-07:00 @ 0x3ce20b2 facebook::velox::exec::HashTable<>::insertBatch()
2023-08-30T14:11:57.556-07:00 @ 0x3ceb080 facebook::velox::exec::HashTable<>::rehash()
2023-08-30T14:11:57.559-07:00 @ 0x3ceb312 facebook::velox::exec::HashTable<>::checkSize()
2023-08-30T14:11:57.561-07:00 @ 0x3ceb3dc facebook::velox::exec::HashTable<>::groupProbe()
2023-08-30T14:11:57.564-07:00 @ 0x3ca8586 facebook::velox::exec::GroupingSet::addInputForActiveRows()
2023-08-30T14:11:57.567-07:00 @ 0x3ca8d05 facebook::velox::exec::GroupingSet::addInput()
2023-08-30T14:11:57.572-07:00 @ 0x3caffee facebook::velox::exec::HashAggregation::addInput()
2023-08-30T14:11:57.578-07:00 @ 0x3b7c8eb facebook::velox::exec::Driver::runInternal()
2023-08-30T14:11:57.581-07:00 @ 0x3b7d697 facebook::velox::exec::Driver::run()
2023-08-30T14:11:57.583-07:00 @ 0x3b7d83f _ZN5folly6detail8function14FunctionTraitsIFvvEE9callSmallIZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS9_EEUlvE_EEvRNS1_4DataE
2023-08-30T14:11:57.584-07:00 @ 0x7f55abff6bd9 folly::detail::function::FunctionTraits<>::operator()()
2023-08-30T14:11:57.588-07:00 @ 0x7f55a757a954 folly::ThreadPoolExecutor::runTask()
2023-08-30T14:11:57.590-07:00 @ 0x7f55a74fb330 folly::CPUThreadPoolExecutor::threadRun()
2023-08-30T14:11:57.592-07:00 @ 0x7f55a7583055 std::__invoke_impl<>()
2023-08-30T14:11:57.594-07:00 @ 0x7f55a7582583 std::__invoke<>()
2023-08-30T14:11:57.596-07:00 @ 0x7f55a75818b3 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
2023-08-30T14:11:57.599-07:00 @ 0x7f55a7580956 std::_Bind<>::operator()<>()
2023-08-30T14:11:57.602-07:00 @ 0x7f55a757f91a folly::detail::function::FunctionTraits<>::callSmall<>()
2023-08-30T14:11:57.603-07:00 @ 0x7f55a55e1b23 (unknown)
2023-08-30T14:11:57.604-07:00 @ 0x7f55aa25d1ca start_thread
The above stack trace is from one of the workers. All workers seem to have received SIGSEGV:
1. 2023-08-30T14:11:57.542-07:00 [worker.10.78.173.169.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.173.169.presto-server$3Fstart$3D2023-08-30T21$253A11$253A57.542Z)
   *** SIGSEGV (@0x7d4758940a80) received by PID 9 (TID 0x7d4f975fe700) from PID 1486097024; stack trace: ***
2. 2023-08-30T14:11:56.530-07:00 [worker.10.78.174.173.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.174.173.presto-server$3Fstart$3D2023-08-30T21$253A11$253A56.530Z)
   *** SIGSEGV (@0x7d180c04d300) received by PID 9 (TID 0x7d1ffe1fc700) from PID 201642752; stack trace: ***
3. 2023-08-30T14:11:56.431-07:00 [worker.10.78.149.220.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.149.220.presto-server$3Fstart$3D2023-08-30T21$253A11$253A56.431Z)
   *** SIGSEGV (@0x7d1ddeb1fb00) received by PID 9 (TID 0x7d2548bfd700) from PID 18446744073150790400; stack trace: ***
4. 2023-08-30T14:11:56.234-07:00 [worker.10.78.181.82.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.181.82.presto-server$3Fstart$3D2023-08-30T21$253A11$253A56.234Z)
   *** SIGSEGV (@0x7cef66c83480) received by PID 9 (TID 0x7cf73e1fc700) from PID 1724396672; stack trace: ***
5. 2023-08-30T14:11:56.049-07:00 [worker.10.78.135.166.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.135.166.presto-server$3Fstart$3D2023-08-30T21$253A11$253A56.049Z)
   *** SIGSEGV (@0x7d163edc5280) received by PID 9 (TID 0x7d1dc21fc700) from PID 1054626432; stack trace: ***
6. 2023-08-30T14:11:55.701-07:00 [worker.10.78.170.201.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.170.201.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.701Z)
   *** SIGSEGV (@0x7ddba492a800) received by PID 9 (TID 0x7de35cdfa700) from PID 18446744072175659008; stack trace: ***
7. 2023-08-30T14:11:55.640-07:00 [worker.10.78.148.215.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.148.215.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.640Z)
   *** SIGSEGV (@0x7d625b3e7780) received by PID 9 (TID 0x7d6a60dfa700) from PID 1530820480; stack trace: ***
8. 2023-08-30T14:11:55.532-07:00 [worker.10.78.163.56.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.163.56.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.532Z)
   *** SIGSEGV (@0x7d7544b5d800) received by PID 9 (TID 0x7d7d2d5fe700) from PID 1152768000; stack trace: ***
9. 2023-08-30T14:11:55.443-07:00 [worker.10.78.158.21.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.158.21.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.443Z)
   *** SIGSEGV (@0x7d929a0e6d80) received by PID 9 (TID 0x7d993e1fc700) from PID 18446744071999221120; stack trace: ***
10. 2023-08-30T14:11:55.434-07:00 [worker.10.78.176.59.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.176.59.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.434Z)
    *** SIGSEGV (@0x7d8e8855d480) received by PID 9 (TID 0x7d96affff700) from PID 18446744071701910656; stack trace: ***
11. 2023-08-30T14:11:55.359-07:00 [worker.10.78.165.25.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.165.25.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.359Z)
    *** SIGSEGV (@0x7d69834a5200) received by PID 9 (TID 0x7d714e1fc700) from PID 18446744071617270272; stack trace: ***
12. 2023-08-30T14:11:55.168-07:00 [worker.10.78.168.1.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.168.1.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.168Z)
    *** SIGSEGV (@0x7d1a2ab1fb00) received by PID 9 (TID 0x7d20babfd700) from PID 716307200; stack trace: ***
13. 2023-08-30T14:11:55.071-07:00 [worker.10.78.147.183.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.147.183.presto-server$3Fstart$3D2023-08-30T21$253A11$253A55.071Z)
    *** SIGSEGV (@0x7dd882df7e00) received by PID 9 (TID 0x7ddf8cdfa700) from PID 18446744071610269184; stack trace: ***
14. 2023-08-30T14:11:54.919-07:00 [worker.10.78.156.44.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.156.44.presto-server$3Fstart$3D2023-08-30T21$253A11$253A54.919Z)
    *** SIGSEGV (@0x7cf339951900) received by PID 9 (TID 0x7cfa84dfa700) from PID 966072576; stack trace: ***
15. 2023-08-30T14:11:54.872-07:00 [worker.10.78.149.36.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.149.36.presto-server$3Fstart$3D2023-08-30T21$253A11$253A54.872Z)
    *** SIGSEGV (@0x7d6500fbec00) received by PID 9 (TID 0x7d6cb57fb700) from PID 16509952; stack trace: ***
16. 2023-08-30T14:11:54.779-07:00 [worker.10.78.128.255.presto-server](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Fahana$252Foss-prestodb$252Flarge-0-284-d7f5be4-b40n/log-events/worker.10.78.128.255.presto-server$3Fstart$3D2023-08-30T21$253A11$253A54.779Z)
    *** SIGSEGV (@0x7d27cae04780) received by PID 9 (TID 0x7d3039fff700) from PID 18446744072818280320; stack trace: ***

All entries above are from log group 932483864676:/ahana/oss-prestodb/large-0-284-d7f5be4-b40n.
I spot-checked a few workers and they all have the same stack trace.
@mbasmanova, @xiaoxmeng: Any ideas about this crash?
To debug something like this, try to simplify the query while still reproducing the crash. Once you have a simple query, reduce the amount of data it processes using a filter on "$path". Once you have a simple repro processing one file, you can implement the query in a unit test and use standard debugging tools.
@mbasmanova: We don't see these errors at lower scale factors of TPC-DS, including sf1K. It shows up only at the 10K scale factor (so 10 TB of data). We can try to determine the number of rows triggering the HashTable rehash here. But I just wanted to check whether you all have seen such a failure before and can share some clues.
This was the query that failed:
SELECT
"count"(DISTINCT "cs_order_number") "order count"
, "sum"("cs_ext_ship_cost") "total shipping cost"
, "sum"("cs_net_profit") "total net profit"
FROM
tpcds_sf100000_parquet_varchar.catalog_sales cs1
, tpcds_sf100000_parquet_varchar.date_dim
, tpcds_sf100000_parquet_varchar.customer_address
, tpcds_sf100000_parquet_varchar.call_center
WHERE ("d_date" BETWEEN CAST('2002-2-01' AS DATE) AND (CAST('2002-2-01' AS DATE) + INTERVAL '60' DAY))
AND ("cs1"."cs_ship_date_sk" = "d_date_sk")
AND ("cs1"."cs_ship_addr_sk" = "ca_address_sk")
AND ("ca_state" = 'GA')
AND ("cs1"."cs_call_center_sk" = "cc_call_center_sk")
AND ("cc_county" IN ('Williamson County', 'Williamson County', 'Williamson County', 'Williamson County', 'Williamson County'))
AND (EXISTS (
SELECT *
FROM
tpcds_sf100000_parquet_varchar.catalog_sales cs2
WHERE ("cs1"."cs_order_number" = "cs2"."cs_order_number")
AND ("cs1"."cs_warehouse_sk" <> "cs2"."cs_warehouse_sk")
))
AND (NOT (EXISTS (
SELECT *
FROM
tpcds_sf100000_parquet_varchar.catalog_returns cr1
WHERE ("cs1"."cs_order_number" = "cr1"."cr_order_number")
)))
ORDER BY "count"(DISTINCT "cs_order_number") ASC
LIMIT 100
Stages 1 and 2 failed. Stage 1 has the Aggregations and stage 2 is just a table scan.
Fragment 1 [HASH] >
Output layout: [count_317, sum_316, sum_318] >
Output partitioning: SINGLE [] >
Stage Execution Strategy: UNGROUPED_EXECUTION >
- Aggregate(PARTIAL) => [count_317:bigint, sum_316:varbinary, sum_318:varbinary] >
count_317 := "presto.default.count"((cs_order_number)) (mask = cs_order_number_313) (29:10) >
sum_316 := "presto.default.sum"((cs_net_profit)) (4:8) >
sum_318 := "presto.default.sum"((cs_ext_ship_cost)) (3:8) >
- MarkDistinct[distinct=cs_order_number:bigint marker=cs_order_number_313] => [cs_order_number:bigint, cs_ext_ship_cost:decimal(7,2), cs_net_profit:decimal(7,2), cs_order_number>
- LocalExchange[HASH] (cs_order_number) => [cs_order_number:bigint, cs_ext_ship_cost:decimal(7,2), cs_net_profit:decimal(7,2)] >
- FilterProject[filterPredicate = not(COALESCE(subquerytrue, BOOLEAN'false')), projectLocality = LOCAL] => [cs_order_number:bigint, cs_ext_ship_cost:decimal(7,2), cs_net>
- RightJoin[("cr_order_number" = "cs_order_number")] => [subquerytrue:boolean, cs_order_number:bigint, cs_ext_ship_cost:decimal(7,2), cs_net_profit:decimal(7,2)] >
Distribution: PARTITIONED >
- Project[projectLocality = LOCAL] => [subquerytrue:boolean, cr_order_number:bigint] >
Estimates: {source: CostBasedSourceInfo, rows: 9217045187 (1021.50GB), cpu: 620010869893.00, memory: 82953406683.00, network: 129655843209.00} >
subquerytrue := BOOLEAN'true' >
- Aggregate[cr_order_number] => [cr_order_number:bigint] >
Estimates: {source: CostBasedSourceInfo, rows: 9217045187 (1021.50GB), cpu: 518623372836.00, memory: 82953406683.00, network: 129655843209.00} >
- LocalExchange[HASH] (cr_order_number) => [cr_order_number:bigint] >
Estimates: {source: CostBasedSourceInfo, rows: 14406204801 (1.56TB), cpu: 388967529627.00, memory: 0.00, network: 129655843209.00} >
- RemoteSource[2] => [cr_order_number:bigint] >
- LocalExchange[HASH] (cs_order_number) => [cs_order_number:bigint, cs_ext_ship_cost:decimal(7,2), cs_net_profit:decimal(7,2)] >
- FilterProject[filterPredicate = (count_315) > (BIGINT'0'), projectLocality = LOCAL] => [cs_order_number:bigint, cs_ext_ship_cost:decimal(7,2), cs_net_profi>
- Aggregate(FINAL)[cs_ship_date_sk, cs_ship_addr_sk, cs_call_center_sk, cs_warehouse_sk, cs_order_number, cs_ext_ship_cost, cs_net_profit, d_date, ca_sta>
count_315 := "presto.default.count"((count_319)) >
- LocalExchange[HASH] (cs_ship_date_sk, cs_ship_addr_sk, cs_call_center_sk, cs_warehouse_sk, cs_order_number, cs_ext_ship_cost, cs_net_profit, d_date>
- Aggregate(PARTIAL)[cs_ship_date_sk, cs_ship_addr_sk, cs_call_center_sk, cs_warehouse_sk, cs_order_number, cs_ext_ship_cost, cs_net_profit, d_da>
count_319 := "presto.default.count"((non_null)) >
- RightJoin[("cs_order_number_26" = "cs_order_number") AND (cs_warehouse_sk) <> (cs_warehouse_sk_23)] => [non_null:boolean, cs_ship_date_sk:b>
Estimates: {source: CostBasedSourceInfo, rows: 28966521 (3.21GB), cpu: 29698269979850.87, memory: 589553296.57, network: 287509835213>
Distribution: PARTITIONED >
- RemoteSource[3] => [cs_order_number_26:bigint, cs_warehouse_sk_23:bigint, non_null:boolean] >
- LocalExchange[HASH] (cs_order_number) => [cs_ship_date_sk:bigint, cs_ship_addr_sk:bigint, cs_call_center_sk:bigint, cs_warehouse_sk:big>
Estimates: {source: CostBasedSourceInfo, rows: 3358884 (381.19MB), cpu: 18484753684194.89, memory: 243464742.30, network: 5593233>
- AssignUniqueId => [cs_ship_date_sk:bigint, cs_ship_addr_sk:bigint, cs_call_center_sk:bigint, cs_warehouse_sk:bigint, cs_order_numbe>
Estimates: {source: CostBasedSourceInfo, rows: 3358884 (381.19MB), cpu: 18484407595640.62, memory: 243464742.30, network: 559>
- RemoteSource[4] => [cs_ship_date_sk:bigint, cs_ship_addr_sk:bigint, cs_call_center_sk:bigint, cs_warehouse_sk:bigint, cs_order_>
>
Fragment 2 [SOURCE] >
Output layout: [cr_order_number] >
Output partitioning: HASH [cr_order_number] >
Stage Execution Strategy: UNGROUPED_EXECUTION >
- TableScan[TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf100000_parquet_varchar, tableName=catalog_returns, analyzePartitionValues=Optional.e>
Estimates: {source: CostBasedSourceInfo, rows: 14406204801 (120.75GB), cpu: 129655843209.00, memory: 0.00, network: 0.00} >
LAYOUT: tpcds_sf100000_parquet_varchar.catalog_returns{} >
cr_order_number := cr_order_number:bigint:16:REGULAR (26:9)
Karteek is trying to track down the line of code that caused the issue using a debug build. @majetideepak to sync offline.
I enabled a debug build:
2023-09-06T13:46:17.750-07:00 PC: @ 0x0 (unknown)
2023-09-06T13:46:17.750-07:00 *** SIGSEGV (@0x7d23f4cde500) received by PID 9 (TID 0x7d2b5dbfd700) from PID 18446744073521718528; stack trace: ***
2023-09-06T13:46:17.752-07:00 @ 0x7f3183a7048e google::(anonymous namespace)::FailureSignalHandler()
2023-09-06T13:46:17.753-07:00 @ 0x7f3180113cf0 (unknown)
2023-09-06T13:46:17.775-07:00 @ 0x70736af facebook::velox::exec::BaseHashTable::loadTags()
2023-09-06T13:46:17.809-07:00 @ 0x7082620 facebook::velox::exec::HashTable<>::insertForGroupBy()
With addr2line we can pinpoint the exact line of code where the segfault occurred:
[root@presto-worker-799d46f88-mb99l app]# addr2line -spiCe /usr/local/bin/presto_server 0x70736af
emmintrin.h:703
(inlined by) HashTable.h:289
static TagVector loadTags(uint8_t* tags, int32_t tagIndex) {
// Cannot use xsimd::batch::unaligned here because we need to skip TSAN.
auto src = tags + tagIndex;
#if XSIMD_WITH_SSE2
return TagVector(_mm_loadu_si128(reinterpret_cast<__m128i const*>(src))); <---- Causing segfault.
#elif XSIMD_WITH_NEON
return TagVector(vld1q_u8(src));
#endif
}
I simplified the query as follows to reproduce the issue:
SELECT
"count"(DISTINCT "cs_order_number") "order count"
FROM
tpcds_sf100000_parquet_varchar.catalog_sales cs1
WHERE
NOT (EXISTS (
SELECT *
FROM
tpcds_sf100000_parquet_varchar.catalog_returns cr1
WHERE ("cs1"."cs_order_number" = "cr1"."cr_order_number")
))
ORDER BY "count"(DISTINCT "cs_order_number") ASC
LIMIT 100;
2023-09-06T16:13:31.977-07:00 PC: @ 0x0 (unknown)
2023-09-06T16:13:31.977-07:00 *** SIGSEGV (@0x7dc751ff6700) received by PID 9 (TID 0x7dcdbebfd700) from PID 1375692544; stack trace: ***
2023-09-06T16:13:31.978-07:00 @ 0x7fd432da448e google::(anonymous namespace)::FailureSignalHandler()
2023-09-06T16:13:31.979-07:00 @ 0x7fd42f447cf0 (unknown)
2023-09-06T16:13:32.002-07:00 @ 0x70736af facebook::velox::exec::BaseHashTable::loadTags()
2023-09-06T16:13:32.036-07:00 @ 0x7082620 facebook::velox::exec::HashTable<>::insertForGroupBy()
2023-09-06T16:13:32.067-07:00 @ 0x708201c facebook::velox::exec::HashTable<>::insertBatch()
2023-09-06T16:13:32.080-07:00 @ 0x7081738 facebook::velox::exec::HashTable<>::rehash()
2023-09-06T16:13:32.097-07:00 @ 0x7081f18 facebook::velox::exec::HashTable<>::checkSize()
2023-09-06T16:13:32.127-07:00 @ 0x707d2b8 facebook::velox::exec::HashTable<>::groupProbe()
2023-09-06T16:13:32.138-07:00 @ 0x7036b18 facebook::velox::exec::GroupingSet::addInputForAct
Thanks to @czentgr for pointing out the issue with loadTags().
The failure in HashTable::loadTags() was due to a downcast of the address offset from int64_t to int32_t. For queries over large datasets with billions of rows as unique keys, the offset can easily exceed INT32_MAX and cause a segfault. PR https://github.com/facebookincubator/velox/pull/6485 addresses the issue. Tested a query that used to fail, on a 16-node cluster built from the branch in https://github.com/prestodb/presto/pull/20762/. The query passed:
presto> SELECT
-> "count"(DISTINCT "cs_order_number") "order count"
-> , "sum"("cs_ext_ship_cost") "total shipping cost"
-> , "sum"("cs_net_profit") "total net profit"
-> FROM
-> tpcds_sf100000_parquet_varchar.catalog_sales cs1
-> , tpcds_sf100000_parquet_varchar.date_dim
-> , tpcds_sf100000_parquet_varchar.customer_address
-> , tpcds_sf100000_parquet_varchar.call_center
-> WHERE ("d_date" BETWEEN CAST('2002-2-01' AS DATE) AND (CAST('2002-2-01' AS DATE) + INTERVAL '60' DAY))
-> AND ("cs1"."cs_ship_date_sk" = "d_date_sk")
-> AND ("cs1"."cs_ship_addr_sk" = "ca_address_sk")
-> AND ("ca_state" = 'GA')
-> AND ("cs1"."cs_call_center_sk" = "cc_call_center_sk")
-> AND ("cc_county" IN ('Williamson County', 'Williamson County', 'Williamson County', 'Williamson County', 'Williamson County'))
-> AND (EXISTS (
-> SELECT *
-> FROM
-> tpcds_sf100000_parquet_varchar.catalog_sales cs2
-> WHERE ("cs1"."cs_order_number" = "cs2"."cs_order_number")
-> AND ("cs1"."cs_warehouse_sk" <> "cs2"."cs_warehouse_sk")
-> ))
-> AND (NOT (EXISTS (
-> SELECT *
-> FROM
-> tpcds_sf100000_parquet_varchar.catalog_returns cr1
-> WHERE ("cs1"."cs_order_number" = "cr1"."cr_order_number")
-> )))
-> ORDER BY "count"(DISTINCT "cs_order_number") ASC
-> LIMIT 100;
order count | total shipping cost | total net profit
-------------+---------------------+------------------
1636473 | 7349832475.20 | -1526555543.44
(1 row)
Many queries failed at higher scale factors: SF 10TB on a large cluster with 16 r5.8xlarge workers (https://oss.ahana.dev/benchto-run-details?runs=397) and SF 100TB on an xlarge cluster with 32 r5.16xlarge workers (https://oss.ahana.dev/benchto-run-details?runs=398).
All of these failed queries show the same error.
There was no pattern in the failed nodes' IPs, so we can rule out a single-node failure.
Example query: Q55 (20230630_000636_00000_dikw7)
Error message and Stack trace
The query has 5 stages (0-4); stages 3 and 4 finished successfully.
All remaining tasks in stages 0, 1, and 2 failed or were aborted.
In stage 2, there were 28 tasks in total: 26 were aborted and 2 failed.
Example aborted task in stage 2: 20230630_000636_00000_dikw7.2.0.20.0
Example failed task in stage 2: 20230630_000636_00000_dikw7.2.0.20.0
Example failed task in stage 1: 20230630_000636_00000_dikw7.1.0.0.0
AbortedTasks.zip
The aborted task stack trace: