Open ethanyzhang opened 1 year ago
@amitkdutta, @mbasmanova, @oerling
@xiaoxmeng @majetideepak: This error is in AsyncDataCache. Can you please take a stab at it?
Berthold was able to reproduce it with SF 1K. Internal: https://ibm-analytics.slack.com/archives/C055RAP6BM0/p1695233450279949?thread_ts=1695229124.127739&cid=C055RAP6BM0
The invariant is that the file id is expected to be defined (i.e. exist in the FileIds() map) because the HiveDataSource owns a shared_ptr to the corresponding FileHandle. The FileHandle holds a StringIdLease that keeps the file id to file name mapping alive for the duration of the split. The file name acquires an id when the FileHandle is created, and the mapping stays alive as long as the FileHandle exists. The CacheInputStream then references the file by this id. If the mapping is used in an AsyncDataCacheEntry, the entry also keeps the id to name mapping alive because it holds a FileCacheKey containing the StringIdLease for the mapping.
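To make the ownership chain concrete, here is a minimal, illustrative sketch (these are made-up class and member names, not the actual Velox StringIdMap/StringIdLease API) of how a reference-counted id-to-name map plus an RAII lease keeps a file id defined for as long as any holder, such as a FileHandle or an AsyncDataCacheEntry, is alive:

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative id <-> name map with per-id reference counting.
class IdMap {
 public:
  // Pins 'name': creates an id on first use, bumps the refcount otherwise.
  uint64_t acquire(const std::string& name) {
    std::lock_guard<std::mutex> l(mutex_);
    auto it = nameToId_.find(name);
    if (it != nameToId_.end()) {
      ++entries_[it->second].refs;
      return it->second;
    }
    uint64_t id = nextId_++;
    nameToId_[name] = id;
    entries_[id] = {name, 1};
    return id;
  }

  // Drops one pin; the mapping disappears when the last holder releases it.
  void release(uint64_t id) {
    std::lock_guard<std::mutex> l(mutex_);
    auto& e = entries_.at(id);
    if (--e.refs == 0) {
      nameToId_.erase(e.name);
      entries_.erase(id);
    }
  }

  // Returns the file name if the id is still pinned; an empty result is the
  // "file id not defined" situation seen in the stack.
  std::optional<std::string> find(uint64_t id) const {
    std::lock_guard<std::mutex> l(mutex_);
    auto it = entries_.find(id);
    return it == entries_.end() ? std::nullopt
                                : std::optional<std::string>(it->second.name);
  }

 private:
  struct Entry {
    std::string name;
    int refs;
  };
  mutable std::mutex mutex_;
  uint64_t nextId_{1};
  std::unordered_map<std::string, uint64_t> nameToId_;
  std::unordered_map<uint64_t, Entry> entries_;
};

// RAII lease: as long as any lease for a name exists, its id stays defined.
class IdLease {
 public:
  IdLease(IdMap& map, const std::string& name)
      : map_(map), id_(map.acquire(name)) {}
  ~IdLease() { map_.release(id_); }
  uint64_t id() const { return id_; }

 private:
  IdMap& map_;
  uint64_t id_;
};
```

In this model the invariant can only be violated if the id handed to the cache was never pinned by a live lease, or if a lease's reference count was decremented incorrectly.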
Now we see from the stack that we are making an entry with a file id that does not have a live mapping to the file name. If this happens, we presumably have a wrong id in the CacheInputStream that differs from the id in the FileHandle owned by the HiveDataSource. This could be checked. One diagnostic is to log the FileHandle on the error throw path, e.g. with a folly::makeGuard that logs the FileHandle on the error unwind. Alternatively, the id to name mapping could have been deleted if its use count was decremented by some kind of memory corruption; checking the integrity of the StringIdMap and periodically asserting it is another diagnostic step. The reasonable action at this time is to add an integrity check at the place where the error is signalled.
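As a sketch of the first diagnostic (the FileHandleInfo struct and loadSplit() below are placeholders for the real objects at the error site, not Velox code), a folly::makeGuard can log the FileHandle's view of the id/name mapping only when the scope is left on the error path:

```cpp
#include <folly/ScopeGuard.h>
#include <glog/logging.h>
#include <cstdint>
#include <stdexcept>
#include <string>

// Placeholder for the FileHandle state we want to see in the log.
struct FileHandleInfo {
  uint64_t fileId;
  std::string fileName;
};

// Stands in for the read path that throws "file id not defined".
void loadSplit(const FileHandleInfo& handle) {
  throw std::runtime_error(
      "File id not defined: " + std::to_string(handle.fileId));
}

void loadWithDiagnostics(const FileHandleInfo& handle) {
  auto guard = folly::makeGuard([&]() {
    // Runs only when we unwind with an error: dump what the FileHandle thinks
    // the id -> name mapping is, so it can be compared with the id the
    // CacheInputStream handed to AsyncDataCache.
    LOG(ERROR) << "Cache load failed: fileId=" << handle.fileId
               << " fileName=" << handle.fileName;
  });
  loadSplit(handle);
  guard.dismiss(); // Success: skip the diagnostic log.
}
```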
Prestissimo Worker config:
presto.version=0.284
coordinator=false
http-server.http.port=8090
discovery.uri=http://xxx.yyy.ibm.com:8091/
optimizer.optimize-hash-generation=false
task.max-drivers-per-task=80
query.max-memory-per-node=600GB
Java Presto coordinator config:
coordinator=true
node-scheduler.include-coordinator=false
optimizer.optimize-hash-generation=false
http-server.http.port=8091
query.max-memory=1.9TB
query.max-total-memory=2.0TB
query.max-memory-per-node=300GB
query.max-total-memory-per-node=310GB
query.stage-count-warning-threshold=100
query.max-history=3000
# default for heap headroom is: JVM max memory * 0.3; reduce to 20%
memory.heap-headroom-per-node=70GB
discovery-server.enabled=true
discovery.uri=http://xxx.yyyy.ibm.com:8091
task.max-drivers-per-task=80
# To make q35 work: VeloxRuntimeError: vector Unexpected type of the result vector: BIGINT
use-alternative-function-signatures=true
@majetideepak can you give an update?
No longer reproducible.
@yzhang1991: We may not hit this again if we are lucky. But we should add diagnostics as Orri has suggested in https://github.com/prestodb/presto/issues/20019#issuecomment-1730489305
@oerling @xiaoxmeng @aditi-pandit The problem does not appear to be reproducible at this time. There were some changes in the area which may have fixed it. Nonetheless, I made a draft PR to propose changes that check the StringIdMap's internal maps against each other (https://github.com/facebookincubator/velox/pull/7458) and add a logging point when the StringIdLease destructor is called (i.e. when the FileHandle is deallocated). There is already a VLOG(1) log point when the FileHandle starts a lease, so we can see the full life cycle of the lease. If the issue occurs again, we can run a repro with the --v=1 log option when starting presto_server to get better diagnostics.
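For context, a hedged sketch of the extra logging point (names are illustrative, not the actual velox::StringIdLease or the contents of the linked PR): a VLOG(1) in the lease destructor pairs with the existing creation-time VLOG(1), so when presto_server runs with --v=1 both ends of the lease's life cycle show up in the log.

```cpp
#include <glog/logging.h>
#include <cstdint>
#include <string>
#include <utility>

// Illustrative only; not the real Velox class.
class FileIdLease {
 public:
  FileIdLease(uint64_t id, std::string fileName)
      : id_(id), fileName_(std::move(fileName)) {
    // Mirrors the existing VLOG(1) emitted when the FileHandle starts a lease.
    VLOG(1) << "FileHandle acquired lease: id=" << id_ << " file=" << fileName_;
  }
  ~FileIdLease() {
    // Proposed additional log point: the lease (and hence the FileHandle)
    // going away is now visible, so a cache entry created with this id after
    // this point stands out in the verbose log.
    VLOG(1) << "Lease released: id=" << id_ << " file=" << fileName_;
  }

 private:
  uint64_t id_;
  std::string fileName_;
};
```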
Only found when running TPC-DS SF-10K