Open wypb opened 3 months ago
Hi @majetideepak @aditi-pandit could you please help review this PR? Thanks!
@wypb can you add some end-to-end tests? Thanks!
@wypb : Would be great to use ORC with the QueryRunners (https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/PrestoNativeQueryRunnerUtils.java) in an e2e test. The test should highlight differences of ORC wrt Parquet, demonstrate filter pushdown as well. Using ORC with Hive and as a format with Iceberg is perfect.
Hi @majetideepak @aditi-pandit I added TPCH tests for ORC, including the Iceberg data source. I did not add the TPCDS tests for ORC because Velox's ORC reader does not yet implement the fast path for some types, which causes exceptions when reading data.
Caused by: java.lang.RuntimeException: rawResultNulls_ && rawValues_ Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:777)
at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:745)
at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
... 30 more
Caused by: VeloxRuntimeError: rawResultNulls_ && rawValues_ Split [Hive: file:/data/home/velox/data/code/apache/presto/presto-native-execution/target/velox_data/ORC/hive_data/tpcds/customer/20240626_113756_00003_dsz82_7b5037de-a5c5-4d98-98f3-a626fcf41580 0 - 46945] Task 20240626_115719_00002_hwpq2.22.0.0.0 Operator: PartitionedOutput[root.91] 1
at Unknown.# 0 _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
at Unknown.# 1 _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_(Unknown Source)
at Unknown.# 2 _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
at Unknown.# 3 _ZN8facebook5velox17VeloxRuntimeErrorC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_(Unknown Source)
at Unknown.# 4 _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
at Unknown.# 5 _ZN8facebook5velox4dwio6common21SelectiveColumnReader7addNullIiEEvv(Unknown Source)
at Unknown.# 6 _ZN8facebook5velox4dwio6common15ExtractToReader7addNullIiEEvi(Unknown Source)
at Unknown.# 7 _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE7addNullEv(Unknown Source)
at Unknown.# 8 _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE19filterPassedForNullEv(Unknown Source)
at Unknown.# 9 _ZN8facebook5velox4dwio6common13ColumnVisitorIiNS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EE11processNullERb(Unknown Source)
at Unknown.# 10 _ZN8facebook5velox4dwrf12RleDecoderV2ILb0EE15readWithVisitorILb1ENS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS6_15ExtractToReaderELb1EEEEEvPKmT0_(Unknown Source)
at Unknown.# 11 _ZN8facebook5velox4dwio6common21SelectiveColumnReader17decodeWithVisitorINS0_4dwrf12RleDecoderV2ILb0EEENS2_29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS2_15ExtractToReaderELb1EEEEEvPNS2_10IntDecoderIXsrT_9kIsSignedEEERT0_(Unknown Source)
at Unknown.# 12 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader15readWithVisitorINS0_4dwio6common29StringDictionaryColumnVisitorINS0_6common10AlwaysTrueENS5_15ExtractToReaderELb1EEEEEvN5folly5RangeIPKiEET_(Unknown Source)
at Unknown.# 13 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader10readHelperINS0_6common10AlwaysTrueELb1ENS0_4dwio6common15ExtractToReaderEEEvPNS4_6FilterEN5folly5RangeIPKiEET1_(Unknown Source)
at Unknown.# 14 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader13processFilterILb1ENS0_4dwio6common15ExtractToReaderEEEvPNS0_6common6FilterEN5folly5RangeIPKiEET0_(Unknown Source)
at Unknown.# 15 _ZN8facebook5velox4dwrf37SelectiveStringDictionaryColumnReader4readEiN5folly5RangeIPKiEEPKm(Unknown Source)
at Unknown.# 16 _ZN8facebook5velox4dwio6common12ColumnLoader12loadInternalEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
at Unknown.# 17 _ZN8facebook5velox12VectorLoader4loadEN5folly5RangeIPKiEEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
at Unknown.# 18 _ZN8facebook5velox12VectorLoader12loadInternalERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
at Unknown.# 19 _ZN8facebook5velox12VectorLoader4loadERKNS0_17SelectivityVectorEPNS0_9ValueHookEiPSt10shared_ptrINS0_10BaseVectorEE(Unknown Source)
at Unknown.# 20 _ZNK8facebook5velox10LazyVector18loadVectorInternalEv(Unknown Source)
at Unknown.# 21 _ZNK8facebook5velox10LazyVector18loadedVectorSharedEv(Unknown Source)
at Unknown.# 22 _ZNK8facebook5velox10LazyVector12loadedVectorEv(Unknown Source)
at Unknown.# 23 _ZN8facebook5velox10serializer6presto17PrestoVectorSerde22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
at Unknown.# 24 _ZN8facebook5velox17VectorStreamGroup22estimateSerializedSizeEPKNS0_10BaseVectorEN5folly5RangeIPKiEEPPiRNS0_7ScratchE(Unknown Source)
at Unknown.# 25 _ZN8facebook5velox4exec17PartitionedOutput16estimateRowSizesEv(Unknown Source)
at Unknown.# 26 _ZN8facebook5velox4exec17PartitionedOutput8addInputESt10shared_ptrINS0_9RowVectorEE(Unknown Source)
at Unknown.# 27 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
at Unknown.# 28 _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
at Unknown.# 29 _ZZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS2_EENKUlvE_clEv(Unknown Source)
at Unknown.# 30 _ZN5folly6detail8function5call_IZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS6_EEUlvE_Lb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
at Unknown.# 31 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
at Unknown.# 32 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
at Unknown.# 33 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
at Unknown.# 34 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
at Unknown.# 35 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
at Unknown.# 36 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
at Unknown.# 37 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
at Unknown.# 38 _ZN5folly6detail8function5call_ISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS4_6ThreadEEEPS4_S7_EELb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
at Unknown.# 39 _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
at Unknown.# 40 _ZZN5folly18NamedThreadFactory9newThreadEONS_8FunctionIFvvEEEENUlvE_clEv(Unknown Source)
at Unknown.# 41 _ZSt13__invoke_implIvZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEET_St14__invoke_otherOT0_DpOT1_(Unknown Source)
at Unknown.# 42 _ZSt8__invokeIZN5folly18NamedThreadFactory9newThreadEONS0_8FunctionIFvvEEEEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS8_DpOS9_(Unknown Source)
at Unknown.# 43 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE(Unknown Source)
at Unknown.# 44 _ZNSt6thread8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS2_8FunctionIFvvEEEEUlvE_EEEclEv(Unknown Source)
at Unknown.# 45 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5folly18NamedThreadFactory9newThreadEONS3_8FunctionIFvvEEEEUlvE_EEEEE6_M_runEv(Unknown Source)
at Unknown.# 46 0x00000000000c2b23(Unknown Source)
at Unknown.# 47 start_thread(Unknown Source)
at Unknown.# 48 clone(Unknown Source)
@wypb : Your code looks fine. When I search for ORC in the presto-native-execution directory I also see the following usage.
https://github.com/prestodb/presto/blob/master/presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestWriter.java#L71 needs a fix as well
Could you please check it?
Good catch, thank you @aditi-pandit I've fixed it.
@aditi-pandit I looked at the code again and found that this should not be removed. testCreateTableWithUnsupportedFormats is used to test the Velox ORC writer, and Velox currently does not support ORC writing.
@wypb : Had a question about this point you raised... You are saying that HiveQueryRunner can't read TPC-DS tables, but handles TPC-H. That seems odd. Did you look deeper into what TPC-H is doing differently? The main difference is that in TPC-H all date columns were exposed as VARCHAR. But I wonder if there is anything else. It would be great to see which particular column is problematic here.
@wypb is this PR still being worked on?
Hi @tdcmeehan sorry for the late reply.
Yes, I'm still keeping an eye on this. I've been working on a few Velox PRs lately, so I haven't had time to work on this yet. I'll update this PR later this week.
Let's add ORC as a supported file format in Supported Use Cases (we can also mention that Parquet is a supported format).
Should this be mentioned in the doc, maybe in Supported Use Cases or Presto C++ Features?
Hi @tdcmeehan, @aditi-pandit sorry for the late reply.
@wypb : Had a question about this point you raised... You are saying that HiveQueryRunner can't read TPC-DS tables, but handles TPC-H. That seems odd. Did you look deeper into what TPC-H is doing differently? The main difference is that in TPC-H all date columns were exposed as VARCHAR. But I wonder if there is anything else. It would be great to see which particular column is problematic here.
I was also curious about this earlier but had not checked the reason. Today I looked into why most of the TPCDS queries fail while all the TPCH queries pass. After debugging, I found that integer fields in the TPCDS tables (such as the cs_sold_date_sk field of the catalog_sales table) may be NULL, and Velox does not implement the fast path logic for integer fields encoded as RLEv2 in ORC. Together, these two facts cause most of the TPCDS queries to fail. The TPCH table fields are never NULL, so the exception is not triggered.
For related code, see SelectiveColumnReader::prepareNulls
https://github.com/facebookincubator/velox/blob/main/velox/dwio/common/SelectiveColumnReader.cpp#L103-L129
void SelectiveColumnReader::prepareNulls(
    RowSet rows,
    bool hasNulls,
    int32_t extraRows) {
  if (!hasNulls) {
    anyNulls_ = false;
    return;
  }
  initReturnReaderNulls(rows);
  if (returnReaderNulls_) {
    // No need for null flags if fast path.
    return;
  }
  auto numRows = rows.size() + extraRows;
  if (resultNulls_ && resultNulls_->unique() &&
      resultNulls_->capacity() >= bits::nbytes(numRows) + simd::kPadding) {
    resultNulls_->setSize(bits::nbytes(numRows));
  } else {
    resultNulls_ = AlignedBuffer::allocate<bool>(
        numRows + (simd::kPadding * 8), &memoryPool_);
    rawResultNulls_ = resultNulls_->asMutable<uint64_t>();
  }
  anyNulls_ = false;
  // Clear whole capacity because future uses could hit uncleared data between
  // capacity() and 'numBytes'.
  simd::memset(rawResultNulls_, bits::kNotNullByte, resultNulls_->capacity());
}
For the TPCH tables, hasNulls is false, so rawResultNulls_ does not need to be initialized and SelectiveColumnReader::addNull() is never called. (addNull() contains VELOX_DCHECK(rawResultNulls_ && rawValues_), which is the check that fails when querying the TPCDS tables.) For the TPCDS tables, hasNulls is true, so SelectiveColumnReader::initReturnReaderNulls runs and sets returnReaderNulls_ = true. SelectiveColumnReader::prepareNulls then returns early without initializing rawResultNulls_, and the subsequent call to SelectiveColumnReader::addNull() throws.
void SelectiveColumnReader::initReturnReaderNulls(RowSet rows) {
  if (useBulkPath() && !scanSpec_->hasFilter()) {
    anyNulls_ = nullsInReadRange_ != nullptr;
    bool isDense = rows.back() == rows.size() - 1;
    returnReaderNulls_ = anyNulls_ && isDense;
  } else {
    returnReaderNulls_ = false;
  }
}
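To make the sequence above easier to follow, here is a minimal, self-contained sketch of the same control flow. It is a toy model only: ToyReader, addNullSucceeds, and the std::vector null buffer are hypothetical stand-ins that I made up for illustration, not the Velox classes, and the real reader tracks much more state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy model of the null-buffer handling discussed in this thread.
// All names are hypothetical; this is NOT the Velox SelectiveColumnReader.
class ToyReader {
 public:
  ToyReader(bool bulkPath, bool hasFilter, bool nullsInReadRange)
      : bulkPath_(bulkPath),
        hasFilter_(hasFilter),
        nullsInReadRange_(nullsInReadRange) {}

  // Mirrors the shape of prepareNulls(): skip allocation when the reader
  // believes the bulk (fast) path will return the nulls itself.
  void prepareNulls(std::size_t numRows, bool hasNulls, bool isDense) {
    if (!hasNulls) {
      return;
    }
    returnReaderNulls_ =
        bulkPath_ && !hasFilter_ && nullsInReadRange_ && isDense;
    if (returnReaderNulls_) {
      return; // Early return: the result null buffer stays unallocated.
    }
    resultNulls_.assign(numRows, 0);
  }

  // Mirrors the check in addNull(): the buffer must exist before use.
  void addNull(std::size_t row) {
    if (resultNulls_.empty()) {
      throw std::runtime_error("null buffer was never prepared");
    }
    resultNulls_[row] = 1;
  }

 private:
  bool bulkPath_;
  bool hasFilter_;
  bool nullsInReadRange_;
  bool returnReaderNulls_ = false;
  std::vector<std::uint8_t> resultNulls_; // Empty models a null raw pointer.
};

// Returns true if addNull() works after prepareNulls() on a dense,
// nullable 4-row batch.
bool addNullSucceeds(ToyReader reader) {
  reader.prepareNulls(4, /*hasNulls=*/true, /*isDense=*/true);
  try {
    reader.addNull(0);
  } catch (const std::runtime_error&) {
    return false;
  }
  return true;
}
```

In this model, addNullSucceeds(ToyReader(true, false, true)) fails because the reader claims a bulk path but a visitor still calls addNull(), while ToyReader(false, false, true) succeeds because the buffer gets allocated.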
If we modify the implementation of SelectiveIntegerDirectColumnReader::hasBulkPath() as follows, the TPCDS queries also succeed.
bool hasBulkPath() const override {
  return format == DwrfFormat::kOrc && version == RleVersion_2 ? false : true;
}
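The condition in that override can be checked in isolation. The sketch below restates it as a free function over two small enums; FileFormat, RleVersion, and this standalone hasBulkPath are hypothetical mirrors that I introduced for illustration, and the actual member names and enum values in Velox may differ.

```cpp
#include <cassert>

// Hypothetical mirrors of the format and RLE version used in the snippet
// above; these are not the Velox definitions.
enum class FileFormat { kDwrf, kOrc };
enum class RleVersion { kV1, kV2 };

// Same truth table as `format == kOrc && version == V2 ? false : true`:
// the bulk path is disabled only for ORC files encoded with RLEv2.
constexpr bool hasBulkPath(FileFormat format, RleVersion version) {
  return !(format == FileFormat::kOrc && version == RleVersion::kV2);
}

static_assert(!hasBulkPath(FileFormat::kOrc, RleVersion::kV2),
              "ORC + RLEv2 must fall back to the non-bulk path");
static_assert(hasBulkPath(FileFormat::kDwrf, RleVersion::kV1),
              "DWRF keeps the bulk path");
```

Every other combination keeps the bulk path, so existing DWRF reads would be unaffected by the change under this assumption.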
As per https://orc.apache.org/docs/types.html, ORC supports the DATE type. The DWRF reader doesn't support DATE as a first-class type, so we coerced all those columns to VARCHAR in tests. Do you have a plan for those?
DWRF does not support the DATE type, but Velox reads ORC's DATE type using SelectiveIntegerDirectColumnReader. My tests show that DATE data is read correctly, so I don't think it is necessary to convert the DATE type to VARCHAR.
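For context on why an integer reader is enough here: ORC stores a DATE value as a signed integer counting days since 1970-01-01, so the reader only has to return integers and the engine interprets them as dates. A small sketch of that interpretation follows; CivilDate and civilFromDays are hypothetical helper names for illustration, not Velox or ORC APIs, and the conversion is the well-known days-to-civil-date arithmetic for the proleptic Gregorian calendar.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of Velox or ORC.
struct CivilDate {
  std::int64_t year;
  unsigned month;
  unsigned day;
};

// Converts a day count relative to 1970-01-01 into year/month/day.
CivilDate civilFromDays(std::int64_t days) {
  days += 719468; // Shift the epoch to 0000-03-01.
  const std::int64_t era = (days >= 0 ? days : days - 146096) / 146097;
  const unsigned doe = static_cast<unsigned>(days - era * 146097);
  const unsigned yoe =
      (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
  const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
  const unsigned mp = (5 * doy + 2) / 153; // Month index, March = 0.
  const unsigned day = doy - (153 * mp + 2) / 5 + 1;
  const unsigned month = mp < 10 ? mp + 3 : mp - 9;
  const std::int64_t year =
      static_cast<std::int64_t>(yoe) + era * 400 + (month <= 2 ? 1 : 0);
  return CivilDate{year, month, day};
}
```

For example, civilFromDays(0) yields 1970-01-01 and civilFromDays(19900) yields 2024-06-26.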
Hi @tdcmeehan @steveburnett I have added the relevant documentation to Supported Use Cases. Could you please review it? Thank you.
@wypb : Thanks. This is a good find. Would be great to follow up the Velox change and then submit this PR as having the NULL values should be a common use-case.
Description
We have recently merged the PR for reading ORC statistics and implementing OrcReader based on DwrfReader on the Velox side. Now it is time to add support for the ORC reader in Prestissimo.
NOTE: Presto writes ORC files with RLEv2 encoding, and the Velox ORC reader does not implement fast path readers for some types, so Velox throws exceptions when reading these files. End-to-end TPCDS tests in ORC are therefore not added here. Once Velox implements fast path readers for ORC RLEv2 encoding, we need to add those ORC tests.