prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Presto C++ can't read Iceberg tables with spaces in the column names #23131

Closed yingsu00 closed 2 days ago

yingsu00 commented 4 months ago

Any column with whitespace in the name seems to cause this issue. Reproduced using the following steps:

In a Presto engine:

create table iceberg_data.pool.space ("two words" int) with (format='parquet');
set session iceberg_data.parquet_writer_version = 'PARQUET_1_0';
insert into iceberg_data.pool.space values (1), (2), (3);

Then from prestissimo:

select * from iceberg_data.pool.space;

Backtrace from GDB

```
#0 facebook::velox::common::Tokenizer::computeNext (this=0xfde2c9fb4640) at /workspaces/presto/presto-native-execution/velox/velox/type/Tokenizer.cpp:83
#1 0x00000000087fe224 in facebook::velox::common::Tokenizer::tryToComputeNext (this=0xfde2c9fb4640) at /workspaces/presto/presto-native-execution/velox/velox/type/Tokenizer.cpp:218
#2 0x00000000087fd5dc in facebook::velox::common::Tokenizer::hasNext (this=0xfde2c9fb4640) at /workspaces/presto/presto-native-execution/velox/velox/type/Tokenizer.cpp:39
#3 0x00000000087f70f8 in facebook::velox::common::Subfield::Subfield (this=0xfde2c9fb4720, path="two words", separators=std::shared_ptr (use count 2, weak count 0) = {...}) at /workspaces/presto/presto-native-execution/velox/velox/type/Subfield.cpp:34
#4 0x0000000001c745a0 in facebook::velox::common::ScanSpec::addField (this=0xfde2b4002d30, name="two words", channel=0) at /workspaces/presto/presto-native-execution/velox/velox/dwio/common/ScanSpec.cpp:386
#5 0x0000000001c74624 in facebook::velox::common::ScanSpec::addFieldRecursively (this=0xfde2b4002d30, name="two words", type=..., channel=0) at /workspaces/presto/presto-native-execution/velox/velox/dwio/common/ScanSpec.cpp:396
#6 0x0000000000627f24 in facebook::velox::connector::hive::makeScanSpec (rowType=std::shared_ptr (use count 1, weak count 0) = {...}, outputSubfields=..., filters=std::unordered_map with 0 elements, dataColumns=std::shared_ptr (use count 1, weak count 0) = {...}, partitionKeys=std::unordered_map with 0 elements, infoColumns=std::unordered_map with 0 elements, pool=0xfde2e4033c70) at /workspaces/presto/presto-native-execution/velox/velox/connectors/hive/HiveConnectorUtil.cpp:365
#7 0x00000000006ade1c in facebook::velox::connector::hive::HiveDataSource::HiveDataSource (this=0xfde2b4002ab0, outputType=std::shared_ptr (use count 3, weak count 0) = {...}, tableHandle=std::shared_ptr (use count 3, weak count 0) = {...}, columnHandles=std::unordered_map with 1 element = {...}, fileHandleFactory=0x43b8dc98, executor=0x43b83940, connectorQueryCtx=0xfde2cc0be720, hiveConfig=std::shared_ptr (use count 2, weak count 0) = {...}) at /workspaces/presto/presto-native-execution/velox/velox/connectors/hive/HiveDataSource.cpp:149
#8 0x0000000000607abc in std::make_unique const&, std::shared_ptr const&, std::unordered_map, std::allocator >, std::shared_ptr, std::hash, std::allocator > >, std::equal_to, std::allocator > >, std::allocator, std::allocator > const, std::shared_ptr > > > const&, facebook::velox::CachedFactory, std::allocator >, std::shared_ptr, facebook::velox::FileHandleGenerator>*, folly::Executor*&, facebook::velox::connector::ConnectorQueryCtx*&, std::shared_ptr const&> () at /usr/include/c++/11/bits/unique_ptr.h:962
#9 0x00000000006004b4 in facebook::velox::connector::hive::HiveConnector::createDataSource (this=0x43b8dc60, outputType=std::shared_ptr (use count 3, weak count 0) = {...}, tableHandle=std::shared_ptr (use count 3, weak count 0) = {...}, columnHandles=std::unordered_map with 1 element = {...}, connectorQueryCtx=0xfde2cc0be720) at /workspaces/presto/presto-native-execution/velox/velox/connectors/hive/HiveConnector.cpp:85
#10 0x0000000007eb33d8 in facebook::velox::exec::TableScan::getOutput (this=0xfde2e4033660) at /workspaces/presto/presto-native-execution/velox/velox/exec/TableScan.cpp:155
#11 0x0000000007cb6ab8 in facebook::velox::exec::Driver::runInternal (this=0xfde2e4033570, self=std::shared_ptr (use count 3, weak count 1) = {...}, blockingState=std::shared_ptr (empty) = {...}, result=std::shared_ptr (empty) = {...}) at /workspaces/presto/presto-native-execution/velox/velox/exec/Driver.cpp:594
#12 0x0000000007cb858c in facebook::velox::exec::Driver::run (self=std::shared_ptr (use count 3, weak count 1) = {...}) at /workspaces/presto/presto-native-execution/velox/velox/exec/Driver.cpp:776
#13 0x0000000007cb4cd4 in operator() (__closure=0xfde2c9fb6430) at /workspaces/presto/presto-native-execution/velox/velox/exec/Driver.cpp:270
#14 0x0000000007cbba2c in folly::detail::function::FunctionTraits::callSmall):: >(folly::detail::function::Data &) (p=...) at /usr/local/include/folly/Function.h:341
#15 0x00000000005de5dc in folly::detail::function::FunctionTraits::operator()() (this=0xfde2c9fb6430) at /usr/local/include/folly/Function.h:363
#16 0x0000fde58cca0ab8 in folly::catch_exception&, void (&)(char const*) noexcept, char const*&, void>(folly::Function&, void (&)(char const*) noexcept, char const*&) (c=, t=...) at /root/setup/folly/folly/lang/Exception.h:286
#17 folly::Executor::invokeCatchingExns >(char const*, folly::Function) (f=..., p=0xfde58ccc0038 "ThreadPoolExecutor: func") at /root/setup/folly/folly/Executor.h:234
#18 folly::ThreadPoolExecutor::runTask (this=0x43b792c0, thread=std::shared_ptr (use count 3, weak count 0) = {...}, task=...) at /root/setup/folly/folly/executors/ThreadPoolExecutor.cpp:142
#19 0x0000000008886880 in folly::CPUThreadPoolExecutor::threadRun (this=0x43b792c0, thread=std::shared_ptr (use count 3, weak count 0) = {...}) at /root/setup/folly/folly/executors/CPUThreadPoolExecutor.cpp:350
#20 0x0000fde58ccac294 in std::__invoke_impl), folly::ThreadPoolExecutor*&, std::shared_ptr&> (__f=@0xfde2d4023240: &virtual folly::ThreadPoolExecutor::threadRun(std::shared_ptr), __t=@0xfde2d4023260: 0x43b792c0) at /usr/include/c++/11/bits/invoke.h:74
#21 0x0000fde58ccaaf28 in std::__invoke), folly::ThreadPoolExecutor*&, std::shared_ptr&> (__fn=@0xfde2d4023240: &virtual folly::ThreadPoolExecutor::threadRun(std::shared_ptr)) at /usr/include/c++/11/bits/invoke.h:96
#22 0x0000fde58cca9af4 in std::_Bind))(std::shared_ptr)>::__call(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (this=0xfde2d4023240, __args=...) at /usr/include/c++/11/functional:420
#23 0x0000fde58cca86f4 in std::_Bind))(std::shared_ptr)>::operator()<, void>() (this=0xfde2d4023240) at /usr/include/c++/11/functional:503
#24 0x0000fde58cca6dc0 in folly::detail::function::FunctionTraits::callSmall))(std::shared_ptr)> >(folly::detail::function::Data&) (p=...) at /root/setup/folly/folly/Function.h:341
#25 0x00000000005de5dc in folly::detail::function::FunctionTraits::operator()() (this=0xfde2d4023240) at /usr/local/include/folly/Function.h:363
#26 0x00000000009bedf0 in folly::NamedThreadFactory::newThread(folly::Function&&)::{lambda()#1}::operator()() (__closure=0xfde2d4023240) at /root/setup/folly/folly/executors/thread_factory/NamedThreadFactory.h:40
#27 0x00000000009fde2c in std::__invoke_impl&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function&&)::{lambda()#1}&&) (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#28 0x00000000009fdde4 in std::__invoke&&)::{lambda()#1}>(folly::NamedThreadFactory::newThread(folly::Function&&)::{lambda()#1}&&) (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#29 0x00000000009fdd80 in std::thread::_Invoker&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0xfde2d4023240) at /usr/include/c++/11/bits/std_thread.h:259
#30 0x00000000009fdb70 in std::thread::_Invoker&&)::{lambda()#1}> >::operator()() (this=0xfde2d4023240) at /usr/include/c++/11/bits/std_thread.h:266
#31 0x00000000009fd788 in std::thread::_State_impl&&)::{lambda()#1}> > >::_M_run() (this=0xfde2d4023230) at /usr/include/c++/11/bits/std_thread.h:211
#32 0x0000fde589a368fc in execute_native_thread_routine () from /lib64/libstdc++.so.6
#33 0x0000fde589754698 in start_thread () from /lib64/libc.so.6
#34 0x0000fde5897bebdc in thread_start () from /lib64/libc.so.6
```

Your Environment

N/A

Expected Behavior

The query should return the inserted values (1, 2, 3).

This issue has several underlying causes:

  1. Velox tokenizer doesn't honor spaces: https://github.com/facebookincubator/velox/issues/10348
  2. After the first issue is fixed, the query still does not return the values; instead, it outputs all NULLs. This is because
yingsu00 commented 2 days ago

Fixed in the Tokenizer
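
For readers unfamiliar with this class of bug, here is a minimal sketch of how a path tokenizer can end up rejecting a column name such as "two words". It is not the Velox code: `splitPath` and its `allowSpaces` flag are hypothetical names made up for illustration, and the real Tokenizer handles more syntax (subscripts, quoting, configurable separators) than shown here.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, NOT Velox's Tokenizer: splits a dotted path such as
// "a.b" into field names. With allowSpaces == false, a space inside a name is
// treated as an invalid character, so a column called "two words" is rejected.
std::vector<std::string> splitPath(const std::string& path, bool allowSpaces) {
  std::vector<std::string> names;
  std::string current;
  for (char c : path) {
    if (c == '.') {
      // A separator ends the current name; empty names are not allowed.
      if (current.empty()) {
        throw std::invalid_argument("Empty field name in path: " + path);
      }
      names.push_back(current);
      current.clear();
    } else if (
        std::isalnum(static_cast<unsigned char>(c)) || c == '_' ||
        (allowSpaces && c == ' ')) {
      current.push_back(c);
    } else {
      throw std::invalid_argument("Invalid character in path: " + path);
    }
  }
  if (current.empty()) {
    throw std::invalid_argument("Empty field name in path: " + path);
  }
  names.push_back(current);
  return names;
}
```

In this sketch, `splitPath("two words", false)` throws `std::invalid_argument`, while `splitPath("two words", true)` returns a single field name containing the space. Accepting the space as an ordinary identifier character is only one possible design; requiring such names to be quoted would be another.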