trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Special characters in a partitioned column fail in Azure storage #17012

Open pajaks opened 1 year ago

pajaks commented 1 year ago

Inserting values with special characters into a partition column causes subsequent SELECT queries to fail with a FileNotFoundException.

CREATE TABLE tableName (key integer, value varchar)
WITH (location = tableLocation, partitioned_by = ARRAY['value']);
INSERT INTO tableName VALUES (1, 'with?question');
SELECT * FROM tableName;
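For context on the failure below: the partition value lands in the object path as the directory segment value=with?question, and when that path is turned into an ABFS REST URL both the = and the ? must be percent-encoded, otherwise the ? is read as the start of the query string. A minimal sketch of that encoding (plain Python for illustration, not Trino code):

```python
from urllib.parse import quote

# Hypothetical partition directory segment as written to storage.
segment = "value=with?question"

# Percent-encode every reserved character (safe="") so '?' cannot be
# mistaken for the start of the URL query string.
encoded = quote(segment, safe="")
print(encoded)  # value%3Dwith%3Fquestion, matching the HEAD URL in the trace below
```

The encoded form matches the value%3Dwith%3Fquestion segment visible in the HEAD request of the stack trace.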

Test case for reference: https://github.com/pajaks/trino/pull/8/files#diff-5053d5f66df3044008495f77d7722a7785aef8bd22c26e5b4f95819cb82c31a4R1708-R1728

Caused by: io.trino.spi.TrinoException: Error opening Hive split abfs://container@automation3.dfs.core.windows.net/test-delta-lake-integration-smoke-test-mn0vm27bvv/test_optimize_partitioned_table_csjbhy44ko/value=with?question/20230413_092851_00004_bhvvq-18bc91aa-2b1c-458a-a297-cc33971ba25e (offset=0, length=199): HEAD https://automation3.dfs.core.windows.net/container/test-delta-lake-integration-smoke-test-mn0vm27bvv/test_optimize_partitioned_table_csjbhy44ko/value%3Dwith%3Fquestion/20230413_092851_00004_bhvvq-18bc91aa-2b1c-458a-a297-cc33971ba25e?timeout=90
StatusCode=404
StatusDescription=The specified path does not exist.
ErrorCode=
ErrorMessage=
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:312)
    at io.trino.plugin.deltalake.DeltaLakePageSourceProvider.createPageSource(DeltaLakePageSourceProvider.java:203)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:49)
    at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:62)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:266)
    at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.process(ScanFilterAndProjectOperator.java:194)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:360)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
    at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:241)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
    at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$3(WorkProcessorUtils.java:256)
    at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
    at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:146)
    at io.trino.operator.Driver.processInternal(Driver.java:402)
    at io.trino.operator.Driver.lambda$process$8(Driver.java:305)
    at io.trino.operator.Driver.tryWithLock(Driver.java:701)
    at io.trino.operator.Driver.process(Driver.java:297)
    at io.trino.operator.Driver.processForDuration(Driver.java:268)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:845)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:537)
    at io.trino.$gen.Trino_testversion____20230413_092812_1.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.FileNotFoundException: HEAD https://automation3.dfs.core.windows.net/container/test-delta-lake-integration-smoke-test-mn0vm27bvv/test_optimize_partitioned_table_csjbhy44ko/value%3Dwith%3Fquestion/20230413_092851_00004_bhvvq-18bc91aa-2b1c-458a-a297-cc33971ba25e?timeout=90
StatusCode=404
StatusDescription=The specified path does not exist.
ErrorCode=
ErrorMessage=
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:926)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:177)
    at io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper.open(TrinoFileSystemCache.java:393)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:906)
    at io.trino.filesystem.hdfs.HdfsInputFile.lambda$openFile$1(HdfsInputFile.java:108)
    at io.trino.hdfs.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
    at io.trino.hdfs.HdfsEnvironment.doAs(HdfsEnvironment.java:93)
    at io.trino.filesystem.hdfs.HdfsInputFile.openFile(HdfsInputFile.java:108)
    at io.trino.filesystem.hdfs.HdfsInputFile.newInput(HdfsInputFile.java:57)
    at io.trino.plugin.hive.parquet.TrinoParquetDataSource.<init>(TrinoParquetDataSource.java:39)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:227)
    ... 30 more
Caused by: org.apache.hadoop.fs.azurebfs.contracts.exceptions.AbfsRestOperationException: HEAD https://automation3.dfs.core.windows.net/container/test-delta-lake-integration-smoke-test-mn0vm27bvv/test_optimize_partitioned_table_csjbhy44ko/value%3Dwith%3Fquestion/20230413_092851_00004_bhvvq-18bc91aa-2b1c-458a-a297-cc33971ba25e?timeout=90
StatusCode=404
StatusDescription=The specified path does not exist.
ErrorCode=
ErrorMessage=
    at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:134)
    at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathProperties(AbfsClient.java:352)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.openFileForRead(AzureBlobFileSystemStore.java:349)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:174)
    ... 39 more
ebyhr commented 1 year ago

Might relate to https://issues.apache.org/jira/browse/HADOOP-18580

vinay-kl commented 1 year ago

@ebyhr @findinpath https://github.com/trinodb/trino/pull/17038#discussion_r1175891501 — the code mentioned here would fix issue 17012. Should I create a separate PR for this?

The aforementioned fix handles absolute-path reading as well as special characters in partition values.
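For illustration of the general technique such fixes rely on: Hive-style partition layouts escape a set of reserved characters as %XX inside the directory name itself (compare Hive's FileUtils.escapePathName), so the name stored on disk is already URL-safe. A rough Python sketch of that idea — the exact character set below is an assumption for illustration, not the code in the linked PR:

```python
# Rough sketch of Hive-style partition-value escaping: reserved
# characters are written as %XX in the directory name itself.
# This character set is illustrative, not the exact set used by
# Hive or by the fix discussed above.
RESERVED = set('"#%\'*/:=?\\{}[]^')

def escape_partition_value(value: str) -> str:
    out = []
    for ch in value:
        # Escape reserved characters and control characters.
        if ch in RESERVED or ord(ch) < 0x20:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

print(escape_partition_value("with?question"))  # with%3Fquestion
```

With this scheme the directory written for the repro above would be value=with%3Fquestion, which survives the round trip through the ABFS REST API without a second, conflicting layer of encoding.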

degant commented 1 year ago

Is this fix on the roadmap, and do you have an ETA? We're trying to use the Delta Lake connector with ADLS and have a large number of partitioned files generated by Spark with special characters. Removing the special characters is not an option for us since there are too many files.