Closed: suryanshagnihotri closed this issue 10 months ago
Trino Version : 389
Can you make sure if the issue still happens with the latest version?
Hi @ebyhr, I used the latest Trino Docker image (version 431) to run the test. In hive.properties I changed the metastore URI to point at my metastore so that I could see the table. After running the query I no longer see the above error, but it does fail with a different one:
io.trino.spi.TrinoException: Error opening Hive split oci://bucket/test_trino_par/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet (offset=33554432, length=33554432): class "shaded.parquet.org.apache.thrift.transport.TMemoryBuffer"'s signer information does not match signer information of other classes in the same package
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:331)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:185)
at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:202)
at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:137)
at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:61)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:299)
at io.trino.operator.Driver.processInternal(Driver.java:395)
at io.trino.operator.Driver.lambda$process$8(Driver.java:298)
at io.trino.operator.Driver.tryWithLock(Driver.java:694)
at io.trino.operator.Driver.process(Driver.java:290)
at io.trino.operator.Driver.processForDuration(Driver.java:261)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:887)
at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
at io.trino.$gen.Trino_431____20231031_120106_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.SecurityException: class "shaded.parquet.org.apache.thrift.transport.TMemoryBuffer"'s signer information does not match signer information of other classes in the same package
at java.base/java.lang.ClassLoader.checkCerts(ClassLoader.java:1163)
at java.base/java.lang.ClassLoader.preDefineClass(ClassLoader.java:907)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1015)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
at io.trino.server.PluginClassLoader.loadClass(PluginClassLoader.java:128)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
at io.trino.parquet.reader.MetadataReader.readFooter(MetadataReader.java:118)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:230)
... 18 more
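The SecurityException in this trace typically means that a signed JAR and an unsigned JAR both contribute classes to the same package under one classloader (here the plugin classloader). As a quick diagnostic, one can list which JARs on the plugin classpath are signed; this is a hedged stdlib sketch, not part of Trino, and `signed_jar_entries` is a hypothetical helper name:

```python
import zipfile

def signed_jar_entries(jar_path):
    """Return the signature entries (META-INF/*.SF, *.RSA, *.DSA, *.EC)
    inside a JAR; a non-empty result means the JAR is signed and can
    clash with unsigned classes in the same package."""
    suffixes = (".SF", ".RSA", ".DSA", ".EC")
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.startswith("META-INF/") and name.upper().endswith(suffixes)]
```

Running this over every JAR dropped into the plugin directory points at the offender; common fixes are removing the extra JAR or stripping its signature files before repackaging.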
Hi @ebyhr, the signer-information error above was caused by a JAR related to the OCI object store, which I had added in order to connect to it. I fixed that, and now I am back to the same error as originally reported (Trino 431):
io.trino.spi.TrinoException: Failed to read Parquet file: oci://bucket@idfoaqwbw7ew/test_trino_par/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet
at io.trino.plugin.hive.parquet.ParquetPageSource.handleException(ParquetPageSource.java:200)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$1(ParquetPageSourceFactory.java:307)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:75)
at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:361)
at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:340)
at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:235)
at io.trino.spi.Page.getLoadedPage(Page.java:231)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:305)
at io.trino.operator.Driver.processInternal(Driver.java:395)
at io.trino.operator.Driver.lambda$process$8(Driver.java:298)
at io.trino.operator.Driver.tryWithLock(Driver.java:694)
at io.trino.operator.Driver.process(Driver.java:290)
at io.trino.operator.Driver.processForDuration(Driver.java:261)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:887)
at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
at io.trino.$gen.Trino_431____20231102_104429_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@64507c98
at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:105)
at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:41)
at com.google.common.collect.Iterators$PeekingImpl.peek(Iterators.java:1218)
at io.trino.parquet.reader.PageReader.readDictionaryPage(PageReader.java:147)
at io.trino.parquet.reader.AbstractColumnReader.setPageReader(AbstractColumnReader.java:79)
at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:435)
at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:524)
at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:507)
at io.trino.parquet.reader.ParquetReader.lambda$nextPage$3(ParquetReader.java:272)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:72)
... 17 more
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@64507c98
at org.apache.parquet.format.Util.read(Util.java:366)
at org.apache.parquet.format.Util.readPageHeader(Util.java:133)
at org.apache.parquet.format.Util.readPageHeader(Util.java:128)
at io.trino.parquet.reader.ParquetColumnChunkIterator.readPageHeader(ParquetColumnChunkIterator.java:112)
at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:79)
... 26 more
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@64507c98
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1114)
at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)
at org.apache.parquet.format.PageHeader.read(PageHeader.java:902)
at org.apache.parquet.format.Util.read(Util.java:363)
... 30 more
In the past people have mentioned that the Hadoop filesystem jars for some platforms don't deal with large files correctly. See https://github.com/trinodb/trino/issues/2256#issuecomment-640685774.
Does the error go away if you generate the same file with a smaller amount of data?
Also, since you're using a custom filesystem JAR, this is not something we can guarantee will work, because we don't test against it.
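A classic failure mode in filesystem clients of the kind referenced in that linked comment is truncating a 64-bit byte offset to a signed 32-bit integer, which silently corrupts positioned reads past 2 GiB. This is only an illustration of the bug class, not a claim about oci-hdfs-connector specifically:

```python
import ctypes

def read_offset_as_int32(offset):
    """Simulate a client that stores a 64-bit byte offset in a signed
    32-bit int, as some buggy filesystem clients have done."""
    return ctypes.c_int32(offset).value

# The 32 MiB split offset from the trace survives the round-trip...
print(read_offset_as_int32(33_554_432))   # 33554432
# ...but an offset past 2 GiB wraps negative, so the client would seek
# to the wrong position and hand back bytes that are not a PageHeader.
print(read_offset_as_int32(3 * 1024**3))  # -1073741824
```

A misread at the wrong position would surface exactly as a Thrift deserialization failure like the `uncompressed_page_size` error above, since the reader is parsing arbitrary data as a page header.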
@hashhar I don't think it is related to our JAR; it plays a role similar to the S3 connector. We have the OCI connector: https://github.com/oracle/oci-hdfs-connector. Open-source Spark is able to read the file, as I mentioned in the issue (with the same oci-hdfs-connector). The issue only happens with Trino.
@hashhar Also, this does not only happen when the data is in the object store. It is reproducible when the data is in HDFS, and no custom JAR is required in that scenario. Open-source Trino is used to query HDFS.
I tried reproducing this locally and had no issue with reading the file shared here.
@suryanshagnihotri since it reproduces for you locally, you can try enabling debug logs for io.trino.parquet.reader and share the output here, or debug through an IDE to see where this fails.
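For reference, debug logging for the Parquet reader can be enabled by adding a line to `etc/log.properties` (assuming a standard Trino deployment layout) and restarting the server:

```properties
# etc/log.properties — logger name = minimum level
io.trino.parquet.reader=DEBUG
```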
@raunaqmorarka thanks for checking. Sorry for stating that it fails on HDFS too; I got incorrect information and trusted it, but when I tried it personally, it fails only on the object store. It somehow works when Spark is used to query the data (even when it is in the object store), so I thought it was a Trino issue. But since it works on HDFS, it might be something between Trino and our connector. I will have to check, and will try debugging as you mentioned.
Hi, I generated TPC-DS data and uploaded it to the object store via Spark (3.0.1, which uses Parquet 1.10.x). When querying with Trino, it fails with the exception below.
In order to reproduce the issue, I took the above file and created a table from it. I am able to query and read the file using Spark. The file is not corrupt, as I am also able to run the parquet-tools meta command on it.
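Besides parquet-tools, a lighter-weight sanity check is to verify the 4-byte ASCII magic `PAR1` that every Parquet file carries at both the start and the end; this catches truncated uploads, though unlike `parquet-tools meta` it does not validate the footer metadata. A minimal stdlib sketch (the helper name is hypothetical):

```python
import os

def has_parquet_magic(path):
    """Quick sanity check: a Parquet file begins and ends with the
    4-byte ASCII magic b'PAR1'."""
    # Smallest possible layout: magic + 4-byte footer length + magic.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```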
Table creation via Spark:
Trino:
Trino Version: 389
Trino Connector: hive
I am attaching the parquet file:
https://drive.google.com/file/d/1QCTAa9EFtjnkE_ZBKT-YjKGFk0eruDy4/view?usp=sharing
Let me know if anything else is required. Thanks.