trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

trino fails reading parquet file #19590

Status: Closed (closed by suryanshagnihotri 10 months ago)

suryanshagnihotri commented 10 months ago

Hi, I generated TPC-DS data with Spark (3.0.1, which ships Parquet 1.10.x) and uploaded it to an object store. Querying it with Trino fails with the exception below.

Query 20231031_074407_00018_yzs36 failed: Failed reading parquet data; source= oci://bucketTPCDSSF1000PQhdfs_2/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet; can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@49627eba
io.trino.spi.TrinoException: Failed reading parquet data; source= oci://bucketTPCDSSF1000PQhdfs_2/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet; can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@49627eba
    at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:218)
    at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:401)
    at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:380)
    at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:287)
    at io.trino.spi.Page.getLoadedPage(Page.java:288)
    at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314)
    at io.trino.operator.Driver.processInternal(Driver.java:410)
    at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
    at io.trino.operator.Driver.tryWithLock(Driver.java:698)
    at io.trino.operator.Driver.process(Driver.java:305)
    at io.trino.operator.Driver.processForDuration(Driver.java:276)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:740)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
    at io.trino.$gen.Trino_389_34_gfe9d03b____20231031_044138_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@49627eba
    at org.apache.parquet.format.Util.read(Util.java:366)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:133)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:128)
    at io.trino.parquet.reader.ParquetColumnChunk.readPageHeader(ParquetColumnChunk.java:78)
    at io.trino.parquet.reader.ParquetColumnChunk.readAllPages(ParquetColumnChunk.java:91)
    at io.trino.parquet.reader.ParquetReader.createPageReader(ParquetReader.java:404)
    at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:379)
    at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:460)
    at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:443)
    at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:212)
    ... 17 more
Caused by: io.trino.hive.$internal.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@49627eba
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1114)
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)
    at org.apache.parquet.format.PageHeader.read(PageHeader.java:902)
    at org.apache.parquet.format.Util.read(Util.java:363)
    ... 26 more

To reproduce the issue, I took the file above and created a table from it. I can read the file with Spark, and it does not appear to be corrupt, since the parquet-tools meta command also runs on it successfully.
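
For context, a footer-level check such as parquet-tools meta only reads the file metadata at the end of the file, so it does not exercise the page headers that the stack trace above fails on. The sketch below shows roughly what such a framing check covers: the leading and trailing `PAR1` magic bytes and the 4-byte little-endian footer length. The function name `check_parquet_footer` is hypothetical, and the check assumes an unencrypted Parquet file on the local filesystem.

```python
import struct

PARQUET_MAGIC = b"PAR1"

def check_parquet_footer(path):
    """Return the footer metadata length if the file has valid Parquet framing.

    Hypothetical helper: it validates only the magic bytes and the footer
    length, and never looks at column chunks or page headers.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end to learn the size
        size = f.tell()
        if size < 12:                     # 4 (magic) + 4 (length) + 4 (magic)
            raise ValueError("file too small to be a Parquet file")
        f.seek(0)
        head = f.read(4)
        f.seek(size - 8)
        tail = f.read(8)

    footer_len = struct.unpack("<I", tail[:4])[0]
    if head != PARQUET_MAGIC or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

A file can pass this kind of check and still fail while data pages are being read, which is why a successful metadata dump is weaker evidence than a full read of every column.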

scala> val parquetFilePath = "file:///tmp/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet"
parquetFilePath: String = file:///tmp/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet

scala> val df = spark.read.parquet(parquetFilePath)
df: org.apache.spark.sql.DataFrame = [ss_sold_time_sk: int, ss_item_sk: int ... 20 more fields]

scala> df.show()
+---------------+----------+--------------+-----------+-----------+----------+-----------+-----------+----------------+-----------+-----------------+-------------+--------------+-------------------+------------------+---------------------+-----------------+----------+-------------+-----------+-------------------+-------------+
|ss_sold_time_sk|ss_item_sk|ss_customer_sk|ss_cdemo_sk|ss_hdemo_sk|ss_addr_sk|ss_store_sk|ss_promo_sk|ss_ticket_number|ss_quantity|ss_wholesale_cost|ss_list_price|ss_sales_price|ss_ext_discount_amt|ss_ext_sales_price|ss_ext_wholesale_cost|ss_ext_list_price|ss_ext_tax|ss_coupon_amt|ss_net_paid|ss_net_paid_inc_tax|ss_net_profit|
+---------------+----------+--------------+-----------+-----------+----------+-----------+-----------+----------------+-----------+-----------------+-------------+--------------+-------------------+------------------+---------------------+-----------------+----------+-------------+-----------+-------------------+-------------+
|          45619|    155593|       5071547|    1483983|        772|   5356408|       null|        110|        43200003|       null|             null|       106.15|          null|               null|           7040.88|              6110.28|         10508.85|    211.22|         null|       null|               null|         null|
|           null|    215125|      10107033|    1917032|       5431|   4905383|        121|       null|        11870476|       null|             null|         null|          null|               null|              2.51|                 null|            83.71|      0.02|         null|       null|               null|         null|
|           null|    115231|          null|       null|       null|   2237449|        169|        535|        43200004|         85|            82.74|       162.99|         99.42|               0.00|              null|              7032.90|             null|    591.54|         0.00|       null|            9042.24|         null|
|           null|     18697|          null|     956706|       null|      null|       null|       1344|        11870480|         46|            41.35|         null|          null|               null|           1935.22|              1902.10|             null|      null|         null|    1935.22|            2012.62|         null|
|          69966|    169000|      11651775|    1125275|       null|      null|        500|       null|        43200006|       null|            51.15|        71.61|          2.86|               0.00|              null|                 null|             null|      2.51|         0.00|     125.84|             128.35|         null|
|          69900|    289321|          null|     956706|       null|      null|        602|        106|        11870480|         41|            25.32|        35.95|         16.53|               null|              null|                 null|             null|      null|         null|       null|             677.73|      -360.39|
|          47540|    278314|       7441545|       null|       4904|   4741960|       null|       null|        43200007|         29|             null|         null|        189.38|               0.00|              null|              2830.98|             null|      null|         0.00|       null|            5986.30|         null|
|          62229|    256558|          null|       null|       null|      null|        674|       null|        11870491|       null|             null|         null|         24.36|               0.00|              null|                 null|             null|      null|         0.00|       null|             442.86|      -352.80|
|           null|     41467|       9008168|     930974|       4180|   4806986|       null|          4|        43200009|       null|             null|         null|          null|             533.22|              null|                 null|          1127.40|      null|       533.22|       null|             440.64|      -291.12|
|          66887|    282160|       4214130|    1685206|       3929|   1635805|       null|         88|        11870492|         10|             null|        74.40|          null|               null|              0.00|               442.90|           744.00|      0.00|         null|       null|               null|         null|
|          51243|    207285|          null|    1592195|       2348|   2370548|       null|       1194|        43200010|       null|            69.66|         null|          null|               0.00|              null|               208.98|             null|      0.00|         0.00|       null|               null|      -122.49|
|           null|     54925|          null|       null|       1989|   4218686|       null|       null|        11870494|         14|             null|        87.73|         14.91|               null|            208.74|              1086.96|          1228.22|      2.08|         null|       null|             210.82|         null|
|           null|    149935|       6884524|     542115|       1306|      null|       null|       1289|        43200011|       null|            58.04|         null|          null|               0.00|           1508.15|              2031.40|             null|      null|         0.00|       null|               null|         null|
|          43844|    268885|      10153721|     861993|       null|      null|       null|        349|        11870498|       null|            47.40|        80.10|         74.49|               null|              null|                 null|             null|      null|         null|       null|            6934.27|      2356.83|
|          55317|    194449|       1169756|    1010171|       null|    232601|       null|        868|        43200012|       null|             1.76|         1.98|          1.30|               null|             29.90|                 null|            45.54|      null|         null|       null|               null|         null|
|           null|    113248|      11053247|     764930|       null|      null|        253|        239|        11870501|       null|             null|         null|          null|               null|              null|                 null|             null|      0.00|         null|       null|              84.96|       -67.86|
|           null|     21235|       1169756|    1010171|       null|      null|        157|       null|        43200012|       null|             null|         null|          null|               null|              null|                 null|           834.36|      null|         null|     558.96|               null|         null|
|           null|    145123|          null|       null|       null|      null|       null|       null|        11870506|       null|            21.31|         null|         19.41|               null|           1843.95|              2024.45|             null|      null|         null|      18.44|              18.62|         null|
|           null|     52555|       1169756|    1010171|       null|    232601|        157|       null|        43200012|         75|             null|         null|         82.90|               null|              null|                 null|             null|     62.17|         null|       null|               null|         null|
|           null|     87405|          null|    1217152|       6485|   3025418|        512|        336|        11870507|       null|            17.35|         null|          null|               0.00|              null|                 null|           684.11|      8.18|         0.00|     102.37|             110.55|      -400.78|
+---------------+----------+--------------+-----------+-----------+----------+-----------+-----------+----------------+-----------+-----------------+-------------+--------------+-------------------+------------------+---------------------+-----------------+----------+-------------+-----------+-------------------+-------------+
only showing top 20 rows

Table creation via Spark:

CREATE TABLE `trino_test1` (
  `ss_sold_time_sk` INT,
  `ss_item_sk` INT,
  `ss_customer_sk` INT,
  `ss_cdemo_sk` INT,
  `ss_hdemo_sk` INT,
  `ss_addr_sk` INT,
  `ss_store_sk` INT,
  `ss_promo_sk` INT,
  `ss_ticket_number` BIGINT,
  `ss_quantity` INT,
  `ss_wholesale_cost` DECIMAL(7,2),
  `ss_list_price` DECIMAL(7,2),
  `ss_sales_price` DECIMAL(7,2),
  `ss_ext_discount_amt` DECIMAL(7,2),
  `ss_ext_sales_price` DECIMAL(7,2),
  `ss_ext_wholesale_cost` DECIMAL(7,2),
  `ss_ext_list_price` DECIMAL(7,2),
  `ss_ext_tax` DECIMAL(7,2),
  `ss_coupon_amt` DECIMAL(7,2),
  `ss_net_paid` DECIMAL(7,2),
  `ss_net_paid_inc_tax` DECIMAL(7,2),
  `ss_net_profit` DECIMAL(7,2),
  `ss_sold_date_sk` INT)
STORED AS PARQUET
PARTITIONED BY (ss_sold_date_sk)
LOCATION 'oci://bucket/test_trino_par/store_sales';

LOAD DATA LOCAL INPATH '/tmp/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet' OVERWRITE INTO TABLE trino_test1 PARTITION (ss_sold_date_sk = '__HIVE_DEFAULT_PARTITION__');

Trino:

trino:default> select * from trino_test1;

Query 20231031_084847_00037_yzs36, FAILED, 6 nodes
http://10.1.226.39:8285/ui/query.html?20231031_084847_00037_yzs36
Splits: 150 total, 2 done (1.33%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 14% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.1
Peak Memory: 0B
0.50 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20231031_084847_00037_yzs36 failed: Failed reading parquet data; source= oci://bucket/test_trino_par/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet; can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@4dea7500
io.trino.spi.TrinoException: Failed reading parquet data; source= oci://bucket/test_trino_par/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet; can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@4dea7500
    at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:218)
    at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:401)
    at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:380)
    at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:287)
    at io.trino.spi.Page.getLoadedPage(Page.java:288)
    at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314)
    at io.trino.operator.Driver.processInternal(Driver.java:410)
    at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
    at io.trino.operator.Driver.tryWithLock(Driver.java:698)
    at io.trino.operator.Driver.process(Driver.java:305)
    at io.trino.operator.Driver.processForDuration(Driver.java:276)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:740)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
    at io.trino.$gen.Trino_389_34_gfe9d03b____20231031_044139_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@4dea7500
    at org.apache.parquet.format.Util.read(Util.java:366)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:133)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:128)
    at io.trino.parquet.reader.ParquetColumnChunk.readPageHeader(ParquetColumnChunk.java:78)
    at io.trino.parquet.reader.ParquetColumnChunk.readAllPages(ParquetColumnChunk.java:91)
    at io.trino.parquet.reader.ParquetReader.createPageReader(ParquetReader.java:404)
    at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:379)
    at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:460)
    at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:443)
    at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:212)
    ... 17 more
Caused by: io.trino.hive.$internal.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@4dea7500
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1114)
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)
    at org.apache.parquet.format.PageHeader.read(PageHeader.java:902)
    at org.apache.parquet.format.Util.read(Util.java:363)
    ... 26 more

Trino version: 389, Trino connector: hive

bash-4.2$ cat /etc/trino/conf/catalog/hive.properties
hive.max-partitions-per-writers=2500
hive.partition-statistics-sample-size=50
hive.allow-drop-table=true
connector.name=hive
hive.metastore.uri=thrift://trinoprun0.subnetpoc1.vcn12231050.oraclevcn.com:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
hive.non-managed-table-creates-enabled=true
hive.non-managed-table-writes-enabled=true
hive.metastore-timeout=1800s

I am attaching the Parquet file:

https://drive.google.com/file/d/1QCTAa9EFtjnkE_ZBKT-YjKGFk0eruDy4/view?usp=sharing

Let me know if anything else is required. Thanks.

ebyhr commented 10 months ago

Trino Version : 389

Can you check whether the issue still happens with the latest version?

suryanshagnihotri commented 10 months ago

Hi @ebyhr, I ran the test with the latest Trino Docker image, which is version 431. In hive.properties I changed the metastore URI to point to my metastore so that I can see the table. After running the query I no longer see the error above, but it fails with a different error:

io.trino.spi.TrinoException: Error opening Hive split oci://bucket/test_trino_par/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet (offset=33554432, length=33554432): class "shaded.parquet.org.apache.thrift.transport.TMemoryBuffer"'s signer information does not match signer information of other classes in the same package
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:331)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:185)
    at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:202)
    at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:137)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
    at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:61)
    at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:299)
    at io.trino.operator.Driver.processInternal(Driver.java:395)
    at io.trino.operator.Driver.lambda$process$8(Driver.java:298)
    at io.trino.operator.Driver.tryWithLock(Driver.java:694)
    at io.trino.operator.Driver.process(Driver.java:290)
    at io.trino.operator.Driver.processForDuration(Driver.java:261)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:887)
    at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
    at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
    at io.trino.$gen.Trino_431____20231031_120106_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.SecurityException: class "shaded.parquet.org.apache.thrift.transport.TMemoryBuffer"'s signer information does not match signer information of other classes in the same package
    at java.base/java.lang.ClassLoader.checkCerts(ClassLoader.java:1163)
    at java.base/java.lang.ClassLoader.preDefineClass(ClassLoader.java:907)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1015)
    at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
    at io.trino.server.PluginClassLoader.loadClass(PluginClassLoader.java:128)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
    at io.trino.parquet.reader.MetadataReader.readFooter(MetadataReader.java:118)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:230)
    ... 18 more

suryanshagnihotri commented 10 months ago

Hi @ebyhr, the error above was caused by a JAR for the OCI object store that I had added in order to connect to the object store. I fixed that. Now Trino 431 gives the same error as originally reported:

io.trino.spi.TrinoException: Failed to read Parquet file: oci://bucket@idfoaqwbw7ew/test_trino_par/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/part-00042-eac89ce9-041a-4254-b90a-6aceb3c8e6c4.c000.snappy.parquet
    at io.trino.plugin.hive.parquet.ParquetPageSource.handleException(ParquetPageSource.java:200)
    at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$1(ParquetPageSourceFactory.java:307)
    at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:75)
    at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:361)
    at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:340)
    at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:235)
    at io.trino.spi.Page.getLoadedPage(Page.java:231)
    at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:305)
    at io.trino.operator.Driver.processInternal(Driver.java:395)
    at io.trino.operator.Driver.lambda$process$8(Driver.java:298)
    at io.trino.operator.Driver.tryWithLock(Driver.java:694)
    at io.trino.operator.Driver.process(Driver.java:290)
    at io.trino.operator.Driver.processForDuration(Driver.java:261)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:887)
    at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
    at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
    at io.trino.$gen.Trino_431____20231102_104429_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.RuntimeException: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@64507c98
    at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:105)
    at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:41)
    at com.google.common.collect.Iterators$PeekingImpl.peek(Iterators.java:1218)
    at io.trino.parquet.reader.PageReader.readDictionaryPage(PageReader.java:147)
    at io.trino.parquet.reader.AbstractColumnReader.setPageReader(AbstractColumnReader.java:79)
    at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:435)
    at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:524)
    at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:507)
    at io.trino.parquet.reader.ParquetReader.lambda$nextPage$3(ParquetReader.java:272)
    at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:72)
    ... 17 more
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@64507c98
    at org.apache.parquet.format.Util.read(Util.java:366)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:133)
    at org.apache.parquet.format.Util.readPageHeader(Util.java:128)
    at io.trino.parquet.reader.ParquetColumnChunkIterator.readPageHeader(ParquetColumnChunkIterator.java:112)
    at io.trino.parquet.reader.ParquetColumnChunkIterator.next(ParquetColumnChunkIterator.java:79)
    ... 26 more
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@64507c98
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1114)
    at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1025)
    at org.apache.parquet.format.PageHeader.read(PageHeader.java:902)
    at org.apache.parquet.format.Util.read(Util.java:363)
    ... 30 more

hashhar commented 10 months ago

In the past people have mentioned that the Hadoop filesystem jars for some platforms don't deal with large files correctly. See https://github.com/trinodb/trino/issues/2256#issuecomment-640685774.

Does the error go away if you generate the same file but with a smaller amount of data?

Also, since you're using a custom filesystem JAR, we can't guarantee that it works, because we don't test against it.
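
One way to probe the large-file hypothesis without regenerating data is to compare byte ranges of the same file read through two different paths, for example a local copy and a stream opened through the filesystem JAR. The following is only a sketch: `range_digest` and `first_mismatch` are hypothetical helpers, and they assume both sides are exposed as seekable, file-like Python objects, which is not something the thread itself establishes.

```python
import hashlib
import io

def range_digest(stream, offset, length, chunk_size=1 << 20):
    """SHA-256 of `length` bytes starting at `offset`, tolerating short reads."""
    stream.seek(offset)
    digest = hashlib.sha256()
    remaining = length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # EOF before the requested range was fully read
            raise EOFError(f"stream ended {remaining} bytes early")
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.hexdigest()

def first_mismatch(reference, candidate, total_size, window=1 << 20):
    """Return the offset of the first window whose bytes differ, or None."""
    for offset in range(0, total_size, window):
        length = min(window, total_size - offset)
        if range_digest(reference, offset, length) != range_digest(candidate, offset, length):
            return offset
    return None

# Usage with in-memory stand-ins for the two read paths.
data = bytes(range(256)) * 40
same = first_mismatch(io.BytesIO(data), io.BytesIO(data), len(data), window=1024)
```

If the two paths disagree only beyond some offset, that would point at how ranged reads are served rather than at the file contents. If they agree everywhere, the filesystem layer is a less likely culprit.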

suryanshagnihotri commented 10 months ago

@hashhar I don't think it is related to our JAR; it is similar to the S3 connector. We use the OCI connector, https://github.com/oracle/oci-hdfs-connector. Open-source Spark can read the file, as I mentioned in the issue (with the same oci-hdfs-connector). The issue only happens with Trino.

suryanshagnihotri commented 10 months ago

@hashhar Also, this does not happen only when the data is in the object store. It is also reproducible when the data is in HDFS, where no custom JAR is required. Open-source Trino is used to query HDFS.

raunaqmorarka commented 10 months ago

I tried reproducing this locally and had no issue reading the file shared here. @suryanshagnihotri since it reproduces for you locally, you can try enabling debug logs for io.trino.parquet.reader and sharing the output here, or debugging through an IDE to see where this fails.
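
As a rough illustration of the logging suggestion, Trino reads per-logger levels from a log.properties file in its configuration directory. The fragment below is a minimal sketch that assumes this file lives alongside the other configuration shown earlier in the thread (the exact path on a given installation is an assumption), and that the server is restarted afterwards.

```properties
# Assumed location: the Trino configuration directory, e.g. etc/log.properties
# Raise verbosity only for the Parquet reader package named above
io.trino.parquet.reader=DEBUG
```

Scoping the level to one package keeps the rest of the server log at its default verbosity, which should make the reader output easier to find.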

suryanshagnihotri commented 10 months ago

@raunaqmorarka thanks for checking. Sorry for stating that it fails on HDFS too; I was given incorrect information and trusted it, but when I tried it myself it fails only on the object store. It works when Spark is used to query the data (even when the data is in the object store), so I thought it was a Trino issue. But since it works on HDFS, it might be something between Trino and our connector. I will have to check. I will try debugging as you suggested.