prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Is there a way to tell Presto to skip invalid Parquet files in a query? #14652

Open varunbpatil opened 4 years ago

varunbpatil commented 4 years ago

I'm using Presto 0.236 with the Hive connector.

presto:default> SELECT Time FROM mytable WHERE date_ >= 20200613 ORDER BY Time;

Query 20200615_151944_00029_cyskg, FAILED, 1 node
Splits: 118 total, 0 done (0.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20200615_151944_00029_cyskg failed: hdfs://x.x.x.x:8020/abc/test.parquet is not a valid Parquet File

The error is correct: the file above, test.parquet, really is an invalid Parquet file.

However, I want Presto to skip such files during the query. Is this possible?

I tried an older version, 0.181, and it does skip invalid Parquet files, but I need some features from the newer Presto version. Is there a flag for this?

Ravion commented 4 years ago

One solution I can think of is to write a custom function that ignores invalid files and aggregates the results from all valid files. It would definitely be expensive, though.

Best, Ravi
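
As a rough illustration of "ignoring invalid files", here is a minimal Python sketch that finds suspect files up front so they can be moved out of the table location before querying. It is not a Presto feature or flag. It relies only on the Parquet format's 4-byte "PAR1" magic at the start and end of every unencrypted file, and it assumes the files are reachable through a local or mounted path (not hdfs:// directly). The function names looks_like_parquet and find_invalid_parquet are hypothetical.

```python
import os

PARQUET_MAGIC = b"PAR1"
# Smallest plausible layout: 4-byte header magic + 4-byte footer length
# + 4-byte trailing magic.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path):
    """Return True if the file carries the Parquet magic bytes at both ends.

    This is only a cheap sanity check: it does not parse the footer, and it
    treats files with an encrypted footer (which use a different magic) as
    invalid.
    """
    try:
        if os.path.getsize(path) < MIN_PARQUET_SIZE:
            return False
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    except OSError:
        # Unreadable files are reported as invalid rather than raising.
        return False
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_invalid_parquet(directory):
    """Walk a directory tree and list *.parquet files that fail the check."""
    bad = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if name.endswith(".parquet"):
                full = os.path.join(root, name)
                if not looks_like_parquet(full):
                    bad.append(full)
    return bad
```

Calling find_invalid_parquet on a local copy of the table directory returns the paths that would need to be repaired or moved aside. A file that passes this check can still be corrupt in other ways, so treat the result as a first filter only.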
