tooptoop4 closed this issue 4 years ago
@arhimondr @rschlussel any idea?
@zhenxiao Could you please have a look?
Hi @tooptoop4, the error message says the Parquet file is corrupted:
Not valid Parquet file: s3a://redact/s/t/temp2m/4071ae29-84f1-4d32-a691-fd7ab991be95_0_20190923125717.parquet expected magic number: [80, 65, 82, 49] got: [-11, 0, -66, 105]
The file's magic number does not match the Parquet magic number.
Could you please create a table with only this one file: s3a://redact/s/t/temp2m/4071ae29-84f1-4d32-a691-fd7ab991be95_0_20190923125717.parquet
and try both Presto versions?
My current guess is that the Parquet file is corrupted.
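For reference, the expected magic number [80, 65, 82, 49] is just the ASCII string "PAR1", and Presto prints the bytes it actually read as signed Java bytes. A small Python sketch (not part of Presto) to decode both sides of the error message:

```python
# Presto prints the bytes it read as signed Java bytes; converting them to
# unsigned hex (or ASCII) makes them comparable with known file signatures.
def java_bytes_to_hex(signed_bytes):
    """Map signed Java byte values (-128..127) to two-digit hex strings."""
    return [format(b & 0xFF, "02x") for b in signed_bytes]

# The expected magic number is simply the ASCII string "PAR1":
assert bytes(b & 0xFF for b in [80, 65, 82, 49]) == b"PAR1"

# The bytes actually read in the failing query, as unsigned hex:
assert java_bytes_to_hex([-11, 0, -66, 105]) == ["f5", "00", "be", "69"]
```

Those hex values do not match any common file signature, which is consistent with reading from an arbitrary position inside the file rather than reading a different file format.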
@zhenxiao I think it is more likely that the 0.221 release introduced a bug, because the prestosql 319 release is able to query this Parquet file.
Hi @tooptoop4, I did not see any code change on the Parquet path in the 0.221 release. From the error message and stack trace, the Parquet file appears to be corrupted.
Could you please send me your Parquet file? I could try to reproduce locally.
My email: rooservelt.luo@gmail.com
@zhenxiao https://github.com/prestodb/presto/issues/12832 mentions several changes about Hive partitions/splits:
- ef999c0 Improve InternalHiveSplit memory usage estimate
- 5985ec6 Fix HiveQueryRunner.createBucketedSession
- 99e4cb5 Move getPathDomain to HiveSplitManager
- 029aa84 Extract partition info from InternalHiveSplit
- 44be99d Set useRewindableSplit in the HiveSplitSourceConstructor
- 62fbcd6 Remove start field from InternalHiveBlock
- 7fbedba Encode InternalHiveBlocks as a list and array
- 891c1c7 Don't store Hive split addresses if not needed
- b6336ac Fix formatting in BackgroundHiveSplitLoader
- ca48f65 Encode bucket numbers as int in InternalHiveSplit
- d638284 Refactor HiveSplitManager
- db9aabb Encode path as byte array in InternalHiveSplit
- 63feb11 Refactor stateReference initialization in HiveSplitSource
- 8d1894b Add integration test for staging partition in AbstractTestHiveClient
- 8f214a1 Add SUPPORTS_REWINDABLE_SPLIT_SOURCE to Hive connector capabilities
- 4a8f9b8 Add base path to partition info
- 481cd70 Introduce SplitSchedulingContext in getSplits() SPI
@tooptoop4 The Parquet scan code is untouched, and has been for a while. Could you please share your Parquet file? I could take a look.
@tooptoop4 did you fix this bug?
No, I'm stuck on 0.220.
Do you face the issue too, @iceted?
Maybe related to https://issues.apache.org/jira/browse/HUDI-409; https://github.com/apache/incubator-hudi/issues/1384#issuecomment-597750294 mentions: "Can this have some relation with https://issues.apache.org/jira/browse/HUDI-409 , as we recently encountered parquet corruption errors (magic numbers mismatch) while reading from presto on a fresh hudi table, and there were no errors/warn reported by spark or in hudi commit metadata files."
It could be due to a bug that was just fixed by https://github.com/prestodb/presto/pull/14355. It should be in the next Presto release.
That must be it; https://github.com/prestodb/presto/pull/12780 went into the 0.221 release, causing it!
nice catch, thank you @rschlussel
presto v0.220 (works):
java -jar /home/ec2-user/presto --server http://localhost:4038 --execute "select metric, sum(1e-6) as mrows FROM hive.s.t group by metric order by metric" --output-format TSV --user x
temp2m	434.17728003662273
This issue must have been introduced in the 0.221 release (every release after 0.220 has the issue).
presto v0.221 (broken):
java -jar /home/ec2-user/presto --server http://localhost:4038 --execute "select metric, sum(1e-6) as mrows FROM hive.s.t group by metric order by metric" --output-format TSV --user x
Query 20190926_133301_06326_e38kd failed: Not valid Parquet file: s3a://redact/s/t/temp2m/4071ae29-84f1-4d32-a691-fd7ab991be95_0_20190923125717.parquet expected magic number: [80, 65, 82, 49] got: [-11, 0, -66, 105]
presto v0.226 (broken):
java -jar /home/ec2-user/presto --server http://localhost:4038 --execute "select metric, sum(1e-6) as mrows FROM hive.s.t group by metric order by metric" --output-format TSV --user x
Query 20190926_133301_06326_e38kd failed: Not valid Parquet file: s3a://redact/s/t/temp2m/4071ae29-84f1-4d32-a691-fd7ab991be95_0_20190923125717.parquet expected magic number: [80, 65, 82, 49] got: [-11, 0, -66, 105]
com.facebook.presto.spi.PrestoException: Not valid Parquet file: s3a://redact/s/t/temp2m/0f330796-e679-4ce6-9f42-8a99834b5536_0_20190925041855.parquet expected magic number: [80, 65, 82, 49] got: [-89, -61, -38, -82]
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:245)
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:145)
    at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:240)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:114)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:51)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:58)
    at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:225)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
    at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
    at com.facebook.presto.$gen.Presto_0_226_dirty__0_226____20190926_104426_1.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.facebook.presto.parquet.ParquetCorruptionException: Not valid Parquet file: s3a://redact/s/t/temp2m/0f330796-e679-4ce6-9f42-8a99834b5536_0_20190925041855.parquet expected magic number: [80, 65, 82, 49] got: [-89, -61, -38, -82]
    at com.facebook.presto.parquet.ParquetValidationUtils.validateParquet(ParquetValidationUtils.java:28)
    at com.facebook.presto.parquet.reader.MetadataReader.readFooter(MetadataReader.java:97)
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:184)
    ... 17 more
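The trace bottoms out in MetadataReader.readFooter, which reads the file tail and validates the trailing magic. For anyone inspecting a local copy of a suspect file, here is a simplified Python sketch of that footer check (the layout follows the Parquet format spec: serialized footer metadata, then a 4-byte little-endian metadata length, then "PAR1"; this is not Presto's actual implementation):

```python
# Simplified sketch of Parquet footer validation (not Presto's code).
# Tail layout per the Parquet spec: [footer metadata][4-byte LE length][b"PAR1"]
import struct

MAGIC = b"PAR1"

def read_footer(buf: bytes) -> bytes:
    """Validate both magics and return the serialized footer metadata,
    mimicking the 'expected magic number ... got ...' failure mode."""
    if len(buf) < 12 or buf[:4] != MAGIC:
        raise ValueError("not a Parquet file: bad header magic")
    if buf[-4:] != MAGIC:
        # Report trailing bytes as signed values, Java-style
        got = [b - 256 if b > 127 else b for b in buf[-4:]]
        raise ValueError(f"expected magic number {list(MAGIC)} got {got}")
    meta_len = struct.unpack("<I", buf[-8:-4])[0]
    if meta_len > len(buf) - 12:
        raise ValueError("corrupted footer: metadata length overruns file")
    return buf[-8 - meta_len:-8]
```

A check like this fails identically whether the bytes on disk are wrong or the reader handed it bytes from the wrong position in a healthy file.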
metric varchar,
lat bigint,
lon bigint,
for_dt varchar,
value double,
metric_p varchar
) WITH (
  external_location = 's3a://redact',
  format = 'PARQUET',
  partitioned_by = ARRAY['metric_p']
)
The query reports a different affected Parquet filename each time.
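That the affected filename varies between runs points at the reader rather than the files: if a positioned read lands at the wrong offset (for example through a stream-reuse bug), a perfectly valid file fails the magic check with essentially arbitrary bytes. A hypothetical Python illustration (the file contents below are made up):

```python
# Hypothetical illustration (not Presto's code): a wrong seek offset makes a
# perfectly valid Parquet file fail the magic-number check with arbitrary
# bytes -- the same signature as genuine on-disk corruption.
import io
import struct

metadata = b"\x15\x02\x19\x4c"  # placeholder footer metadata bytes
valid_file = (b"PAR1" + b"column-chunk-data" + metadata
              + struct.pack("<I", len(metadata)) + b"PAR1")

stream = io.BytesIO(valid_file)

stream.seek(-4, io.SEEK_END)      # correct offset: trailing magic matches
assert stream.read(4) == b"PAR1"

stream.seek(-9, io.SEEK_END)      # off-by-a-few seek on the SAME file
assert stream.read(4) != b"PAR1"  # now "fails" the corruption check
```

Which four bytes come back depends on what happens to sit near the bad offset, so each run can implicate a different (intact) file.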
ls -l 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923174642.parquet
-rw-rw-rw-+ 1 ec2-user ec2-user 107219431 Sep 23 17:48 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923174642.parquet
_hoodie_commit_time = 20190923165529 _hoodie_commit_seqno = 20190923165529_0_1 _hoodie_record_key = temp2m#|#59#|#292#|#1979-06-17 04:00:00 _hoodie_partition_path = temp2m _hoodie_file_name = 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923165529.parquet metric = temp2m lat = 59 lon = 292 for_dt = 1979-06-17 04:00:00 value = 274.6221330923361 _row_key = temp2m#|#59#|#292#|#1979-06-17 04:00:00 mypart = temp2m
_hoodie_commit_time = 20190923165529 _hoodie_commit_seqno = 20190923165529_0_2 _hoodie_record_key = temp2m#|#59#|#300#|#1979-06-21 16:00:00 _hoodie_partition_path = temp2m _hoodie_file_name = 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923165529.parquet metric = temp2m lat = 59 lon = 300 for_dt = 1979-06-21 16:00:00 value = 275.6606829875523 _row_key = temp2m#|#59#|#300#|#1979-06-21 16:00:00 mypart = temp2m
_hoodie_commit_time = 20190923165529 _hoodie_commit_seqno = 20190923165529_0_3 _hoodie_record_key = temp2m#|#59#|#319#|#1979-06-04 13:00:00 _hoodie_partition_path = temp2m _hoodie_file_name = 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923165529.parquet metric = temp2m lat = 59 lon = 319 for_dt = 1979-06-04 13:00:00 value = 277.17394318935055 _row_key = temp2m#|#59#|#319#|#1979-06-04 13:00:00 mypart = temp2m
_hoodie_commit_time = 20190923165529 _hoodie_commit_seqno = 20190923165529_0_4 _hoodie_record_key = temp2m#|#59#|#190#|#1979-06-01 02:00:00 _hoodie_partition_path = temp2m _hoodie_file_name = 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923165529.parquet metric = temp2m lat = 59 lon = 190 for_dt = 1979-06-01 02:00:00 value = 276.2245225761865 _row_key = temp2m#|#59#|#190#|#1979-06-01 02:00:00 mypart = temp2m
_hoodie_commit_time = 20190923165529 _hoodie_commit_seqno = 20190923165529_0_5 _hoodie_record_key = temp2m#|#59#|#171#|#1979-06-03 12:00:00 _hoodie_partition_path = temp2m _hoodie_file_name = 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923165529.parquet metric = temp2m lat = 59 lon = 171 for_dt = 1979-06-03 12:00:00 value = 277.17588078243864 _row_key = temp2m#|#59#|#171#|#1979-06-03 12:00:00 mypart = temp2m
/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -jar parquet-tools-1.9.0.jar meta 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923174642.parquet
extra: hoodie_min_record_key = temp2m#|#36#|#0#|#1979-06-01 00:00:00
extra: parquet.avro.schema = {"type":"record","name":"t_record","namespace":"s","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":""},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":""},{"name":"_hoodie_record_key","type":["null","string"],"doc":""},{"name":"_hoodie_partition_path","type":["null","string"],"doc":""},{"name":"_hoodie_file_name","type":["null","string"],"doc":""},{"name":"metric","type":["string","null"]},{"name":"lat","type":["long","null"]},{"name":"lon","type":["long","null"]},{"name":"for_dt","type":["string","null"]},{"name":"value","type":["double","null"]},{"name":"_row_key","type":"string"},{"name":"mypart","type":["string","null"]}]}
extra: hoodie_max_record_key = temp2m#|#60#|#99#|#1979-07-31 23:00:00
file schema: s.t_record
_hoodie_commit_time: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name: OPTIONAL BINARY O:UTF8 R:0 D:1
metric: OPTIONAL BINARY O:UTF8 R:0 D:1
lat: OPTIONAL INT64 R:0 D:1
lon: OPTIONAL INT64 R:0 D:1
for_dt: OPTIONAL BINARY O:UTF8 R:0 D:1
value: OPTIONAL DOUBLE R:0 D:1
_row_key: REQUIRED BINARY O:UTF8 R:0 D:0
mypart: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:6488640 TS:754711984 OFFSET:4
_hoodie_commit_time: BINARY GZIP DO:0 FPO:4 SZ:9889/8167/0.83 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
_hoodie_commit_seqno: BINARY GZIP DO:0 FPO:9893 SZ:16938558/172429678/10.18 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_record_key: BINARY GZIP DO:0 FPO:16948451 SZ:30911549/277059352/8.96 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_partition_path: BINARY GZIP DO:0 FPO:47860000 SZ:4321/3185/0.74 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
_hoodie_file_name: BINARY GZIP DO:0 FPO:47864321 SZ:72799/66978/0.92 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
metric: BINARY GZIP DO:0 FPO:47937120 SZ:4321/3185/0.74 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
lat: INT64 GZIP DO:0 FPO:47941441 SZ:3923/3108/0.79 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
lon: INT64 GZIP DO:0 FPO:47945364 SZ:7304271/7318724/1.00 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
for_dt: BINARY GZIP DO:0 FPO:55249635 SZ:8165017/8204809/1.00 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
value: DOUBLE GZIP DO:0 FPO:63414652 SZ:12459023/12554380/1.01 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
_row_key: BINARY GZIP DO:0 FPO:75873675 SZ:30907500/277057233/8.96 VC:6488640 ENC:BIT_PACKED,PLAIN
mypart: BINARY GZIP DO:0 FPO:106781175 SZ:4321/3185/0.74 VC:6488640 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
[ec2-user@red ~]$ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -jar parquet-tools-1.9.0.jar schema 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923174642.parquet
message s.t_record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  optional binary metric (UTF8);
  optional int64 lat;
  optional int64 lon;
  optional binary for_dt (UTF8);
  optional double value;
  required binary _row_key (UTF8);
  optional binary mypart (UTF8);
}
[ec2-user@red ~]$ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -jar parquet-tools-1.9.0.jar dump -d -n 98e609f0-f77b-4ddf-a186-7b75eb02cedf_0_20190923174642.parquet
row group 0
_hoodie_commit_time: BINARY GZIP DO:0 FPO:4 SZ:9889/8167/0.83 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
_hoodie_commit_seqno: BINARY GZIP DO:0 FPO:9893 SZ:16938558/172429678/10.18 VC:6488640 ENC:PLAIN,BIT_PACKED,RLE
_hoodie_record_key: BINARY GZIP DO:0 FPO:16948451 SZ:30911549/277059352/8.96 VC:6488640 ENC:PLAIN,BIT_PACKED,RLE
_hoodie_partition_path: BINARY GZIP DO:0 FPO:47860000 SZ:4321/3185/0.74 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
_hoodie_file_name: BINARY GZIP DO:0 FPO:47864321 SZ:72799/66978/0.92 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
metric: BINARY GZIP DO:0 FPO:47937120 SZ:4321/3185/0.74 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
lat: INT64 GZIP DO:0 FPO:47941441 SZ:3923/3108/0.79 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
lon: INT64 GZIP DO:0 FPO:47945364 SZ:7304271/7318724/1.00 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
for_dt: BINARY GZIP DO:0 FPO:55249635 SZ:8165017/8204809/1.00 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
value: DOUBLE GZIP DO:0 FPO:63414652 SZ:12459023/12554380/1.01 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
_row_key: BINARY GZIP DO:0 FPO:75873675 SZ:30907500/277057233/8.96 VC:6488640 ENC:PLAIN,BIT_PACKED
mypart: BINARY GZIP DO:0 FPO:106781175 SZ:4321/3185/0.74 VC:6488640 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
cc @zhenxiao @mbasmanova @bhasudha @vinothchandar