iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

UncheckedIOException, unexpected end of gzip #85

Closed gleporeNARA closed 6 months ago

gleporeNARA commented 6 months ago

Also for https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041014205819-00000-crawling009-c_NARA-PEOT-2004-20041014230214-00043-crawling009.archive.org.arc.gz

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.warc.WARCParser@26d7cb0d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:1069)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:493)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
Caused by: java.io.UncheckedIOException: java.io.EOFException: unexpected end of gzip stream
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:329)
    at org.apache.tika.parser.warc.WARCParser.parse(WARCParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 5 more
Caused by: java.io.EOFException: unexpected end of gzip stream
    at org.netpreserve.jwarc.GunzipChannel.read(GunzipChannel.java:53)
    at org.netpreserve.jwarc.LengthedBody.consume(LengthedBody.java:144)
    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:147)
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:327)
    ... 7 more

gleporeNARA commented 6 months ago

Closing this one, possibly a bad download.
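
For anyone hitting the same trace: one way to test the "bad download" theory before involving jwarc or Tika is to decompress the whole file with the JDK's own `java.util.zip.GZIPInputStream` and see whether it reaches the end cleanly. The sketch below is only an illustration, not part of jwarc. The class name `GzipCheck` and the method `isCompleteGzip` are made up for this example. It assumes that a truncated download surfaces as an `EOFException` while reading, and it will not notice a file cut exactly at a gzip member boundary.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of jwarc or Tika.
public class GzipCheck {

    /**
     * Decompresses the whole file and discards the output.
     * Returns false if the stream ends prematurely (EOFException),
     * true if it decompresses to the end. Other I/O errors, such as
     * corrupt data, still propagate as IOException.
     */
    static boolean isCompleteGzip(Path file) throws IOException {
        byte[] buffer = new byte[8192];
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            while (in.read(buffer) != -1) {
                // Output is discarded; only reaching the end matters.
            }
            return true;
        } catch (EOFException e) {
            // The gzip data stopped before the stream was finished.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(isCompleteGzip(file) ? "complete" : "truncated");
    }
}
```

If this reports `truncated` for the local copy, re-downloading the file and comparing its size with the remote object is probably the next thing to try.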