iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

UncheckedIOException, invalid WARC record error #84

Closed by gleporeNARA 5 months ago

gleporeNARA commented 6 months ago

I'm getting an "invalid WARC record" error for this file:

https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041015060312-00241-crawling009-c_NARA-PEOT-2004-20041015071841-00279-crawling009.archive.org.arc.gz

Here's the error:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.warc.WARCParser@5e85c21b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:1069)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:493)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
Caused by: java.io.UncheckedIOException: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 71: ...og/lm_requestform.cfm?cFileno=WEL 802.02<-- HERE -->(B)&cDocno=LTSM00012740&cLoc=Internet%20...
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:329)
    at org.apache.tika.parser.warc.WARCParser.parse(WARCParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 5 more
Caused by: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 71: ...og/lm_requestform.cfm?cFileno=WEL 802.02<-- HERE -->(B)&cDocno=LTSM00012740&cLoc=Internet%20...
    at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:315)
    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:159)
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:327)

gleporeNARA commented 5 months ago

Following up on this: I analyzed the 58,901 ARC files I'm working with, and 10,763 of them throw this exception. My worry is that the exception might be preventing access to records further along in the ARC file, which would be bad for my project. The analysis also revealed tons of "ERROR: invalid HTTP header Content-Length" errors, but I'm not sure what effect those have on processing the data.

These files were created by the Internet Archive back in 2004, so presumably they are correctly formatted (at least according to their interpretation of the spec).

Thanks.

ato commented 5 months ago

This record has an invalid URL containing spaces, so parts of the URL end up in the wrong field.
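To illustrate the field shift: an ARC v1 URL-record header is a single line of five space-separated fields (URL, IP address, archive date, content type, length), so a raw space inside the URL yields one field too many. The sketch below is only an illustration, not how jwarc parses headers. The class name, the method names, and the sample header line (URL, IP, date, and length) are all made up for the example. `splitFromRight` shows one possible lenient workaround: peel the last four fields off the right and treat whatever remains as the URL.

```java
import java.util.Arrays;

/** Sketch: how a raw space in the URL shifts the fields of an ARC v1 header line. */
class ArcHeaderSketch {
    // ARC v1 URL-record header: URL IP-address Archive-date Content-type Archive-length
    static final int EXPECTED_FIELDS = 5;

    /** Naive parse: split on every space, so a space in the URL adds a field. */
    static String[] naiveSplit(String headerLine) {
        return headerLine.split(" ");
    }

    /**
     * Lenient parse (an assumption for illustration, not jwarc's behaviour):
     * take the last four fields from the right; the remainder is the URL.
     */
    static String[] splitFromRight(String headerLine) {
        String[] fields = new String[EXPECTED_FIELDS];
        int end = headerLine.length();
        for (int i = EXPECTED_FIELDS - 1; i >= 1; i--) {
            int space = headerLine.lastIndexOf(' ', end - 1);
            if (space < 0) {
                throw new IllegalArgumentException("too few fields: " + headerLine);
            }
            fields[i] = headerLine.substring(space + 1, end);
            end = space;
        }
        fields[0] = headerLine.substring(0, end); // URL, possibly containing spaces
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical header line; the URL contains one raw space.
        String line = "http://example.org/form.cfm?cFileno=WEL 802.02(B)&cDocno=X "
                + "192.0.2.1 20041015060312 text/html 1234";

        String[] naive = naiveSplit(line);
        System.out.println(naive.length + " fields: " + Arrays.toString(naive));

        String[] lenient = splitFromRight(line);
        System.out.println(lenient.length + " fields: " + Arrays.toString(lenient));
    }
}
```

With the naive split, the second half of the URL lands where the IP address should be and every later field moves one slot to the right. The right-anchored split is only a heuristic: it would misparse a header whose content type itself contains a space.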

gleporeNARA commented 5 months ago

Right, I see that, thanks! I am pursuing this with the Internet Archive, as they created the file. Oddly enough, the CDX file they sent that corresponds to this file has the space correctly encoded. Closing.
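For comparing ARC URLs against the CDX entries, a minimal sketch of encoding only the raw spaces might look like the following. It assumes the CDX writes the space as `%20`, which the thread does not state. The class name, the method names, and the sample URL are hypothetical.

```java
/** Sketch: detect and percent-encode raw spaces in a URL, leaving the rest untouched. */
class UrlSpaceSketch {
    /** True if the URL contains at least one unencoded space. */
    static boolean hasRawSpace(String url) {
        return url.indexOf(' ') >= 0;
    }

    /** Replace each raw space with %20; existing escapes such as %20 are not re-encoded. */
    static String encodeSpaces(String url) {
        return url.replace(" ", "%20");
    }

    public static void main(String[] args) {
        // Hypothetical URL mixing a raw space with an already-encoded one.
        String url = "http://example.org/form.cfm?cFileno=WEL 802.02(B)&cLoc=Internet%20Site";
        System.out.println(hasRawSpace(url));
        System.out.println(encodeSpaces(url));
    }
}
```

Only the space character is replaced because the URL in the error message already contains `%20` escapes, and a general-purpose encoder would turn those into `%2520`.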