iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

IllegalArgumentException on ARC Parsing #83

Closed by gleporeNARA 6 months ago

gleporeNARA commented 7 months ago

Many thanks for fixing the issue with the newlines before records in ARC files. Tika is now able to process the files, with the exceptions below.

Some files are giving an error:

NARA-PEOT-2004-20041109221858-00339-crawling004.archive.org.arc.gz

java.lang.IllegalArgumentException: parse error at position 20: text/html;ISO-8859-1<-- HERE -->
    at org.netpreserve.jwarc.MediaType.parse(MediaType.java:386)
    at java.base/java.util.Optional.map(Optional.java:260)
    at org.netpreserve.jwarc.Message.contentType(Message.java:61)
    at org.netpreserve.jwarc.WarcResponse$1.type(WarcResponse.java:71)
    at java.base/java.util.Optional.map(Optional.java:260)
    at org.netpreserve.jwarc.WarcResponse.payloadType(WarcResponse.java:62)
    at org.apache.tika.parser.warc.WARCParser.processResponse(WARCParser.java:135)
    ...lots of other Tika messages

These files were all created by the Internet Archive back in 2004. Attached is the file that produced the above error.

NARA-PEOT-2004-20041109221858-00339-crawling004.archive.org.arc.gz

ato commented 6 months ago

Released v0.29.0 which adds MediaType.parseLeniently() and uses it in Message.contentType().

In this case, the invalid parameter that is missing its "=" is simply ignored instead of causing an IllegalArgumentException. When using the lenient parser, validity can be checked with mediaType.isValid() and the original string can be accessed with mediaType.raw().
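
To make the lenient behaviour concrete, here is a minimal, self-contained sketch of the idea. It is not jwarc's implementation: the class LenientMediaType and everything inside it are hypothetical, and only the method names parseLeniently(), isValid() and raw() are borrowed from the comment above for readability. It assumes a simple split on ";" and does not handle quoted parameter values.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of lenient media type parsing; NOT jwarc's MediaType. */
final class LenientMediaType {
    private final String raw;
    private final String type;
    private final String subtype;
    private final Map<String, String> parameters;
    private final boolean valid;

    private LenientMediaType(String raw, String type, String subtype,
                             Map<String, String> parameters, boolean valid) {
        this.raw = raw;
        this.type = type;
        this.subtype = subtype;
        this.parameters = Collections.unmodifiableMap(parameters);
        this.valid = valid;
    }

    /** Parses without throwing: malformed parameters are skipped and recorded as invalid. */
    static LenientMediaType parseLeniently(String raw) {
        // Simplification: quoted values containing ';' are not supported here.
        String[] parts = raw.split(";");
        boolean valid = true;
        String type = "";
        String subtype = "";

        String essence = parts.length > 0 ? parts[0].trim() : "";
        int slash = essence.indexOf('/');
        if (slash > 0 && slash < essence.length() - 1) {
            type = essence.substring(0, slash).toLowerCase(Locale.ROOT);
            subtype = essence.substring(slash + 1).toLowerCase(Locale.ROOT);
        } else {
            valid = false; // no usable "type/subtype" pair
        }

        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.isEmpty()) {
                continue; // tolerate empty segments such as a doubled ';'
            }
            int eq = p.indexOf('=');
            if (eq <= 0) {
                valid = false; // e.g. "ISO-8859-1" with no "name=" part: skip it
                continue;
            }
            String name = p.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            params.put(name, p.substring(eq + 1).trim());
        }
        return new LenientMediaType(raw, type, subtype, params, valid);
    }

    String type() { return type; }
    String subtype() { return subtype; }
    Map<String, String> parameters() { return parameters; }
    boolean isValid() { return valid; }
    String raw() { return raw; }
}
```

With this sketch, parsing the header value from the stack trace, "text/html;ISO-8859-1", yields type "text" and subtype "html" with no parameters, isValid() returns false, and raw() returns the original string unchanged. A caller that needs strictness can check isValid() and fall back to raw() for logging.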