ParsingException when reading ClueWeb09 files

gijshendriksen commented 2 months ago

Hi, I would like to use jwarc to parse the files in the ClueWeb09 collection. However, for some of the files (for instance, parts/ClueWeb09_English_1/en0000/09.warc.gz), parsing fails with the following exception:

org.netpreserve.jwarc.ParsingException: invalid WARC record at position 79: ...rget-URI: http://2fered.tistory.com/tag/<-- HERE -->\x08\xffffffc3\xffffff80\xffffffc2\xffffffa4\xffffffc2\xffffffb8\xffffffc2\xffffffac\r\nWARC-Warcinfo-ID: 5fdd2301-6c...
    at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:315)
    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:159)
    ...

There are some known encoding issues with the ClueWeb09 data (e.g. as discussed in this paper), so that may well be the underlying issue here. Is there a way for jwarc to deal with such 'dirty' web content? Or could it be caused by something else?

Note: I just saw the closely related issue #26, but that one seems to have a different cause. Here the ParsingException occurs in the middle of the WARC-Target-URI field, not at the CRLF at the end of a line.
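
A side note on reading the exception message: the `\xffffffc3`-style escapes look like sign-extended bytes, so the bytes after the marker are presumably 08 c3 80 c2 a4 c2 b8 c2 ac. The following is a minimal, stdlib-only sketch for inspecting them. It makes no claim about what jwarc does internally, and the class and method names (`HeaderBytes`, `isWellFormedUtf8`, and so on) are made up for illustration. It checks two things: whether the bytes decode as strict UTF-8, and whether they contain a control byte such as 0x08.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class HeaderBytes {
    // Bytes copied from the exception message, assuming each "\xffffffNN"
    // is a single byte 0xNN that was sign-extended when printed.
    static final byte[] SUSPECT = {
        0x08, (byte) 0xC3, (byte) 0x80, (byte) 0xC2, (byte) 0xA4,
        (byte) 0xC2, (byte) 0xB8, (byte) 0xC2, (byte) 0xAC
    };

    // Widening a negative byte to int sign-extends it, giving "ffffffc3".
    static String signExtendedHex(byte b) {
        return Integer.toHexString(b);
    }

    // Masking with 0xFF keeps only the low eight bits, giving "c3".
    static String maskedHex(byte b) {
        return Integer.toHexString(b & 0xFF);
    }

    // Strict decode: malformed or unmappable input is reported, not replaced.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if any byte is an ASCII control character (below 0x20, or 0x7F).
    static boolean hasControlByte(byte[] bytes) {
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v < 0x20 || v == 0x7F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("sign-extended: " + signExtendedHex((byte) 0xC3));
        System.out.println("masked:        " + maskedHex((byte) 0xC3));
        System.out.println("well-formed UTF-8: " + isWellFormedUtf8(SUSPECT));
        System.out.println("has control byte:  " + hasControlByte(SUSPECT));
    }
}
```

Under that reading of the message, the byte sequence itself decodes as UTF-8, and the first byte after the marker is the control character 0x08. Which of the two the parser objected to is not something this sketch can tell.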

wumpus commented 2 months ago

Looks like there's no defense against invalid UTF-8 in the URL? Not surprised to see that in a WARC; there have been occasions in the past where Common Crawl has written such bad URLs in our WARCs 😅

gijshendriksen commented 2 months ago

@ato thanks for the incredibly quick fix in 9771f23! I have tested it on the WARC file mentioned above, and it now successfully parses the whole file.

ato commented 2 months ago

Glad that worked. I've released it as version 0.30.0; it should sync to Maven Central in an hour or so.

iipc / jwarc

ParsingException when reading ClueWeb09 files #86