iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

CompressedWARCReader does not work for Common Crawl WARC files. #81

YossiTamari closed this issue 5 years ago

YossiTamari commented 5 years ago

When reading a Common Crawl WARC file (e.g. crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz), the reader fails when iterating to the second record. In cleanupCurrentRecord(), close() moves to the end of the record, but gotoEOR(this.currentRecord) then expects one of the next 4 characters to be -1. They are not, because the reader is already at the start of the next record. This results in "unexpected extra data after record" being written to stderr, and no further records are parsed. Removing gotoEOR seems to solve the problem, but I'm not sure I understand the logic behind this code, so a smarter fix may be needed.

anjackson commented 5 years ago

Hm, I'm afraid I can't reproduce this issue. Having downloaded that WARC

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz

And running this code: https://gist.github.com/anjackson/6f89e18e17765930b30bf33b742209f3#file-simplewarcanalyser-java

I find I can parse the records just fine BUT the WARC-Payload-Digest appears to be wrong (the expected value is found in the WARC-Block-Digest). This was also found using jwattools test -e CC-MAIN-20180814062251-20180814082251-00000.warc.gz.

Does your parser code look like this: https://gist.github.com/anjackson/6f89e18e17765930b30bf33b742209f3#file-simplewarcanalyser-java-L36-L45 ?

YossiTamari commented 5 years ago

Thanks @anjackson. Your code helped me find that the problem only occurs if I gunzip the WARC and then gzip it again. I'm not sure what the source of the problem is, but I guess it's not an issue with this library.

anjackson commented 5 years ago

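Glad you've got it working. A WARC.gz file is a series of concatenated gzip members, one per record, so that each record can be decompressed on its own. If you re-compress the whole file as a single gzip stream, the record boundaries no longer line up with gzip member boundaries, and that confuses all the tools.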
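
For anyone who hits the same symptom, one way to check whether a file was re-compressed as a single stream is to count its gzip members. The following is a minimal sketch using only the JDK, not code from webarchive-commons. The class and method names (GzipMemberCount, gzipPerRecord, gzipSingle, countMembers) are made up for this illustration, and it loads the whole input into memory, so it only suits small test inputs rather than full Common Crawl files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical helper class for illustration; not part of webarchive-commons.
class GzipMemberCount {

    /** Compresses each record as its own gzip member and concatenates them (the WARC.gz layout). */
    static byte[] gzipPerRecord(byte[]... records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] record : records) {
            // Closing a ByteArrayOutputStream is a no-op, so the buffer can be reused afterwards.
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(record);
            }
        }
        return out.toByteArray();
    }

    /** Compresses all records into one gzip member (what "gunzip, then gzip again" produces). */
    static byte[] gzipSingle(byte[]... records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            for (byte[] record : records) {
                gz.write(record);
            }
        }
        return out.toByteArray();
    }

    /** Skips one gzip member header (RFC 1952) and returns the offset of the deflate data. */
    static int skipHeader(byte[] d, int pos) throws DataFormatException {
        if (pos + 10 > d.length || (d[pos] & 0xff) != 0x1f || (d[pos + 1] & 0xff) != 0x8b) {
            throw new DataFormatException("not a gzip member at offset " + pos);
        }
        int flags = d[pos + 3] & 0xff;
        pos += 10; // fixed-size part of the header
        if ((flags & 0x04) != 0) { // FEXTRA: 2-byte little-endian length, then that many bytes
            pos += 2 + ((d[pos] & 0xff) | ((d[pos + 1] & 0xff) << 8));
        }
        if ((flags & 0x08) != 0) { while (d[pos++] != 0) { } } // FNAME: zero-terminated
        if ((flags & 0x10) != 0) { while (d[pos++] != 0) { } } // FCOMMENT: zero-terminated
        if ((flags & 0x02) != 0) { pos += 2; }                 // FHCRC: 2-byte header CRC
        return pos;
    }

    /** Counts the gzip members in the given bytes by inflating each one in turn. */
    static int countMembers(byte[] data) throws DataFormatException {
        int pos = 0;
        int members = 0;
        byte[] sink = new byte[8192]; // decompressed bytes are discarded
        while (pos < data.length) {
            pos = skipHeader(data, pos);
            Inflater inflater = new Inflater(true); // true = raw deflate, no zlib wrapper
            try {
                inflater.setInput(data, pos, data.length - pos);
                while (!inflater.finished()) {
                    int n = inflater.inflate(sink);
                    if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                        throw new DataFormatException("truncated deflate stream");
                    }
                }
                // Unconsumed input starts right after this member's deflate data.
                pos = data.length - inflater.getRemaining();
            } finally {
                inflater.end();
            }
            pos += 8; // trailer: CRC32 + ISIZE
            members++;
        }
        return members;
    }

    public static void main(String[] args) throws Exception {
        byte[] a = "WARC/1.0\r\n\r\nfirst record".getBytes(StandardCharsets.UTF_8);
        byte[] b = "WARC/1.0\r\n\r\nsecond record".getBytes(StandardCharsets.UTF_8);
        System.out.println("per-record: " + countMembers(gzipPerRecord(a, b)));
        System.out.println("single:     " + countMembers(gzipSingle(a, b)));
    }
}
```

A file that holds many WARC records but reports only one member has most likely been re-compressed as a single stream, which would explain the behaviour described above.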