YossiTamari closed this issue 5 years ago
Hm, I'm afraid I can't reproduce this issue. Having downloaded that WARC
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz
And running this code: https://gist.github.com/anjackson/6f89e18e17765930b30bf33b742209f3#file-simplewarcanalyser-java
I find I can parse the records just fine, BUT the WARC-Payload-Digest
appears to be wrong (the expected value is found in the WARC-Block-Digest).
This was also found using jwattools test -e CC-MAIN-20180814062251-20180814082251-00000.warc.gz
Does your parser code look like this: https://gist.github.com/anjackson/6f89e18e17765930b30bf33b742209f3#file-simplewarcanalyser-java-L36-L45 ?
Thanks @anjackson. Your code helped me find that the problem only occurs if I gunzip the WARC and then gzip it again. I'm not sure what the source of the problem is, but I guess it's not an issue with this library.
Glad you've got it working. A WARC.gz file is a series of concatenated gzip members (normally one per record), and if you re-compress it as a single gzip stream it confuses all the tools.
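To illustrate the difference, here is a minimal sketch using only java.util.zip, not this library's or JWAT's own API. The class and method names (PerRecordGzip, compressPerRecord, compressAsSingleMember, gunzipFrom) and the toy record strings are made up for the example. It writes each record as its own gzip member, so that the byte offset of every record is a valid place to start decompressing, and compares that with compressing everything as one stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PerRecordGzip {

    /** Compresses each record as a separate gzip member; offsets[i] is where member i starts. */
    public static byte[] compressPerRecord(String[] records, int[] offsets) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.length; i++) {
            offsets[i] = out.size();
            GZIPOutputStream gz = new GZIPOutputStream(out); // writes a fresh gzip header
            gz.write(records[i].getBytes(StandardCharsets.UTF_8));
            gz.finish(); // writes this member's trailer without closing 'out'
        }
        return out.toByteArray();
    }

    /** What "gunzip, then gzip again" produces: one gzip member holding every record. */
    public static byte[] compressAsSingleMember(String[] records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            for (String record : records) {
                gz.write(record.getBytes(StandardCharsets.UTF_8));
            }
        }
        return out.toByteArray();
    }

    /** Decompresses from the given byte offset to the end of the data. */
    public static String gunzipFrom(byte[] data, int offset) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Toy stand-ins for WARC records, not real WARC content.
        String[] records = {"record-one\r\n\r\n", "record-two\r\n\r\n", "record-three\r\n\r\n"};
        int[] offsets = new int[records.length];
        byte[] perRecord = compressPerRecord(records, offsets);
        byte[] single = compressAsSingleMember(records);

        // Per-record file: decompression can start at the second record's offset.
        System.out.print(gunzipFrom(perRecord, offsets[1]));

        // Single-member file: same bytes once fully decompressed, but no record boundaries.
        System.out.println(gunzipFrom(single, 0).equals(gunzipFrom(perRecord, 0)));
    }
}
```

Both files decompress to the same bytes, which is why plain gunzip does not show any problem. The difference is only in the compressed layout: a reader that relies on a gzip member ending where a record ends will no longer find those boundaries after the file has been re-compressed as one stream.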
When reading a Common Crawl WARC file (e.g. crawl-data/CC-MAIN-2018-34/segments/1534221208676.20/warc/CC-MAIN-20180814062251-20180814082251-00000.warc.gz), iterating to the second record fails. In cleanupCurrentRecord(), close() moves to the end of the record, but then gotoEOR(this.currentRecord) expects one of the next 4 characters to be -1. They are not: it is just reading the start of the next record. This results in "unexpected extra data after record" being written to stderr, followed by a failure to parse any more records. It seems like removing gotoEOR would solve the problem, but I'm not sure I understand the logic behind this code, so maybe a smarter fix is needed.