Closed by willp-bl 10 years ago
I see this change has been made in the 2.0-dev branch.
heritrix-commons 3.1.2-SNAPSHOT works.
I assume you mean heritrix-commons 3.1.2-SNAPSHOT works if added to the master branch?
Development is on 2.0.0 at the moment, and that branch should work independently of H3 (unless I've missed something).
Note also that Common Crawl links to useful tools, including record readers built against a more recent Hadoop API.
Thanks for the tips, I will try to use 2.0.0-SNAPSHOT for now.
Ah, I got the link wrong for those record readers:
https://github.com/Smerity/cc-warc-examples/tree/master/src/org/commoncrawl/warc
When using uncompressed ARC files with Nanite, which uses warc-hadoop-recordreaders, an exception is thrown.
ARCReaderFactory in 3.1.0/3.1.1 has two methods to open an ARC, only one of which tests for compression. (See https://github.com/iipc/heritrix3/blob/3.1.1/commons/src/main/java/org/archive/io/arc/ARCReaderFactory.java#L107). The method that tests for compression is not called, thus an exception is thrown.
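To illustrate the kind of check involved, here is a minimal sketch of opening a stream that may or may not be gzipped by peeking at the gzip magic bytes (0x1f 0x8b). This is not the Heritrix or webarchive-commons code; the class `GzipSniffer` and its method names are hypothetical, and it uses only the JDK.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Heritrix or webarchive-commons.
final class GzipSniffer {
    // Every gzip member starts with these two bytes (RFC 1952).
    private static final int GZIP_MAGIC_0 = 0x1f;
    private static final int GZIP_MAGIC_1 = 0x8b;

    private GzipSniffer() {}

    // Peeks at the first two bytes and rewinds, so the caller still sees the whole stream.
    static boolean looksGzipped(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(2);
        int b0 = in.read();
        int b1 = in.read();
        in.reset();
        return b0 == GZIP_MAGIC_0 && b1 == GZIP_MAGIC_1;
    }

    // Wraps the raw stream in a GZIPInputStream only when the magic bytes are present.
    static InputStream openMaybeCompressed(InputStream raw) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(raw);
        return looksGzipped(buffered) ? new GZIPInputStream(buffered) : buffered;
    }
}
```

A caller that always goes through something like `openMaybeCompressed` gets the same bytes back whether the ARC is compressed or not, which is the behaviour missing from the code path described above.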
This appears to be fixed in webarchive-commons and the master branch of heritrix-commons.
So, is it possible for the dependency to be updated?