ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Bug due to old version of heritrix-commons/webarchive-commons #34

Closed willp-bl closed 10 years ago

willp-bl commented 10 years ago

When using uncompressed ARC files with Nanite, which uses warc-hadoop-recordreaders, an exception is thrown.

ARCReaderFactory in 3.1.0/3.1.1 has two methods to open an ARC, only one of which tests for compression. (See https://github.com/iipc/heritrix3/blob/3.1.1/commons/src/main/java/org/archive/io/arc/ARCReaderFactory.java#L107). The method that tests for compression is not called, thus an exception is thrown.
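
For illustration only, the kind of compression test described above can be sketched as a check for the two gzip magic bytes (0x1f, 0x8b) at the start of the file. This is not the ARCReaderFactory code; the class `GzipSniffer` and its method names are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; not part of heritrix-commons or webarchive-commons. */
final class GzipSniffer {
    private GzipSniffer() {}

    /** Returns true if the buffer starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksGzipped(byte[] header) {
        return header != null
                && header.length >= 2
                && (header[0] & 0xff) == 0x1f
                && (header[1] & 0xff) == 0x8b;
    }

    /** Reads the first two bytes of a file and applies the same check. */
    static boolean fileLooksGzipped(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();   // -1 if the file is empty
            int second = in.read();  // -1 if the file has only one byte
            return first == 0x1f && second == 0x8b;
        }
    }
}
```

A caller could use a check like this to decide whether to open the file as a compressed or an uncompressed ARC, rather than assuming compression.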

This appears to be fixed in webarchive-commons and the master branch of heritrix-commons.

So - is it possible for the dependency to be updated?

willp-bl commented 10 years ago

I see this change has been made in the 2.0-dev branch.

willp-bl commented 10 years ago

heritrix-commons 3.1.2-SNAPSHOT works

anjackson commented 10 years ago

I assume you mean heritrix-commons 3.1.2-SNAPSHOT works if added to the master branch?

Development is on 2.0.0 at the moment, and that branch should work independently of H3 (unless I've missed something).

Note also that Common Crawl links to useful tools, including record readers built against a more recent Hadoop API.

willp-bl commented 10 years ago

Thanks for the tips, will try and use 2.0.0-SNAPSHOT for now

anjackson commented 10 years ago

Ah, I got the link wrong for those record readers:

https://github.com/Smerity/cc-warc-examples/tree/master/src/org/commoncrawl/warc