ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Make the HdfsFileHasher H3/H020 compatible #272

Closed. anjackson closed this issue 1 year ago.

anjackson commented 2 years ago

Trying to use the HdfsFileHasher and seeing:

2021-11-22 12:21:08,011 INFO [main] org.apache.hadoop.mapred.MapTask: Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader@7a7471ce
java.lang.NullPointerException
    at uk.bl.wa.hadoop.mapreduce.lib.input.ByteBlockRecordReader.close(ByteBlockRecordReader.java:52)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:536)
    at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:2075)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:809)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)
2021-11-22 12:21:08,012 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2021-11-22 12:21:08,019 INFO [main] org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2021-11-22 12:21:08,019 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
2021-11-22 12:21:08,027 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
    at uk.bl.wa.hadoop.mapreduce.lib.input.ByteBlockRecordReader.initialize(ByteBlockRecordReader.java:77)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:561)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:348)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

The IncompatibleClassChangeError is the usual symptom of running code compiled against one Hadoop major version on another: org.apache.hadoop.mapreduce.TaskAttemptContext was a class in older releases but is an interface in newer ones. Presumably, given that the indexer runs fine on the older org.apache.hadoop.mapred API, the simplest approach is to port this record reader to that API rather than keep it on the newer org.apache.hadoop.mapreduce API.
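For illustration, here is a minimal sketch of what a port to the older org.apache.hadoop.mapred API could look like. The class name, the choice of Text/BytesWritable as key/value, and the 1 MiB block size are assumptions for the example, not the project's actual ByteBlockRecordReader design; the point is only that the old RecordReader interface never touches TaskAttemptContext.

```java
// Hypothetical sketch only: an old-API (org.apache.hadoop.mapred) record reader
// that emits the file path as key and fixed-size byte blocks as value.
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class MapredByteBlockRecordReader implements RecordReader<Text, BytesWritable> {

    private static final int BLOCK_SIZE = 1024 * 1024; // 1 MiB per record (assumed)

    private final FSDataInputStream in;
    private final String path;
    private final long length;
    private long pos = 0;

    public MapredByteBlockRecordReader(InputSplit split, JobConf job) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Path file = fileSplit.getPath();
        this.path = file.toString();
        this.length = fileSplit.getLength();
        FileSystem fs = file.getFileSystem(job);
        this.in = fs.open(file);
    }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (pos >= length) {
            return false; // no more blocks in this split
        }
        int toRead = (int) Math.min(BLOCK_SIZE, length - pos);
        byte[] buffer = new byte[toRead];
        in.readFully(pos, buffer, 0, toRead);
        pos += toRead;
        key.set(path);
        value.set(buffer, 0, toRead);
        return true;
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() {
        return pos;
    }

    @Override
    public float getProgress() {
        return length == 0 ? 1.0f : (float) pos / (float) length;
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}
```

The corresponding InputFormat would hand this reader out from getRecordReader(InputSplit, JobConf, Reporter), the old-API equivalent of the mapreduce createRecordReader/initialize pair that is failing above.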

anjackson commented 2 years ago

Less urgent, as we have a Python one.