internetarchive / ia-hadoop-tools


running CDXGenerator with certain environment variables set does not work #6

Open rjoberon opened 9 years ago

rjoberon commented 9 years ago

I am trying to ensure that the libraries packaged in this project's ia-hadoop-tools-jar-with-dependencies.jar are loaded before the libraries that ship with Hadoop (because the latter contain an incompatible Guava version). Since I am on Hadoop 2, I use the environment variables HADOOP_USER_CLASSPATH_FIRST and HADOOP_CLASSPATH to do this:

JAR=/home/jaeschke/warc_test/ia-hadoop-tools/target/ia-hadoop-tools-jar-with-dependencies.jar
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=$JAR
yarn jar $JAR CDXGenerator -soft $DATA_DIR/derived-data/cdx/ "$DATA_DIR/*.warc.gz"

However, this script always fails with the following output:

14/12/17 21:25:41 INFO jobs.CDXGenerator: No input files to CDXGenerator.

It seems that CDXGenerator does not receive the command-line arguments, and I have no idea why. One guess is that a different library is now being used to parse the command-line arguments, or that the library is missing altogether.
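One way to test that guess is to print which JAR a given class is actually loaded from when the job's classpath is in effect. The following is only a minimal sketch: `WhichJar` is a hypothetical helper that is not part of ia-hadoop-tools, and the class names passed to it (for example Guava's `com.google.common.base.Preconditions`) are whatever you suspect of being shadowed.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of ia-hadoop-tools.
// It reports the location (usually a JAR) a class was loaded from.
class WhichJar {

    // Returns the code source location of cls, or "bootstrap" for
    // classes loaded by the JVM's bootstrap loader (no code source).
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return src == null ? "bootstrap" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        // Each argument is a fully qualified class name to look up.
        for (String name : args) {
            try {
                System.out.println(name + " -> " + locationOf(Class.forName(name)));
            } catch (ClassNotFoundException | LinkageError e) {
                System.out.println(name + " -> not loadable: " + e);
            }
        }
    }
}
```

Run with the same classpath settings as the failing job, this should show whether a class resolves to the uber JAR or to one of Hadoop's bundled JARs. It only reports what the client-side JVM sees, so it may not reflect the classpath inside the task containers.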

rjoberon commented 9 years ago

This might be caused by bugs in Hadoop and HDFS, or by the fact that webarchive-commons depends on hadoop-core version 0.20.2-cdh3u6 (while we run CDH5 with version 2.5.0-mr1-cdh5.2.1). I tried to solve the problem with the Maven Shade plugin, unfortunately without success. I first thought this was because the Shade plugin only affects the classes of the current project and not the classes pulled into the uber JAR from dependencies, but this answer suggests that this is not true.

These incompatibilities with Guava seem to be a common problem that does not have a good solution yet. :-(