internetarchive / ia-hadoop-tools

21 stars · 27 forks

incompatibility with Hadoop 2 #1

Closed rjoberon closed 10 years ago

rjoberon commented 11 years ago

When running pig with a simple script

REGISTER 'lib/ia-hadoop-tools-jar-with-dependencies.jar';
titles = LOAD 'WEB-20130211162744502-00000-3622~localhost~4321.warc.gz'
         USING org.archive.hadoop.ArchiveMetadataLoader();
foo = LIMIT titles 1;
dump foo;

I get the error

2013-08-06 11:53:27,869 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

In my experience with other Pig libraries, this error is caused by a version mismatch; see https://github.com/twitter/hadoop-lzo/issues/56

It seems that ia-hadoop-tools is built against Hadoop 1 (0.20.x), while I am using Hadoop 2, the latest version. The elephant-bird library solved this by providing a compatibility wrapper that supports both Hadoop 1 and 2.
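For context, the underlying problem is that `TaskAttemptContext` was a class in Hadoop 1 but became an interface in Hadoop 2, so bytecode compiled against one binding fails at runtime against the other. One common workaround (similar in spirit to elephant-bird's compatibility layer) is to resolve the call reflectively at runtime instead of binding at compile time. A minimal sketch of that technique; `ContextShim` and `FakeContext` are illustrative names and not part of either library:

```java
import java.lang.reflect.Method;

public class ContextShim {
    // Look up getConfiguration() on the runtime type of the context object,
    // so the same jar works whether TaskAttemptContext is a class (Hadoop 1)
    // or an interface (Hadoop 2).
    public static Object getConfiguration(Object ctx) {
        try {
            Method m = ctx.getClass().getMethod("getConfiguration");
            return m.invoke(ctx);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("no getConfiguration() on " + ctx.getClass(), e);
        }
    }

    // Stand-in for a Hadoop context, just to demonstrate the shim.
    public static class FakeContext {
        public String getConfiguration() { return "conf"; }
    }

    public static void main(String[] args) {
        System.out.println(getConfiguration(new FakeContext())); // prints "conf"
    }
}
```

Because the method lookup happens at runtime, the compiled shim carries no class-vs-interface assumption in its bytecode, which is exactly the assumption that triggers the `Found interface ... but class was expected` error.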

Is something planned for ia-hadoop-tools, too?

rjoberon commented 10 years ago

I was able to solve this issue by using the code from the ia-web-commons project:

git clone git@github.com:internetarchive/ia-web-commons.git
cd ia-web-commons
mvn -f pom-cdh3.xml install

Using the resulting JAR with dependencies, I was able to run a Pig job to extract data from WARC files.
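For reference, a sketch of the original script with the rebuilt jar registered instead (the jar filename and `target/` path below are assumptions; check what the Maven build actually produces):

```pig
REGISTER 'target/ia-web-commons-jar-with-dependencies.jar';
titles = LOAD 'WEB-20130211162744502-00000-3622~localhost~4321.warc.gz'
         USING org.archive.hadoop.ArchiveMetadataLoader();
foo = LIMIT titles 1;
dump foo;
```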

vinaygoel commented 10 years ago

Robert, you might be interested in the Archive Analysis Workshop page I've set up: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop

rjoberon commented 10 years ago

Thanks a lot! This looks great and my student is already going over your tutorial and using your code.

I have a question, however: why are you parsing the textual content of WARC files into separate files and then processing those? Wouldn't it be more practical (in the long run) to extend the ArchiveJSONViewLoader (and the underlying classes) to also support the HTTP response body? I added this as issue #8 to the archive-commons project.

The background is that we want to analyze the content of several terabytes of WARC files, and we would rather not copy their contents into separate files unless necessary.

(Note: this is probably not the right place to discuss this - you can find my e-mail address on my homepage)

vinaygoel commented 10 years ago

I included the parsed text generation because we already generate parsed text for our search index (as do many other partner institutions). I agree that in a use case like yours, it makes complete sense to parse out and analyze the content directly from the WARC files.

I'm on vacation for the next few days, but will look into your issue/feature request when I'm back.

rjoberon commented 10 years ago

Have you already had a chance to look into the issue of using the textual content of WARC files? We would be interested to know what your plans are in that direction.

vinaygoel commented 10 years ago

Hi,

I'm sorry I haven't had time to take a look at this yet. Was out on vacation and then ended up falling sick for a couple of weeks. Will take a look in the coming days and keep you posted.

Thanks, Vinay


rjoberon commented 10 years ago

Thanks!