Closed rjoberon closed 10 years ago
I was able to solve this issue by using the code from the ia-web-commons project:
git clone git@github.com:internetarchive/ia-web-commons.git
cd ia-web-commons
mvn -f pom-cdh3.xml install
Using the resulting JAR (with dependencies), I was able to run a Pig job to extract data from WARC files.
Robert, you might be interested in the Archive Analysis Workshop page I've set up: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Analysis+Workshop
Thanks a lot! This looks great and my student is already going over your tutorial and using your code.
I have a question, however: why are you parsing the textual content of WARC files into separate files and processing those? Wouldn't it be more practical (in the long run) to extend the ArchiveJSONViewLoader (and the underlying classes) to also support the HTTP response body? I added this as issue #8 to the archive-commons project.
The background is that we want to analyze the content of several terabytes of WARC files and we don't want to copy their contents into separate files unless necessary.
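To illustrate what "supporting the HTTP response body" would involve at the lowest level: a WARC response record carries the raw HTTP response, so the body starts after the first blank line (CRLF CRLF) that ends the HTTP headers. The following is only a minimal sketch under that assumption. The class and method names (HttpPayloadSplitter, bodyOffset, body) are hypothetical and are not part of ArchiveJSONViewLoader or archive-commons.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of archive-commons: splits a raw HTTP
// response (as stored in a WARC response record) into headers and body.
final class HttpPayloadSplitter {

    // Returns the index of the first body byte, i.e. the position right
    // after the first CRLF CRLF sequence, or -1 if no such sequence exists.
    static int bodyOffset(byte[] payload) {
        for (int i = 0; i + 3 < payload.length; i++) {
            if (payload[i] == '\r' && payload[i + 1] == '\n'
                    && payload[i + 2] == '\r' && payload[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    // Returns a copy of the body bytes, or an empty array if the
    // header/body separator was not found.
    static byte[] body(byte[] payload) {
        int offset = bodyOffset(payload);
        if (offset < 0) {
            return new byte[0];
        }
        return Arrays.copyOfRange(payload, offset, payload.length);
    }

    public static void main(String[] args) {
        // Made-up example payload, for illustration only.
        byte[] payload = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(body(payload), StandardCharsets.ISO_8859_1));
    }
}
```

A real loader would additionally have to handle chunked transfer encoding, compressed bodies, and character set detection, none of which this sketch attempts.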
(Note: this is probably not the right place to discuss this - you can find my e-mail address on my homepage)
I included the parsed-text generation because we already generate these files for our search index (as do many other partner institutions). I agree that in a use case like yours, it makes complete sense to parse out and analyze the content directly from WARC files.
I'm on vacation for the next few days, but will look into your issue/feature request when I'm back.
Have you had a chance to look into the issue of using the textual content of WARC files? We would be interested to know what your plans are in that direction.
Hi,
I'm sorry I haven't had time to take a look at this yet. Was out on vacation and then ended up falling sick for a couple of weeks. Will take a look in the coming days and keep you posted.
Thanks, Vinay
On Mon, Jan 20, 2014 at 2:26 AM, rjoberon notifications@github.com wrote:
Did you already have a chance to look into the issue of using the textual content of WARC files? We would be interested to know what your plans are into that direction.
Thanks!
When running pig with a simple script
I get the error
In my experience with other Pig libraries, this kind of error is caused by a version mismatch; see https://github.com/twitter/hadoop-lzo/issues/56
It seems that ia-hadoop-tools is built against Hadoop 1 (0.20.x), while I am using Hadoop 2, the latest version. The elephant-bird library solved this problem by providing a wrapper library that supports both Hadoop 1 and 2.
Is something planned for ia-hadoop-tools, too?
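For context, one common way such a wrapper can work is to call version-sensitive methods through reflection. Types such as JobContext are classes in Hadoop 1 but interfaces in Hadoop 2, so code compiled against one version fails at link time on the other, whereas a reflective lookup is resolved at run time. The sketch below shows only the general idea using stand-in types. CompatInvoker, OldStyleContext, NewStyleContext and NewStyleContextImpl are hypothetical names, it does not use any Hadoop classes, and it is not elephant-bird's actual code.

```java
import java.lang.reflect.Method;

// Hypothetical illustration of a reflection-based compatibility call.
final class CompatInvoker {

    // Looks up a public no-argument method by name at run time and invokes it,
    // so the caller's bytecode does not depend on whether the declaring type
    // is a class or an interface.
    static Object call(Object target, String methodName) {
        try {
            Method method = target.getClass().getMethod(methodName);
            method.setAccessible(true); // the implementing class may be non-public
            return method.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + methodName, e);
        }
    }

    // Stand-in for a context type that is a concrete class (Hadoop 1 style).
    static class OldStyleContext {
        public String getConfiguration() {
            return "conf-from-class";
        }
    }

    // Stand-in for a context type that is an interface (Hadoop 2 style).
    interface NewStyleContext {
        String getConfiguration();
    }

    static class NewStyleContextImpl implements NewStyleContext {
        @Override
        public String getConfiguration() {
            return "conf-from-interface";
        }
    }

    public static void main(String[] args) {
        // The same call site works for both shapes of the type.
        System.out.println(call(new OldStyleContext(), "getConfiguration"));
        System.out.println(call(new NewStyleContextImpl(), "getConfiguration"));
    }
}
```

The trade-off is that reflective calls lose compile-time checking, so a wrapper of this kind usually confines them to one small helper class.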