lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Alter Scripts to Allow Analysis by Date as well as Month #141

Closed ianmilligan1 closed 9 years ago

ianmilligan1 commented 9 years ago

Current output of most scripts generates monthly data: for example, the plain text extractor UDF produces output beginning like:

201302  cla.ca  HTTP/1.1 200 OK Connection: close Date: Thu, 21 Feb 2013 01:22:12

We should document and provide options so that 201302 above could become 20130221 (or something finer-grained, if we were using this in a multiple-crawls-per-day scenario).

ianmilligan1 commented 9 years ago

Current workaround is to change this line:

foreach a generate SUBSTRING(date, 0, 6)

to

foreach a generate SUBSTRING(date, 0, 8)

That way we grab the first eight characters instead of the first six. This might be the easiest way? If everyone agrees, I can update the wiki accordingly.
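
To illustrate what the two SUBSTRING calls do, here is a rough Python sketch. Slicing `date[:n]` keeps the first n characters, which mirrors `SUBSTRING(date, 0, n)`. It assumes the `date` field is a 14-digit `YYYYMMDDhhmmss` string; the `truncate_date` helper and the `GRANULARITY` table are hypothetical names used only for illustration and are not part of warcbase.

```python
# Hypothetical helper, not part of warcbase.
# Assumption: crawl dates are 14-digit YYYYMMDDhhmmss strings.
GRANULARITY = {"year": 4, "month": 6, "day": 8, "hour": 10, "second": 14}


def truncate_date(date: str, granularity: str = "month") -> str:
    """Keep only the leading characters needed for the requested granularity."""
    return date[:GRANULARITY[granularity]]


stamp = "20130221012212"
print(truncate_date(stamp, "month"))  # 201302
print(truncate_date(stamp, "day"))    # 20130221
```

The longer prefixes ("hour", "second") would be one way to cover the multiple-crawls-per-day case mentioned above.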

jrwiebe commented 9 years ago

That's exactly what must be changed to make the scripts generate a YYYYMMDD string. To give users the option to select month or date output, we could make the scripts accept a parameter (https://wiki.apache.org/pig/ParameterSubstitution). Most of our Pig scripts would need to be changed. Some of the Python scripts I've written to process the output would need to be changed too, which is no big deal.

Jeremy
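
As a sketch of the kind of change a downstream Python script might need, the snippet below accepts keys of either length and rolls them up to months. It is not taken from the actual warcbase scripts: `parse_key` and `monthly_counts` are hypothetical names, and it assumes the first output field is either a `YYYYMM` or a `YYYYMMDD` string.

```python
from collections import Counter
from datetime import date


def parse_key(key: str) -> date:
    """Turn a YYYYMM or YYYYMMDD key into a date (month keys map to day 1)."""
    if len(key) == 6 and key.isdigit():
        return date(int(key[:4]), int(key[4:6]), 1)
    if len(key) == 8 and key.isdigit():
        return date(int(key[:4]), int(key[4:6]), int(key[6:8]))
    raise ValueError(f"expected YYYYMM or YYYYMMDD, got {key!r}")


def monthly_counts(keys):
    """Roll keys of either granularity up to per-month record counts."""
    counts = Counter()
    for key in keys:
        d = parse_key(key)
        counts[f"{d.year:04d}{d.month:02d}"] += 1
    return counts


print(monthly_counts(["20130221", "20130222", "201303"]))
```

Validating the key length up front means a script fails loudly if it is handed output at a granularity it does not expect, rather than silently mis-grouping records.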

(In reply to Ian Milligan's comment above, sent Wed, Aug 12, 2015 at 10:29 AM.)

lintool commented 9 years ago

@ianmilligan1 I'm closing this ticket because we're moving from Pig to Spark and this will all be done more cleanly in Spark... or at least @aliceranzhou will make it so! :)