Closed ianmilligan1 closed 9 years ago
Current workaround is to change this line:
foreach a generate SUBSTRING(date, 0, 6)
to
foreach a generate SUBSTRING(date, 0, 8)
So we grab the first eight characters instead of the first six. This might be the easiest way? If we agree, I can update the wiki accordingly.
That's exactly what must be changed to make the scripts generate a YYYYMMDD string. To give users the option to select monthly or daily output, we could make the scripts accept a parameter (https://wiki.apache.org/pig/ParameterSubstitution). Most of our Pig scripts would need to be changed. Some of the Python scripts I've written to process output would need to be changed too, which is no big deal.
Jeremy
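For the Python side, here is a minimal sketch of how a post-processing script might take the date granularity as an option instead of hard-coding six or eight characters. It assumes crawl dates arrive as 14-digit YYYYMMDDhhmmss strings; the function name and the lookup table are hypothetical and are not taken from the existing scripts.

```python
# Hypothetical helper: these names are illustrative and are not part of warcbase.
# Assumes crawl dates are digit strings of the form YYYYMMDDhhmmss.
GRANULARITY_LENGTHS = {
    "year": 4,
    "month": 6,   # current behaviour, like SUBSTRING(date, 0, 6)
    "day": 8,     # the workaround, like SUBSTRING(date, 0, 8)
    "hour": 10,
    "minute": 12,
    "second": 14,
}


def truncate_crawl_date(timestamp, granularity="month"):
    """Return the leading part of a crawl timestamp at the requested granularity."""
    if granularity not in GRANULARITY_LENGTHS:
        raise ValueError("unknown granularity: %r" % (granularity,))
    length = GRANULARITY_LENGTHS[granularity]
    prefix = timestamp[:length]
    # Reject inputs that are too short or contain non-digit characters.
    if len(prefix) < length or not prefix.isdigit():
        raise ValueError("timestamp too short or not numeric: %r" % (timestamp,))
    return prefix


if __name__ == "__main__":
    print(truncate_crawl_date("20130221123456"))          # month-level key
    print(truncate_crawl_date("20130221123456", "day"))   # day-level key
```

Defaulting to "month" keeps the current behaviour, so existing callers would not need to change; finer settings such as "hour" would cover a multiple-crawls-per-day case.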
@ianmilligan1 I'm closing this ticket because we're moving from Pig to Spark and this will all be done more cleanly in Spark... or at least @aliceranzhou will make it so! :)
Current output of most scripts generates monthly data: for example, the plain text extractor UDF produces output beginning like:
We should document and provide options so that 201302 above could become 20130221 (or beyond, if we were using this in a multiple-crawls-per-day scenario).