Closed ianmilligan1 closed 7 years ago
Right now, I'm using the embarrassingly hackish/lazy script of:
sed -i -- 's/((//g' *
sed -i -- 's/,/ /g' *
sed -i -- 's/)//g' *
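For reference, the same cleanup can be sketched in Python (a hypothetical helper, assuming the Spark output lines look like `((value1,value2),count)`):

```python
def strip_tuple_syntax(line):
    """Mimic the sed pipeline: drop '((', turn commas into
    spaces, drop every ')'. A line like
    '((example.org,201710),5)' becomes 'example.org 201710 5'."""
    line = line.replace("((", "")   # sed 's/((//g'
    line = line.replace(",", " ")   # sed 's/,/ /g'
    return line.replace(")", "")    # sed 's/)//g'
```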
Hi Ian,
Instead of modifying process.py to work with the new format, you could modify the output format of the crawl analytics script to work with the existing process.py.
Something that changes a tuple into a string like
output.map(r => r._1._1 + " " + r._1._2 + " " + r._2).saveAsTextFile(outputFile)
would do.
Thanks @yb1 – far more elegant. 😄 Is Youngbin's idea worth baking into the overall script, @lintool? Your call, but I'm happy to implement it.
Our existing crawl-vis resource uses the output from our old Pig script when running process.py. It wants the raw input to be space-separated fields, but with our move to Spark, our outputs for crawl analytics come out as nested tuples of the form ((value1,value2),count).

This is minor, but it would be good to change process.py to work with the new format.
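Alternatively, process.py could accept both formats directly instead of requiring a pre-cleaned file. A minimal sketch, assuming the fields are (domain, date, count) – the field names here are assumptions, not the actual process.py code:

```python
import re

# Matches Spark output lines like ((example.org,201710),5).
TUPLE_LINE = re.compile(r"^\(\(([^,]+),([^)]+)\),(\d+)\)$")

def parse_line(line):
    """Return (domain, date, count) from either the old
    space-separated Pig format or the new Spark tuple format."""
    m = TUPLE_LINE.match(line.strip())
    if m:
        domain, date, count = m.groups()
    else:
        domain, date, count = line.split()
    return domain, date, int(count)
```

With this, both `((example.org,201710),5)` and `example.org 201710 5` parse to the same triple, so no sed preprocessing would be needed.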