lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Crawl Visualization #243

Closed by ianmilligan1 7 years ago

ianmilligan1 commented 8 years ago

Our existing crawl-vis resource uses the output from our old Pig script when running process.py. It wants the raw input to look like this:

201508  www.davidsuzuki.org 10740
201508  www.liberal.ca  9323
201508  www.greenparty.ca   7465
201508  www.greenparty.ca   6989
201508  www.policyalternatives.ca   6690
201508  www.policyalternatives.ca   6501

But with our move to Spark, our outputs for crawl analytics look like:

((201304,education.alberta.ca),21613)
((201207,education.alberta.ca),16056)
((201301,ubiqcomputing.org),13472)
((201210,education.alberta.ca),12177)
((201301,www.ubiqcomputing.org),11953)
((201301,education.alberta.ca),10219)
((201310,education.alberta.ca),7849)
((201404,education.alberta.ca),6371)
((201410,education.alberta.ca),5605)

This is minor, but it would be good to change process.py to work with the new format.
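As a rough sketch of what that parsing change might look like, the snippet below turns one line of the Spark tuple output into the older date/domain/count layout. The function name `convert_line`, the regular expression, and the tab separator are all assumptions made for illustration; nothing here is taken from the actual process.py, which may read its input differently.

```python
import re

# Matches one line of Spark output, e.g. "((201304,education.alberta.ca),21613)".
# Assumption: the date is always six digits (YYYYMM) and the count is an integer.
TUPLE_LINE = re.compile(r"^\(\((\d{6}),(.+)\),(\d+)\)$")


def convert_line(line, sep="\t"):
    """Return 'date<sep>domain<sep>count' for a Spark tuple line.

    Returns None when the line does not look like a tuple, so callers
    can skip blank or malformed lines. The tab separator is a guess at
    what the old Pig output used.
    """
    match = TUPLE_LINE.match(line.strip())
    if match is None:
        return None
    date, domain, count = match.groups()
    return sep.join((date, domain, count))


if __name__ == "__main__":
    sample = [
        "((201304,education.alberta.ca),21613)",
        "((201301,www.ubiqcomputing.org),11953)",
    ]
    for raw in sample:
        print(convert_line(raw))
```

Matching the whole line, rather than stripping characters one at a time, means a malformed line is reported as None instead of being silently mangled.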

ianmilligan1 commented 8 years ago

Right now, I'm using this embarrassingly hackish/lazy script:

sed -i -- 's/((//g' *
sed -i -- 's/,/ /g' *
sed -i -- 's/)//g' *

youngbink commented 8 years ago

Hi Ian,

Instead of modifying process.py to work with the new format, you could modify the output format of crawl analytics to work with the existing process.py.

Something that changes a tuple into a string like

output.map(r => r._1._1 + " " + r._1._2 + " " + r._2).saveAsTextFile(outputFile)

would do.

ianmilligan1 commented 8 years ago

Thanks @yb1 – far more eloquent. 😄 Is Youngbin's idea worth baking into the overall script, @lintool? Your call, but I'm happy to implement it.