lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Process.py needs to be redone for Spark (designed for Pig) #224

Closed by ianmilligan1 8 years ago

ianmilligan1 commented 8 years ago

Currently, process.py for the d3.js link visualizer works on an input file like so:

200510  acq.osd.mil           acq.osd.mil            96
200510  acq.osd.mil           akss.dau.mil           12
200510  agoracosmopolite.com  agorabookcafe.com      325
200510  agoracosmopolite.com  agoracosmopolitan.com  271
200510  agoracosmopolite.com  agoracosmopolite.com   8319
200510  agoracosmopolite.com  genesmedia.com         325
200510  bloc.org              go.microsoft.com       22
200510  blocpot.qc.ca         blocpot.qc.ca          104

Our new input files look like:

((20160130,globalnews.ca,globalnews.ca),827840)
((20160130,huffingtonpost.ca,huffingtonpost.ca),713363)
((20160129,globalnews.ca,globalnews.ca),409791)
((20160130,huffingtonpost.ca,huffingtonpost.com),396347)
((20160130,theglobeandmail.com,theglobeandmail.com),388364)
((20160130,ottawacitizen.com,ottawacitizen.com),226124)
((20160129,huffingtonpost.ca,huffingtonpost.ca),194062)

process.py needs to be updated.
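
For anyone who still wants rows in the old layout, here is a minimal sketch of one way to read the new `((date,source,target),count)` lines. It is not the actual process.py; the names `parse_line` and `to_tsv` are made up for illustration, and it assumes domains never contain commas or parentheses. The old sample uses six-digit dates and the new one eight-digit dates; the sketch passes the date through unchanged.

```python
import re

# Matches lines such as ((20160130,globalnews.ca,globalnews.ca),827840)
# Assumption: no field contains a comma or a parenthesis.
LINE_RE = re.compile(r"^\(\(([^,()]+),([^,()]+),([^,()]+)\),(\d+)\)$")


def parse_line(line):
    """Return (date, source, target, count) for one line of the new output."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    date, source, target, count = match.groups()
    return date, source, target, int(count)


def to_tsv(line):
    """Render one line of the new output as a tab-separated row (old layout)."""
    date, source, target, count = parse_line(line)
    return f"{date}\t{source}\t{target}\t{count}"
```

Calling `to_tsv` on each line of a part file would give the four tab-separated columns shown in the first sample.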

ianmilligan1 commented 8 years ago

I've created a new branch for testing: https://github.com/lintool/warcbase/tree/fixing-d3js-scripts

jrwiebe commented 8 years ago

process.py shouldn't actually be there; it's a remnant from before I had GraphX working. Instead, once you generate your directories of nodes and links, you combine them using jq as follows:

$ jq -c -n --slurpfile nodes <(cat nodes/part-*) --slurpfile links \
  <(cat links/part-*) '{nodes: $nodes, links: $links}' > graph.json
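
If jq is not available, the same merge can be approximated in Python. This is only a sketch: it assumes each part file holds one JSON value per line (jq's `--slurpfile` is more lenient and accepts any sequence of JSON values), and the function names `slurp`, `build_graph`, and `write_graph` are hypothetical.

```python
import glob
import json


def slurp(pattern):
    """Parse every non-empty line of the files matching `pattern` as JSON."""
    values = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    values.append(json.loads(line))
    return values


def build_graph(nodes_glob="nodes/part-*", links_glob="links/part-*"):
    """Collect all nodes and links into one dict, like the jq filter does."""
    return {"nodes": slurp(nodes_glob), "links": slurp(links_glob)}


def write_graph(out_path, nodes_glob="nodes/part-*", links_glob="links/part-*"):
    """Write the combined graph as compact JSON (comparable to jq -c)."""
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(build_graph(nodes_glob, links_glob), out, separators=(",", ":"))
        out.write("\n")
```

Running `write_graph("graph.json")` from the directory that contains `nodes/` and `links/` should produce a file with the same `nodes` and `links` arrays as the jq command.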

ianmilligan1 commented 8 years ago

OK. Documented here: http://lintool.github.io/warcbase-docs/Spark-Network-Analysis/#visualizing-results-in-a-browser-with-d3js. Closing.