Process.py needs to be redone for Spark (designed for Pig)

ianmilligan1 commented 8 years ago

Currently, process.py for the d3.js link visualizer works on an input file like so:

200510  acq.osd.mil           acq.osd.mil            96
200510  acq.osd.mil           akss.dau.mil           12
200510  agoracosmopolite.com  agorabookcafe.com      325
200510  agoracosmopolite.com  agoracosmopolitan.com  271
200510  agoracosmopolite.com  agoracosmopolite.com   8319
200510  agoracosmopolite.com  genesmedia.com         325
200510  bloc.org              go.microsoft.com       22
200510  blocpot.qc.ca         blocpot.qc.ca          104

Our new input files look like:

((20160130,globalnews.ca,globalnews.ca),827840)
((20160130,huffingtonpost.ca,huffingtonpost.ca),713363)
((20160129,globalnews.ca,globalnews.ca),409791)
((20160130,huffingtonpost.ca,huffingtonpost.com),396347)
((20160130,theglobeandmail.com,theglobeandmail.com),388364)
((20160130,ottawacitizen.com,ottawacitizen.com),226124)
((20160129,huffingtonpost.ca,huffingtonpost.ca),194062)

process.py needs to be updated.
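
As a rough illustration of the gap between the two formats, here is a minimal sketch that turns one line of the new Spark output into a four-column record like the older input. It is not the actual process.py: the function name `spark_to_tsv`, the regular expression, and the choice of a tab as the column separator are all assumptions made for this example. The date is passed through unchanged, so it stays at the day granularity of the Spark output.

```python
import re

# Assumed shape of one Spark output line:
#   ((20160130,globalnews.ca,globalnews.ca),827840)
SPARK_LINE = re.compile(r"^\(\(([^,]+),([^,]+),([^,()]+)\),(\d+)\)$")


def spark_to_tsv(line):
    """Convert one Spark tuple line into 'date<TAB>source<TAB>target<TAB>count'.

    Returns None when the line does not match the assumed shape.
    The name and the tab separator are hypothetical, chosen for this sketch.
    """
    match = SPARK_LINE.match(line.strip())
    if match is None:
        return None
    date, source, target, count = match.groups()
    return "\t".join((date, source, target, count))


# Example call on a line taken from the sample above.
record = spark_to_tsv("((20160130,huffingtonpost.ca,huffingtonpost.com),396347)")
```

Lines that do not fit the pattern come back as None, so a caller can skip or log them rather than fail on unexpected input.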

ianmilligan1 commented 8 years ago

I've created a new branch for testing: https://github.com/lintool/warcbase/tree/fixing-d3js-scripts

jrwiebe commented 8 years ago

process.py shouldn't actually be there; it's a remnant from before I had GraphX working. Instead, once you generate your directories of nodes and links, you combine them using jq as follows:

$ jq -c -n --slurpfile nodes <(cat nodes/part-*) --slurpfile links \
  <(cat links/part-*) '{nodes: $nodes, links: $links}' > graph.json
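
For anyone without jq installed, the same merge can be approximated in Python. This is only a sketch under stated assumptions: it assumes every part file holds one JSON value per line, and the helper names `slurp` and `build_graph` are made up for this example. jq itself parses a stream of JSON values and does not require one per line, so this is narrower than the command above.

```python
import glob
import json


def slurp(pattern):
    """Collect every JSON value from the files matching pattern.

    Assumes one JSON value per non-empty line. Files are read in sorted
    filename order, which approximates the shell's expansion of part-*.
    """
    values = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    values.append(json.loads(line))
    return values


def build_graph(nodes_glob="nodes/part-*", links_glob="links/part-*"):
    """Build the {nodes: [...], links: [...]} object that the jq filter produces."""
    return {"nodes": slurp(nodes_glob), "links": slurp(links_glob)}


def write_graph(path="graph.json"):
    """Write the combined graph as compact JSON, similar to jq -c."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(build_graph(), out, separators=(",", ":"))
```

Calling `write_graph()` from the directory that contains `nodes/` and `links/` should produce a graph.json with the same top-level structure as the jq command.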

ianmilligan1 commented 8 years ago

OK. Documented here http://lintool.github.io/warcbase-docs/Spark-Network-Analysis/#visualizing-results-in-a-browser-with-d3js and closing.

lintool / warcbase

Process.py needs to be redone for Spark (designed for Pig) #224