lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Merge GraphX to master #207

Closed jrwiebe closed 8 years ago

jrwiebe commented 8 years ago

I think the ExtractGraph object is functionally ready to be merged into master, but I'm interested in feedback about how it's implemented and invoked.

At present one would call it like this:

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.matchbox.ExtractGraph

val recs = RecordLoader.loadArchives("/collections/webarchives/CanadianPoliticalParties/warc/", sc)

val graph = ExtractGraph(recs)
// Gives you a resilient distributed Graph whose vertices and edges contain the following:
//      case class VertexData(domain: String, pageRank: Double, inDegree: Int, outDegree: Int)
//      case class EdgeData(date: String, src: String, dst: String)

// Here you have the opportunity to manipulate graph if you like (e.g., subgraph).
// See https://spark.apache.org/docs/latest/graphx-programming-guide.html#summary-list-of-operators

// To write the graph as JSON, do this:
graph.writeAsJson("nodes-dir", "links-dir")

A few thoughts:

All of these things involve only minor changes and are perhaps not very consequential, but it's obviously preferable to settle these basic details before code examples start appearing in our published papers.

ianmilligan1 commented 8 years ago

Having tinkered with this branch quite a bit on rho, my sense is that dynamic PageRank is too much for a single node and is better suited to a cluster (the next test will be to see what kind of impact it has on our results).

I think some sort of parameterization would be useful for dynamic vs. static.

I like the name ExtractGraph for consistency.

greebie commented 8 years ago

I wonder if it would be useful to include some summary information about the network along with the graph results, mostly for interfacing purposes, e.g., average in/out-degree, path length, density, etc. I am thinking of the end user who will be paranoid that their monstrous graph has somehow tapped a wormhole to a distant planet, producing wonky and invalid results. These sorts of measures can act as a checksum.
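To make this concrete, here is a minimal plain-Scala sketch of such sanity-check measures computed over an edge list (no Spark/GraphX dependency; the EdgeData shape mirrors the case class shown earlier, but summarize is a hypothetical helper, not part of warcbase):

```scala
// Hypothetical summary statistics over a simple edge list.
// EdgeData mirrors the case class shown earlier in the thread.
case class EdgeData(date: String, src: String, dst: String)

// Returns (average out-degree, directed-graph density).
def summarize(edges: Seq[EdgeData]): (Double, Double) = {
  val nodes = (edges.map(_.src) ++ edges.map(_.dst)).distinct
  val n = nodes.size.toDouble
  val m = edges.size.toDouble
  val avgOutDegree = if (n > 0) m / n else 0.0
  // Density of a directed graph without self-loops: m / (n * (n - 1))
  val density = if (n > 1) m / (n * (n - 1)) else 0.0
  (avgOutDegree, density)
}

val sample = Seq(
  EdgeData("20080430", "a.com", "b.com"),
  EdgeData("20080430", "b.com", "c.com"),
  EdgeData("20080430", "a.com", "c.com")
)
println(summarize(sample)) // (1.0, 0.5)
```

A wildly implausible density or average degree would be a quick signal that something went wrong upstream.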

Connected and strongly connected component IDs may be more important than PageRank, since that's a common way to produce subgraphs. At the end of the day, I can convert the file to .gml or .graphml and do the heavy analysis in R. PageRank is probably useful for the usability purposes I mentioned above, though.
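As a sketch of that conversion path, a minimal GML serializer could look like this (plain Scala; the Node/Edge shapes are illustrative stand-ins for the JSON records that writeAsJson produces, and toGml is not warcbase code):

```scala
// Minimal GML writer sketch; Node/Edge are illustrative stand-ins
// for the JSON node/link records produced by graph.writeAsJson.
case class Node(id: Int, domain: String)
case class Edge(src: Int, dst: Int)

def toGml(nodes: Seq[Node], edges: Seq[Edge]): String = {
  val nodeLines = nodes.map(n => s"""  node [ id ${n.id} label "${n.domain}" ]""")
  val edgeLines = edges.map(e => s"  edge [ source ${e.src} target ${e.dst} ]")
  (Seq("graph [", "  directed 1") ++ nodeLines ++ edgeLines :+ "]").mkString("\n")
}
```

A file written this way can then be loaded into tools such as Gephi or R's igraph for the heavier analysis.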

I can look into algorithms for other flavors of centrality, although it appears you are relying on GraphX for results, and it may be too expensive to produce new ones.

jrwiebe commented 8 years ago

For now I have parameterized pageRank/staticPageRank and tolerance/iterations. GraphX offers connected- and strongly-connected-components algorithms, which I'll experiment with when I have a chance.

Calls to ExtractGraph now take this form:

val graph = ExtractGraph(recs, dynamic = true, tolerance = 0.0001)

or

val graph = ExtractGraph(recs, dynamic = false, numIter = 4)

You can leave off the third parameter, in which case the default tolerance is 0.001 and the default numIter is 3.
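The default behaviour can be sketched with plain Scala default parameters (GraphSpec below is a hypothetical stand-in for ExtractGraph's argument handling, not warcbase code; the real ExtractGraph operates on Spark RDDs):

```scala
// Hypothetical stand-in showing how the defaults described above behave.
case class GraphSpec(dynamic: Boolean, tolerance: Double = 0.001, numIter: Int = 3)

val dyn     = GraphSpec(dynamic = true, tolerance = 0.0001) // explicit tolerance
val fixed   = GraphSpec(dynamic = false, numIter = 4)       // explicit iteration count
val default = GraphSpec(dynamic = true)                     // tolerance = 0.001, numIter = 3
```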

To combine the JSON lines of the output files you can use jq:

$ jq -c -n --slurpfile nodes <(cat nodes-dir/part-*) --slurpfile links <(cat links-dir/part-*) '{nodes: $nodes, links: $links}' > graph.json

ianmilligan1 commented 8 years ago

Great! Draft documentation available here.

greebie commented 8 years ago

Looks great!

Another small suggestion for the future. If the key "count" for the edges is called "weight" instead, most network programs will automatically accept the value as relevant to a data visualisation (i.e., they will thicken the arrows based on the weighted value). Just a small thing that will save the end user some time.
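To illustrate the rename (a toy Map-based sketch, not the actual JSON-writing code; on the emitted JSON files the same rename could equally be done with jq):

```scala
// Toy illustration of renaming the "count" key to "weight" on an edge record;
// the real records are JSON lines, so this Map is just a stand-in.
val edge = Map[String, Any]("src" -> "a.com", "dst" -> "b.com", "count" -> 5)
val renamed = (edge - "count") + ("weight" -> edge("count"))
println(renamed)
```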