lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Merge GraphX to master #207

Closed jrwiebe closed 8 years ago

jrwiebe commented 8 years ago

I think the ExtractGraph object is functionally ready to be merged into master, but I'm interested in feedback about how it's implemented and invoked.

At present one would call it like this:

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.matchbox.ExtractGraph

val recs = RecordLoader.loadArchives("/collections/webarchives/CanadianPoliticalParties/warc/", sc)

val graph = ExtractGraph(recs)
// Gives you a resilient distributed Graph whose vertices and edges contain the following:
//      case class VertexData(domain: String, pageRank: Double, inDegree: Int, outDegree: Int)
//      case class EdgeData(date: String, src: String, dst: String)

// Here you have the opportunity to manipulate graph if you like (e.g., subgraph).
// See https://spark.apache.org/docs/latest/graphx-programming-guide.html#summary-list-of-operators

// To write the graph as JSON, do this:
graph.writeAsJson("nodes-dir", "links-dir")

A few thoughts:

All of these things involve only minor changes and are perhaps not very consequential, but it's obviously preferable to settle these basic details before code examples start appearing in our published papers.

ianmilligan1 commented 8 years ago

Having tinkered with this branch quite a bit on rho, my sense is that dynamic PageRank is too much for a single node and is better suited to a cluster (the next test will be to see what kind of impact it has on our results).

I think some sort of parameterization would be useful for dynamic vs. static.

I like the name ExtractGraph for consistency.

greebie commented 8 years ago

I wonder if it would be useful to include some summary information about the network along with the graph results, mostly for interfacing purposes, e.g., average in/out-degree, path length, density, etc. I am thinking of the end user who will be paranoid that their monstrous graph has somehow tapped a wormhole to a distant planet, producing wonky and invalid results. These sorts of measures can act as a checksum.
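To make this concrete, here is a minimal plain-Scala sketch of such sanity-check measures computed over an edge list (no Spark/GraphX dependency; the EdgeData shape mirrors the case class shown earlier, but summarize is a hypothetical helper, not part of warcbase):

```scala
// Hypothetical summary statistics over a simple edge list.
// EdgeData mirrors the case class shown earlier in the thread.
case class EdgeData(date: String, src: String, dst: String)

// Returns (average out-degree, directed-graph density).
def summarize(edges: Seq[EdgeData]): (Double, Double) = {
  val nodes = (edges.map(_.src) ++ edges.map(_.dst)).distinct
  val n = nodes.size.toDouble
  val m = edges.size.toDouble
  val avgOutDegree = if (n > 0) m / n else 0.0
  // Density of a directed graph without self-loops: m / (n * (n - 1))
  val density = if (n > 1) m / (n * (n - 1)) else 0.0
  (avgOutDegree, density)
}

val sample = Seq(
  EdgeData("20080430", "a.com", "b.com"),
  EdgeData("20080430", "b.com", "c.com"),
  EdgeData("20080430", "a.com", "c.com")
)
println(summarize(sample)) // (1.0, 0.5)
```

A wildly implausible density or average degree would be a quick signal that something went wrong upstream.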

Connected and strongly connected component IDs may be more important than PageRank, since that's a common way to produce subgraphs. At the end of the day, I can convert the file to .gml or .graphml and do the heavy analysis in R. PageRank is probably useful for the usability purposes I mentioned above, though.
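As a sketch of that conversion path, a minimal GML serializer could look like this (plain Scala; the Node/Edge shapes are illustrative stand-ins for the JSON records that writeAsJson produces, and toGml is not warcbase code):

```scala
// Minimal GML writer sketch; Node/Edge are illustrative stand-ins
// for the JSON node/link records produced by graph.writeAsJson.
case class Node(id: Int, domain: String)
case class Edge(src: Int, dst: Int)

def toGml(nodes: Seq[Node], edges: Seq[Edge]): String = {
  val nodeLines = nodes.map(n => s"""  node [ id ${n.id} label "${n.domain}" ]""")
  val edgeLines = edges.map(e => s"  edge [ source ${e.src} target ${e.dst} ]")
  (Seq("graph [", "  directed 1") ++ nodeLines ++ edgeLines :+ "]").mkString("\n")
}
```

A file written this way can then be loaded into tools such as Gephi or R's igraph for the heavier analysis.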

I can look into algorithms for other flavors of centrality, although it appears you are relying on GraphX for results, and it may be too expensive to produce new ones.

jrwiebe commented 8 years ago

For now I have parameterized pageRank/staticPageRank and tolerance/iterations. GraphX offers connected- and strongly-connected-components algorithms, which I'll experiment with when I have a chance.

Calls to ExtractGraph now take this form:

val graph = ExtractGraph(recs, dynamic = true, tolerance = 0.0001)

or

val graph = ExtractGraph(recs, dynamic = false, numIter = 4)

You can leave off the third parameter, in which case the default tolerance is 0.001 and the default numIter is 3.
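The default behaviour can be sketched with plain Scala default parameters (GraphSpec below is a hypothetical stand-in for ExtractGraph's argument handling, not warcbase code; the real ExtractGraph operates on Spark RDDs):

```scala
// Hypothetical stand-in showing how the defaults described above behave.
case class GraphSpec(dynamic: Boolean, tolerance: Double = 0.001, numIter: Int = 3)

val dyn     = GraphSpec(dynamic = true, tolerance = 0.0001) // explicit tolerance
val fixed   = GraphSpec(dynamic = false, numIter = 4)       // explicit iteration count
val default = GraphSpec(dynamic = true)                     // tolerance = 0.001, numIter = 3
```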

To combine the JSON lines of the output files you can use jq:

$ jq -c -n --slurpfile nodes <(cat nodes-dir/part-*) --slurpfile links <(cat links-dir/part-*) '{nodes: $nodes, links: $links}' > graph.json

ianmilligan1 commented 8 years ago

Great! Draft documentation available here.

greebie commented 8 years ago

Looks great!

Another small suggestion for the future. If the key "count" for the edges is called "weight" instead, most network programs will automatically accept the value as relevant to a data visualisation (i.e., they will thicken the arrows based on the weighted value). Just a small thing that will save the end user some time.
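To illustrate the rename (a toy Map-based sketch, not the actual JSON-writing code; on the emitted JSON files the same rename could equally be done with jq):

```scala
// Toy illustration of renaming the "count" key to "weight" on an edge record;
// the real records are JSON lines, so this Map is just a stand-in.
val edge = Map[String, Any]("src" -> "a.com", "dst" -> "b.com", "count" -> 5)
val renamed = (edge - "count") + ("weight" -> edge("count"))
println(renamed)
```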