ianmilligan1 closed this issue 8 years ago.
This will do what you're asking for:
```scala
import org.warcbase.spark.matchbox.RecordTransformers._
import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArc("/mnt/vol1/data_sets/cpp_arcs/", sc)
  .discardDate(null)
  .keepMimeTypes(Set("text/html"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1.substring(0,6), ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != null && r._3 != null)
  .countItems()
  .filter(r => r._2 > 5)
  .groupBy(_._1._1)
  .flatMap(r => r._2)
  .saveAsTextFile("cpp.sitelinks-groupedByMonth")
```
The critical changes are the addition of `.substring(0,6)` in the `flatMap` (it selects the first six characters of the date string, i.e. YYYYMM) and the `groupBy()`/`flatMap()` pair near the end, which group the data by that month string.
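To see what those two transformations do in isolation, here's a minimal plain-Scala sketch (no Spark required). The crawl date value and the sample domain are made up for illustration; the script above gets the real values from each ARC record:

```scala
object DateAndDomainDemo {
  def main(args: Array[String]): Unit = {
    // Crawl dates come in as YYYYMMDD strings; the first six
    // characters give the YYYYMM month used for grouping.
    val crawlDate = "20070115"              // hypothetical crawl date
    val month = crawlDate.substring(0, 6)
    println(month)                          // 200701

    // The same regex as the script uses to normalize domains:
    // strip an optional leading "www." (after any leading whitespace).
    val domain = "www.greenparty.ca".replaceAll("^\\s*www\\.", "")
    println(domain)                         // greenparty.ca
  }
}
```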
If you're going to try to run this on the cluster with all the ARC files, as I did, you're going to have to fiddle with `--executor-memory`; I kept running out of heap space.
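For reference, a sketch of what that invocation might look like; the jar path and the memory value are placeholders, not what I actually used:

```
spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
  --executor-memory 8G
```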
The results look like this:
```
((200701,policyalternatives.ca,policyalternatives.ca),6234276)
((200701,fairvotecanada.org,fairvotecanada.org),908615)
((200701,greenparty.ca,contact.greenparty.ca),150519)
((200701,conservative.ca,conservative.ca),119375)
((200701,greenparty.ca,secure.greenparty.ca),100371)
((200701,greenparty.ca,ridings.greenparty.ca),100360)
((200701,policyalternatives.ca,),97147)
((200701,policyalternatives.ca,pencilneck.net),96190)
((200701,policyalternatives.ca,raisedeyebrow.com),96190)
((200701,ndp.ca,ndp.ca),89337)
((200701,egale.ca,egale.ca),74661)
((200701,policyalternatives.ca,adobe.com),55703)
((200701,greenparty.ca,community.greenparty.ca),50308)
((200701,greenparty.ca,greenparty.ca),50238)
((200701,greenparty.ca,web.greenparty.ca),50212)
((200701,greenparty.ca,partivert.ca),50176)
((200701,greenparty.ca,validator.w3.org),50171)
((200701,davidsuzuki.org,davidsuzuki.org),40867)
[etc.]
```
BTW, if you don't want the parens, you can add a final `map` that converts each record into a string, something like:

```scala
.map(r => r._1._1 + "\t" + r._1._2 + "\t" + r._1._3 + "\t" + r._2)
```
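That conversion is easy to check locally: the records are pairs of a `(month, source, target)` triple and a count, and the `map` flattens each one into a tab-separated line. A sketch using the same tuple shape on a plain `Seq` (the sample values are taken from the output above):

```scala
object TsvDemo {
  def main(args: Array[String]): Unit = {
    // Records in the shape the pipeline produces:
    // ((YYYYMM, source domain, target domain), link count)
    val records = Seq(
      (("200701", "policyalternatives.ca", "policyalternatives.ca"), 6234276),
      (("200701", "fairvotecanada.org", "fairvotecanada.org"), 908615)
    )

    // The same conversion as the final .map, applied to a local Seq.
    val lines = records.map(r => r._1._1 + "\t" + r._1._2 + "\t" + r._1._3 + "\t" + r._2)
    lines.foreach(println)
  }
}
```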
Our current output of Spark: Analysis of Site Link Structure generates a series of `part-m-0000x` files. Our previous version also generated a series of `part-m-0000x` files; that format had the advantage of being very easy to import into Excel and/or Gephi. More importantly, it aggregated crawls by month (@jrwiebe implemented this, I believe). I think crawl months are more useful (and, potentially, an option for crawl years).
Thoughts on the pros/cons of doing this sort of data transformation? And should it be baked right into the Spark script?