lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Site Link Structure Output, Group by Month? #187

Closed: ianmilligan1 closed this issue 8 years ago

ianmilligan1 commented 8 years ago

Our current output from "Spark: Analysis of Site Link Structure" generates a series of part-m-0000x files like this:

((20080612,liberal.ca,liberal.ca),1832983)
((20060326,ndp.ca,ndp.ca),1801775)
((20060426,ndp.ca,ndp.ca),1771993)
((20060325,policyalternatives.ca,policyalternatives.ca),1735154)

Our previous version generated a series of part-m-0000x files like this:

200603  hc-sc.gc.ca     hc-sc.gc.ca     32
200603  heritagefront.com       canadafirst.net 51
200603  heritagefront.com       canadianfreespeech.com  13
200603  heritagefront.com       freedomsite.org 80
200603  heritagefront.com       heritagefront.com       763
200603  hrw.org hrw.org 132
200603  ican-ncfr.org   ican-ncfr.org   18
200603  impacs.org      impacs.org      38

It had the advantage of being very easy to import into Excel and/or Gephi. More importantly, it aggregated crawls by month (@jrwiebe implemented this, I believe). I think crawl months are more useful (with, potentially, an option for crawl years).

Thoughts on the pros/cons/etc. of doing this sort of data transformation? And should this be baked right into the Spark script?

jrwiebe commented 8 years ago

This will do what you're asking for:

import org.warcbase.spark.matchbox.RecordTransformers._
import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArc("/mnt/vol1/data_sets/cpp_arcs/", sc)
  .discardDate(null) // drop records without a crawl date
  .keepMimeTypes(Set("text/html"))
  // pair each record's crawl date with the links extracted from its HTML
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  // truncate the date to YYYYMM and reduce the source and target URLs to domains, stripping any leading "www."
  .flatMap(r => r._2.map(f => (r._1.substring(0,6), ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != null && r._3 != null)
  // count occurrences of each (month, source, target) triple
  .countItems()
  .filter(r => r._2 > 5)
  // group by the YYYYMM string, then flatten so each month's records sit together
  .groupBy(_._1._1)
  .flatMap(r => r._2)
  .saveAsTextFile("cpp.sitelinks-groupedByMonth")

The critical changes are the addition of .substring(0,6) in the flatMap, which selects the first six characters (YYYYMM) of the date string, and the groupBy()/flatMap pair at the end, which groups the data by that date string.
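If it helps to see those two changes in isolation, here's a minimal spark-shell sketch on made-up tuples (the toy data and variable name are purely illustrative):

val toy = sc.parallelize(Seq(
  (("20080612", "liberal.ca", "liberal.ca"), 5),
  (("20080630", "ndp.ca", "ndp.ca"), 3),
  (("20060326", "ndp.ca", "ndp.ca"), 7)))

toy.map { case ((d, src, dst), n) => ((d.substring(0, 6), src, dst), n) } // "20080612" -> "200806"
  .groupBy(_._1._1) // key on the YYYYMM string
  .flatMap(_._2)    // flatten, so each month's records end up together
  .collect()
  .foreach(println)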

If you're going to run this on the cluster, as I did, over all the ARC files, you'll have to fiddle with --executor-memory; I kept running out of heap space.
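For reference, the invocation would look something like this (the jar path, master, and memory values are placeholders to adapt to your own build and cluster):

spark-shell --jars /path/to/warcbase-fatjar.jar \
  --master yarn-client \
  --executor-memory 8G \
  --driver-memory 8G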

The results look like this:

((200701,policyalternatives.ca,policyalternatives.ca),6234276)
((200701,fairvotecanada.org,fairvotecanada.org),908615)
((200701,greenparty.ca,contact.greenparty.ca),150519)
((200701,conservative.ca,conservative.ca),119375)
((200701,greenparty.ca,secure.greenparty.ca),100371)
((200701,greenparty.ca,ridings.greenparty.ca),100360)
((200701,policyalternatives.ca,),97147)
((200701,policyalternatives.ca,pencilneck.net),96190)
((200701,policyalternatives.ca,raisedeyebrow.com),96190)
((200701,ndp.ca,ndp.ca),89337)
((200701,egale.ca,egale.ca),74661)
((200701,policyalternatives.ca,adobe.com),55703)
((200701,greenparty.ca,community.greenparty.ca),50308)
((200701,greenparty.ca,greenparty.ca),50238)
((200701,greenparty.ca,web.greenparty.ca),50212)
((200701,greenparty.ca,partivert.ca),50176)
((200701,greenparty.ca,validator.w3.org),50171)
((200701,davidsuzuki.org,davidsuzuki.org),40867)
[etc.]
lintool commented 8 years ago

BTW, if you don't want the parens, you can add a final map that converts each record into a tab-delimited string, something like

.map(r => r._1._1 + "\t" + r._1._2 + "\t" + r._1._3 + "\t" + r._2)
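Appended to the pipeline above, that would turn, e.g., the first record in the results into a plain tab-separated line matching the old format:

200701  policyalternatives.ca   policyalternatives.ca   6234276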