ianmilligan1 closed this issue 8 years ago
Currently using this cobbled-together script to do link extraction, but it's not as nice as the original:
My version (for a sample of ten):
```scala
RecordLoader.loadWarc("/Users/ianmilligan1/desktop/local-geocities/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("geocities.com/"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (f._1.replaceAll("^\\s*www\\.", ""), f._2.replaceAll("^\\s*www\\.", ""))))
  .take(10)
```
Old version, much more robust on domains:
```scala
RecordLoader.loadWarc("/Users/ianmilligan1/desktop/local-geocities/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("cpp.sitelinks")
```
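For reference, here is a plain-Scala sketch of what the `countItems` / `filter` steps above do, run on a small in-memory collection (assumption: warcbase's `countItems` tallies identical tuples and sorts by descending frequency; the actual RDD implementation may differ):

```scala
// Made-up sample of (crawl date, source domain, target domain) link tuples.
val links = Seq(
  ("20090808", "geocities.com", "yahoo.com"),
  ("20090808", "geocities.com", "yahoo.com"),
  ("20090808", "geocities.com", "tripod.com")
)

// Count identical tuples, then sort by descending count,
// mimicking countItems() on a local Seq instead of an RDD.
val counts = links
  .groupBy(identity)
  .map { case (link, occurrences) => (link, occurrences.size) }
  .toSeq
  .sortBy { case (_, count) => -count }
```

A `.filter { case (_, count) => count > 5 }` on `counts` would then drop the rare links, just as in the script above.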
FWIW, this is really my first time working in Scala.
If you have any cycles, @jrwiebe, do you want to investigate this?
I'll work on an extraction method that allows for wildcard URL matching (e.g., `http://geocities.com/EnchantedForest/*`), as well as a wildcard-enabled `keepURLs`. Did you have anything else in mind when you said you'd like "more fine-grained link extraction"? I assume you want to be able to generate counts of links from a base URL (like the EnchantedForest community, or maybe specific user pages) to other domains, base URLs, or individual pages.
In the meantime, with a few changes to the second script, you can generate counts of links between specific URLs.
```scala
import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("geocities.com"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .saveAsTextFile("linkcounts/")
```
Note that I removed the `replaceAll` calls, since the regex as written wasn't effective: it looks for strings beginning with "www", while most of the URL strings begin with "http".
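A quick self-contained illustration of that point, using made-up strings: the pattern is anchored at the start of the string, so it strips "www." from a bare hostname but leaves a full URL untouched.

```scala
// The anchored regex from the scripts above.
val pattern = "^\\s*www\\."

// Stripping works on a bare hostname...
val bareHost = "www.geocities.com/EnchantedForest/1003.html"
println(bareHost.replaceAll(pattern, "")) // leading "www." removed

// ...but a full URL starts with "http", so nothing matches.
val fullUrl = "http://www.geocities.com/EnchantedForest/1003.html"
println(fullUrl.replaceAll(pattern, "")) // unchanged
```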
I think my `keepUrlPatterns` function allows you to do what you described, @ianmilligan1. Is there anything else you need on this issue?
```scala
import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("http://geocities.com/EnchantedForest/.*".r))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .saveAsTextFile("linkcounts/")
```

(Note the pattern is a regex, so the glob-style `/*` wildcard becomes `/.*`.)
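Since glob-style wildcards (`*`) and regexes differ, a small helper can convert the wildcard form users type into the regex form `keepUrlPatterns` expects. This is a hypothetical sketch, not part of warcbase; `globToRegex` is an illustrative name:

```scala
import java.util.regex.Pattern

// Hypothetical helper: split on "*", regex-quote the literal pieces,
// and rejoin them with ".*" so each "*" matches any run of characters.
def globToRegex(glob: String): scala.util.matching.Regex =
  glob.split("\\*", -1).map(Pattern.quote).mkString(".*").r

val pat = globToRegex("http://geocities.com/EnchantedForest/*")
```

`pat` then matches any page under the EnchantedForest path while rejecting other neighbourhoods.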
Nope, this is perfect. I think we can close this and #197. Do you want to write it up in the docs?
Will do.
Much of our development with warcbase has focused on using large-scale, multi-domain collections. I'm now working within one big domain (geocities.com). Our link scripts need to be updated accordingly. I think we need a few things:

- `ExtractTopLevelDomain` to allow for limited string matches - i.e., what if I want all outbound links from all pages like so: `http://geocities.com/EnchantedForest/*`

I've taken a stab at the first, but it's rough.
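For context on what the domain-extraction step works with: the sketch below shows one common way to pull a host out of a URL string with the standard library. This is only an assumed illustration of the general technique; warcbase's actual `ExtractTopLevelDomain` implementation may behave differently, and `hostOf` is a made-up name:

```scala
import java.net.URL

// Minimal sketch: parse the URL and return its host,
// or "" when the string is not a well-formed URL.
def hostOf(url: String): String =
  try new URL(url).getHost
  catch { case _: Exception => "" }
```

Limiting matches to a path prefix like `EnchantedForest/` would then mean comparing against more of the URL than just the host, which is what the wildcard matching discussed above provides.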