lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Fine-Tuned Link Extraction within Domains #196

Closed by ianmilligan1 8 years ago

ianmilligan1 commented 8 years ago

Much of our development with warcbase has focused on large-scale, multi-domain collections. I'm now working within one big domain (geocities.com), and our link scripts need to be updated accordingly. I think we need a few things, the first being more fine-grained link extraction.

I've taken a stab at the first, but it's rough.

ianmilligan1 commented 8 years ago

Currently using this cobbled-together script to do link extraction, but it's not as nice as the original:

My version (for a sample of ten):

  RecordLoader.loadWarc("/Users/ianmilligan1/desktop/local-geocities/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("geocities.com/"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (f._1.replaceAll("^\\s*www\\.", ""),f._2.replaceAll("^\\s*www\\.", ""))))
  .take(10)

Old version, much more robust on domains:

RecordLoader.loadWarc("/Users/ianmilligan1/desktop/local-geocities/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("cpp.sitelinks")

FWIW, this is really my first time working in Scala.

If you have any cycles, @jrwiebe, do you want to investigate this?

jrwiebe commented 8 years ago

I'll work on an extraction method that allows for wildcard URL matching (e.g., http://geocities.com/EnchantedForest/*), as well as a wildcard-enabled keepURLs. Did you have anything else in mind when you said you'd like "more fine-grained link extraction"? I assume you want to be able to generate counts of links from a base URL (like the EnchantedForest community, or maybe specific user pages) to other domains, or base URLs, or individual pages.
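
For reference, the core of a wildcard-enabled filter is just a regex match over each record's URL. Here is a minimal standalone sketch; the generic getUrl parameter and free-standing function are illustrative only, since the real method would hang off warcbase's RecordRDD:

import org.apache.spark.rdd.RDD
import scala.util.matching.Regex

// Sketch only: keep records whose URL fully matches any of the supplied patterns.
// The caller supplies getUrl; warcbase's own record type has its own accessor.
def keepUrlPatterns[R](rdd: RDD[R], urlREs: Set[Regex])(getUrl: R => String): RDD[R] =
  rdd.filter(r => urlREs.exists(re => re.pattern.matcher(getUrl(r)).matches()))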

In the meantime, with a few changes to the second script, you can generate counts of links between specific URLs.

import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("geocities.com"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .saveAsTextFile("linkcounts/")

Note that I removed the replaceAll calls, since the regex as written wasn't effective. (It looks for strings beginning with "www", while most of the URL strings begin with "http".)
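
If stripping a leading "www." from full URLs is still wanted, the scheme has to be consumed first. A quick sketch (normalizeUrl is a hypothetical helper, not part of warcbase):

// Strip the scheme and a leading "www." from an absolute http(s) URL.
def normalizeUrl(url: String): String =
  url.trim.replaceAll("^https?://(www\\.)?", "")

normalizeUrl("http://www.geocities.com/EnchantedForest/1234/")
// => "geocities.com/EnchantedForest/1234/"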

jrwiebe commented 8 years ago

I think my keepUrlPatterns function allows you to do what you described, @ianmilligan1. Is there anything else you need on this issue?

import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("/mnt/vol1/data_sets/geocities/warcs/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("http://geocities.com/EnchantedForest/*".r))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .saveAsTextFile("linkcounts/")

ianmilligan1 commented 8 years ago

Nope, this is perfect. I think we can close this and #197. Do you want to write it up in the docs?

jrwiebe commented 8 years ago

Will do.