lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Wildcard support in KeepUrls? #197

Closed: ianmilligan1 closed this issue 8 years ago

ianmilligan1 commented 8 years ago

Should we write an equivalent of keepUrls that supports wildcarding?

For example, in:

RecordLoader.loadWarc("/Users/ianmilligan1/desktop/local-geocities/GEOCITIES-20090808053931-04289-crawling08.us.archive.org.warc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("geocities.com"))
  .keepUrls(Set("*evelien_schillern*"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (f._1.replaceAll("^\\s*www\\.", ""),f._2.replaceAll("^\\s*www\\.", ""))))
  .take(10)

Could a function similar to keepUrls keep only those URLs that contain the string "evelien_schillern" anywhere in them?

jrwiebe commented 8 years ago

Would you prefer glob-style pattern matching, or regex? I'm thinking glob.

ianmilligan1 commented 8 years ago

Good question - I think for our purposes glob would be sufficient (and probably more intuitive).

jrwiebe commented 8 years ago

In the branch keep-wildcard, RecordRDD now contains a method called keepUrlPatterns. It takes a Set of one or more regular expressions that are matched against the URL. Note that the patterns in the set are not simple strings -- the .r method at the end of each string turns it into a regex (a scala.util.matching.Regex).

Example:

import org.warcbase.spark.matchbox.{ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("example.warc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("https://www.greenparty.ca/fr/.*".r))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .saveAsTextFile("links")
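
To illustrate how a Set of regexes can act as a URL filter, here is a minimal plain-Scala sketch that runs without Spark or warcbase. The object `UrlPatternSketch` and the function `keepMatching` are hypothetical names made up for this illustration, and the sketch assumes each pattern must match the whole URL; it is not the actual keepUrlPatterns implementation.

```scala
import scala.util.matching.Regex

object UrlPatternSketch {
  // Hypothetical stand-in for a URL filter; NOT warcbase's implementation.
  // Assumption: a URL is kept when at least one pattern matches the whole URL.
  def keepMatching(urls: Seq[String], patterns: Set[Regex]): Seq[String] =
    urls.filter(url => patterns.exists(p => p.pattern.matcher(url).matches()))

  // Made-up sample URLs, for illustration only.
  val sampleUrls: Seq[String] = Seq(
    "http://geocities.com/evelien_schillern/index.html",
    "http://geocities.com/someone_else/index.html"
  )

  // The "contains this string anywhere" case from the original question,
  // written as a regex with .* on both sides.
  val kept: Seq[String] = keepMatching(sampleUrls, Set(".*evelien_schillern.*".r))
}
```

Under that whole-URL assumption, the wildcard case that opened this issue is just the target string wrapped in `.*` on both sides.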

Despite our comments above, I chose to go with idiomatic regexes. They are more powerful than globs, and still simple if users only need .* as a wildcard. It would be easy enough to switch to glob matching instead.
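
For comparison, glob matching could be layered on top of regexes roughly as follows. This is only a sketch: `GlobSketch`, `globToRegex` and `matches` are hypothetical names that do not exist in warcbase, and only the `*` and `?` wildcards are handled, with every other character treated literally.

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

object GlobSketch {
  // Hypothetical helper, not part of warcbase: translate a glob into a Regex.
  // '*' matches any run of characters, '?' matches exactly one character,
  // and every other character is quoted so that it matches literally.
  def globToRegex(glob: String): Regex = {
    val sb = new StringBuilder
    glob.foreach {
      case '*' => sb.append(".*")
      case '?' => sb.append(".")
      case c   => sb.append(Pattern.quote(c.toString))
    }
    sb.toString.r
  }

  // True when the glob matches the whole URL.
  def matches(glob: String, url: String): Boolean =
    globToRegex(glob).pattern.matcher(url).matches()
}
```

With a helper along these lines, the glob from the original question, `*evelien_schillern*`, would become a regex that could presumably be placed in the Set passed to keepUrlPatterns. Quoting the literal characters also keeps the dots in a URL from acting as regex wildcards.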

ianmilligan1 commented 8 years ago

Similarly to #196, this looks great - let's document and get it into master?