Closed ianmilligan1 closed 8 years ago
Would you prefer glob-style pattern matching, or regex? I'm thinking glob.
Good question - I think for our purposes glob would be sufficient (and probably more intuitive).
In the branch keep-wildcard, RecordRDD now contains a method called keepUrlPatterns. It takes a Set of one or more regular expressions that are matched against the URL. Note that the patterns in the set are not plain strings: the .r method at the end of each string turns it into a regex.
Example:
import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._
RecordLoader.loadWarc("example.warc.gz", sc)
.keepValidPages()
.keepUrlPatterns(Set("https://www.greenparty.ca/fr/.*".r))
.map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, f._1, f._2)))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.saveAsTextFile("links")
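For anyone unfamiliar with `.r`: it is the standard Scala method that compiles a string into a `scala.util.matching.Regex`. Below is a minimal sketch of how a set of such patterns could be tested against a URL. `matchesAny` is a hypothetical helper written for illustration, not the code in the branch, and it assumes each pattern has to match the whole URL.

```scala
import scala.util.matching.Regex

object UrlPatternSketch {
  // Hypothetical helper (not the branch code): true if the whole URL
  // matches at least one of the patterns.
  def matchesAny(url: String, patterns: Set[Regex]): Boolean =
    patterns.exists(p => p.pattern.matcher(url).matches())

  // .r compiles each string into a Regex.
  val frenchPages: Set[Regex] = Set("https://www.greenparty.ca/fr/.*".r)

  // Under the whole-URL assumption, a leading and trailing .* turn the
  // pattern into a "contains this string anywhere" match.
  val containsName: Set[Regex] = Set(".*evelien_schillern.*".r)
}
```

If the match is in fact a whole-URL match, the second pattern is how a "URL contains this string" filter would be written; if it is a partial match, the surrounding `.*` is redundant but harmless.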
Despite our comments above, I chose to go with idiomatic regex. It's more powerful, and simple enough if users are just matching with .* as a wildcard. It would be straightforward to do glob matching instead.
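If glob matching is wanted later, one possible approach is to translate the glob into a regex and reuse the same code path. The sketch below is only an illustration: `globToRegex` is a hypothetical name, and it handles just `*` (any run of characters) and `?` (any single character), treating every other character as a literal.

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

object GlobSketch {
  // Hypothetical helper: converts a simple glob into a Regex.
  // Only '*' and '?' are special; everything else is quoted, so
  // characters such as '.' match themselves literally.
  def globToRegex(glob: String): Regex = {
    val sb = new StringBuilder
    glob.foreach {
      case '*' => sb.append(".*")
      case '?' => sb.append(".")
      case c   => sb.append(Pattern.quote(c.toString))
    }
    sb.toString.r
  }
}
```

With this in place a user could write `GlobSketch.globToRegex("https://www.greenparty.ca/fr/*")` rather than a regex by hand. Unlike the hand-written pattern, the dots in the host name would then only match literal dots.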
Similarly to #196, this looks great - let's document and get it into master?
Should we write an equivalent of keepUrls that supports wildcarding? i.e. in:
Could a function similar to keepUrls there keep only URLs that contain the string "evelien_schillern" (in any capacity)?