Closed by ianmilligan1 8 years ago
Looks doable.
Cool - I've been able to call it in Spark Notebook as a complement to current scripts, but a quick way to filter on language would be a real boon.
Done. Note that language detection is more resource intensive than getting the mimetype, date, URL, or domain of a record, so if you are also filtering by these characteristics, do those first.
Sample keepLanguages usage:
import org.warcbase.spark.matchbox.{RecordLoader, RemoveHTML}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("/path/to/warc", sc)
  .keepValidPages()
  .keepDomains(Set("greenparty.ca"))
  .keepLanguages(Set("fr"))
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("out-fr/")
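To illustrate the ordering advice above (cheap filters first, language detection last), here is a minimal sketch in plain Scala with no Spark or warcbase dependency. Everything in it is an assumption made up for the illustration: the `Page` case class, the toy `detectLanguage` function, and the call counter are hypothetical and are not warcbase's actual record type or detector. It only shows that a record rejected by an earlier filter never reaches the later, costlier one.

```scala
object FilterOrderSketch {
  // Hypothetical stand-in for an archived record; not warcbase's record type.
  final case class Page(domain: String, text: String)

  // Counts how often the (pretend) expensive detector runs.
  var detectorCalls = 0

  // Toy detector, purely illustrative; real language detection costs far more.
  def detectLanguage(text: String): String = {
    detectorCalls += 1
    if (text.contains("bonjour")) "fr" else "en"
  }

  val pages: Seq[Page] = Seq(
    Page("greenparty.ca", "bonjour tout le monde"),
    Page("greenparty.ca", "hello everyone"),
    Page("example.com", "bonjour encore"),
    Page("example.com", "hello again")
  )

  // Cheap domain check first, so the detector only sees pages that passed it.
  def cheapFirst(): (Seq[Page], Int) = {
    detectorCalls = 0
    val kept = pages
      .filter(p => p.domain == "greenparty.ca")
      .filter(p => detectLanguage(p.text) == "fr")
    (kept, detectorCalls)
  }

  // Detector first: same pages kept, but every page goes through the detector.
  def expensiveFirst(): (Seq[Page], Int) = {
    detectorCalls = 0
    val kept = pages
      .filter(p => detectLanguage(p.text) == "fr")
      .filter(p => p.domain == "greenparty.ca")
    (kept, detectorCalls)
  }

  def main(args: Array[String]): Unit = {
    println(cheapFirst())     // kept pages plus number of detector calls
    println(expensiveFirst())
  }
}
```

Both orderings keep the same pages; the difference is only in how many records the detector has to process, which is why putting keepDomains (or a mimetype, date, or URL filter) ahead of keepLanguages should be cheaper.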
Great, thanks @jrwiebe! Will write up and document over in warcbase-docs.
Under Pig, we had this script:

Could/should we bake our DetectLanguage function into RecordRDD.scala - i.e. alongside keepMimeTypes or keepDomains? That would fit best with our plain text scripts, so for example this could work (added a fictional line after keepDomains and before the map function).

I created a new branch, keep-languages, but realize I should check if this is feasible first (since we have largely used RecordRDD for URL and mime-type filtering). If anybody has time or ability to tackle this, please use that branch.
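As a rough idea of what a language filter sitting alongside keepDomains could look like, here is a minimal sketch of the enrichment pattern over a plain `Seq` rather than an RDD. This is not the code in RecordRDD.scala: the `Rec` case class, the `RecOps` wrapper, and the toy `detectLanguage` are all hypothetical names invented for the illustration, and the real detector and record type will differ.

```scala
object KeepLanguagesSketch {
  // Hypothetical minimal record; the real record type has more fields.
  final case class Rec(domain: String, content: String)

  // Placeholder detector (assumption): returns a two-letter language code.
  def detectLanguage(content: String): String =
    if (content.toLowerCase.contains("bonjour")) "fr" else "en"

  // Implicit wrapper adding chainable filters, here over Seq instead of an RDD.
  implicit class RecOps(val recs: Seq[Rec]) {
    // Keep records whose domain is in the given set.
    def keepDomains(domains: Set[String]): Seq[Rec] =
      recs.filter(r => domains.contains(r.domain))

    // Keep records whose detected language is in the given set.
    def keepLanguages(langs: Set[String]): Seq[Rec] =
      recs.filter(r => langs.contains(detectLanguage(r.content)))
  }

  // Chains the two filters the way the plain text scripts chain theirs.
  def demo(): Seq[Rec] =
    Seq(
      Rec("greenparty.ca", "Bonjour tout le monde"),
      Rec("greenparty.ca", "Hello everyone"),
      Rec("other.org", "Bonjour encore")
    ).keepDomains(Set("greenparty.ca"))
      .keepLanguages(Set("fr"))

  def main(args: Array[String]): Unit =
    demo().foreach(println)
}
```

Because each filter returns the same collection type it receives, a language filter written this way can be dropped into an existing chain between the domain filter and the map step without changing anything else.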