lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Translate `DetectLanguage` pig script into Scala; Incorporate into RecordRDD? #190

Closed: ianmilligan1 closed this issue 8 years ago

ianmilligan1 commented 8 years ago

Under Pig, we had this script:

register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE DetectLanguage org.warcbase.pig.piggybank.DetectLanguage();
DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
DEFINE ExtractTopLevelDomain org.warcbase.pig.piggybank.ExtractTopLevelDomain();

raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader as
  (url: chararray, date: chararray, mime: chararray, content: bytearray);

a = filter raw by mime == 'text/html' and date is not null;
b = foreach a generate SUBSTRING(date, 0, 6) as date,
                       REPLACE(ExtractTopLevelDomain(url), '^\\s*www\\.', '') as url, content;
c = filter b by url == 'greenparty.ca';
d = foreach c generate date, url, ExtractRawText((chararray) content) as text;
e = foreach d generate date, url, DetectLanguage(text) as lang, text;

store e into 'cpp.text-greenparty';

Could/should we bake our DetectLanguage function into RecordRDD.scala, i.e. alongside keepMimeTypes or keepDomains? That would fit best with our plain-text scripts. For example, this could work (I added a fictional keepLanguages line after keepDomains and before the map function):

import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArc("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("greenparty.ca"))
  .keepLanguages(Set("en"))
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("out/")

I created a new branch, keep-languages, but I realize I should check whether this is feasible first, since we have largely used RecordRDD for URL and MIME-type filtering. If anybody has the time or ability to tackle this, please use that branch.
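
One way such a filter could be wired in is as an enrichment method sitting next to keepDomains. The sketch below is self-contained and does not use Spark or warcbase: `Page`, `PageOps`, and the stopword-based `detectLanguage` are hypothetical stand-ins, and a plain `Seq` takes the place of the RDD. It only shows the shape of the idea, not how RecordRDD is or would be written.

```scala
object KeepLanguagesSketch {
  // Hypothetical stand-in for an archive record; not the warcbase record type.
  final case class Page(url: String, domain: String, text: String)

  // Toy stopword lists, a placeholder for a real language detector.
  private val stopwords: Map[String, Set[String]] = Map(
    "en" -> Set("the", "and", "of", "is", "to"),
    "fr" -> Set("le", "la", "et", "les", "des", "est")
  )

  // Returns the language whose stopwords occur most often, or "unknown"
  // when no stopword from any list is present.
  def detectLanguage(text: String): String = {
    val tokens = text.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty)
    val scores = stopwords.map { case (lang, words) =>
      lang -> tokens.count(words.contains)
    }
    val (best, score) = scores.maxBy(_._2)
    if (score == 0) "unknown" else best
  }

  // Adds keepDomains / keepLanguages to any Seq[Page], mirroring the
  // chained-filter style used in the scripts in this thread.
  implicit class PageOps(pages: Seq[Page]) {
    def keepDomains(domains: Set[String]): Seq[Page] =
      pages.filter(p => domains.contains(p.domain))

    def keepLanguages(langs: Set[String]): Seq[Page] =
      pages.filter(p => langs.contains(detectLanguage(p.text)))
  }
}
```

With `import KeepLanguagesSketch._` in scope, `pages.keepDomains(Set("greenparty.ca")).keepLanguages(Set("fr"))` reads the same way as the proposed chain above.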

jrwiebe commented 8 years ago

Looks doable.

ianmilligan1 commented 8 years ago

Cool - I've been able to call it in Spark Notebook as a complement to current scripts, but a quick way to filter on language would be a real boon.

jrwiebe commented 8 years ago

Done. Note that language detection is more resource intensive than getting the mimetype, date, URL, or domain of a record, so if you are also filtering by these characteristics, do those first.
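
To illustrate why the order matters, here is a minimal sketch on a plain Scala `Seq` rather than an RDD. `Page`, the sample data, and the `detectLanguage` stand-in are all hypothetical; the counter simply records how many times the pretend-expensive detector runs under each ordering.

```scala
import java.util.concurrent.atomic.AtomicInteger

object FilterOrderSketch {
  // Hypothetical stand-in for an archive record.
  final case class Page(domain: String, text: String)

  // Counts how often the (pretend) expensive detector is invoked.
  val detectorCalls = new AtomicInteger(0)

  // Hypothetical stand-in for a costly language detector.
  def detectLanguage(text: String): String = {
    detectorCalls.incrementAndGet()
    if (text.toLowerCase.split("\\s+").contains("le")) "fr" else "en"
  }

  val pages: Seq[Page] = Seq(
    Page("greenparty.ca", "le parti vert"),
    Page("greenparty.ca", "the green party"),
    Page("example.com", "le site"),
    Page("example.com", "the site"),
    Page("example.org", "another page")
  )

  // Cheap domain check first: the detector only sees the surviving pages.
  def cheapFirst(): (Seq[Page], Int) = {
    detectorCalls.set(0)
    val kept = pages
      .filter(_.domain == "greenparty.ca")
      .filter(p => detectLanguage(p.text) == "fr")
    (kept, detectorCalls.get)
  }

  // Expensive check first: the detector runs on every page.
  def expensiveFirst(): (Seq[Page], Int) = {
    detectorCalls.set(0)
    val kept = pages
      .filter(p => detectLanguage(p.text) == "fr")
      .filter(_.domain == "greenparty.ca")
    (kept, detectorCalls.get)
  }
}
```

Both orderings keep the same page, but the detector runs on two pages in `cheapFirst` and on all five in `expensiveFirst`.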

Sample keepLanguages usage:

import org.warcbase.spark.matchbox.{RecordLoader, RemoveHTML}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("/path/to/warc", sc)
  .keepValidPages()
  .keepDomains(Set("greenparty.ca"))
  .keepLanguages(Set("fr"))
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("out-fr/")

ianmilligan1 commented 8 years ago

Great, thanks @jrwiebe! Will write up and document over in warcbase-docs.