Closed ianmilligan1 closed 8 years ago
Keeping this live, as in #185; they might be related. We should make sure to update the docs every time we change the API, or things get entangled (at least for this historian). :smile:
Just redid the "Extracting Plain Text" walkthrough to reflect some changes in the API, and tested all three scripts. I think we're good to go on this.
Before I put this into the docs, is this a correct use of `ExtractBoilerpipeText`? It appears to be, although it leaves `list()` in the output when the result should be null.
```scala
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArc("/home/i2millig/WAHR/sample-data/arc-warc/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("greenparty.ca"))
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, ExtractBoilerpipeText(r.getContentString)))
  .saveAsTextFile("/home/i2millig/test-output4")
```
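One hedged workaround for the stray `list()` entries would be to filter out records whose extracted text is empty before saving. This is only a sketch of the idea, using plain Scala collections to stand in for the Spark RDD; the sample tuples and the `FilterEmptyText` object name are made up for illustration, not part of the warcbase API.

```scala
// Hypothetical sketch: drop records whose boilerpipe-extracted text is empty,
// so blank entries never reach saveAsTextFile. Plain Scala collections stand
// in for the RDD; the sample data below is invented for illustration.
object FilterEmptyText {
  def main(args: Array[String]): Unit = {
    val records = Seq(
      ("20060622", "greenparty.ca", "http://greenparty.ca/", "Some extracted text"),
      ("20060622", "greenparty.ca", "http://greenparty.ca/empty", "")
    )
    // Keep only tuples whose fourth field (the extracted text) is non-empty,
    // mirroring a .filter(...) step placed before .saveAsTextFile(...).
    val nonEmpty = records.filter { case (_, _, _, text) => text.nonEmpty }
    nonEmpty.foreach(println)
  }
}
```

On a real RDD the equivalent step would be a `.filter` with the same predicate inserted between the `.map` and the save.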
Am getting the hang of this. This book on Scala is proving very useful, and I'll put a link to it in our docs too.
Looks reasonable.
Great, incorporated here.
I've just redone the "Analysis of Site Link Structure" walkthrough in the docs to account for our API revisions. Currently, all these scripts will crash because they refer to deprecated code.
Will double-check that it works, and can do others, such as http://lintool.github.io/warcbase-docs/Spark-Extracting-Domain-Level-Plain-Text/.