lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Redo Documentation to Account for getContentString, getContentBytes, etc. #184

Closed ianmilligan1 closed 8 years ago

ianmilligan1 commented 8 years ago

I've just redone the "Analysis of Site Link Structure" walkthrough in the docs to account for our API revisions. Currently, all these scripts will crash because they refer to deprecated code.

I will double-check that it works, and I can then redo the others, such as http://lintool.github.io/warcbase-docs/Spark-Extracting-Domain-Level-Plain-Text/.

ianmilligan1 commented 8 years ago

Keep live as in #185 - might be related. We should make sure to update docs every time we change the API, or things get entangled (at least for this historian). :smile:

ianmilligan1 commented 8 years ago

Just redid Extracting Plain Text to reflect some changes in the API, and tested all three scripts. I think we're good to go on this.

ianmilligan1 commented 8 years ago

Before I put this into the documents, is this a correct use of ExtractBoilerpipeText? It appears to be, although it leaves list() when the output should be null.

import org.warcbase.spark.matchbox.{RecordLoader, ExtractBoilerpipeText}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArc("/home/i2millig/WAHR/sample-data/arc-warc/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("greenparty.ca"))
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, ExtractBoilerpipeText(r.getContentString)))
  .saveAsTextFile("/home/i2millig/test-output4")
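If the empty results turn out to be unwanted in the output, one option might be to drop them before saving. The following is only a minimal sketch in plain Scala, without Spark or warcbase on the classpath. It assumes the fourth tuple field is a String that may be empty, which this thread does not confirm; the `Row` alias, `hasText`, `keepWithText`, and the sample rows are all hypothetical names and made-up data for illustration.

```scala
object FilterEmptyText {
  // Hypothetical shape of one record after the map step:
  // (crawl date, domain, URL, extracted text).
  type Row = (String, String, String, String)

  // True when the extracted-text field holds something other than whitespace.
  // Assumption: the field is a String; a null is treated as "no text".
  def hasText(row: Row): Boolean =
    row._4 != null && row._4.trim.nonEmpty

  // Keep only the rows that carry extracted text.
  def keepWithText(rows: Seq[Row]): Seq[Row] = rows.filter(hasText)

  // Made-up sample rows, not taken from any real archive.
  val sampleRows: Seq[Row] = Seq(
    ("20060622", "example.org", "http://example.org/a", "Some body text"),
    ("20060622", "example.org", "http://example.org/b", ""),
    ("20060622", "example.org", "http://example.org/c", "   ")
  )

  def main(args: Array[String]): Unit =
    keepWithText(sampleRows).foreach(println)
}
```

On an RDD of the same tuples, the same predicate could in principle be passed to `.filter(...)` between the `.map(...)` and `.saveAsTextFile(...)` calls, though that has not been tested against the script above.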

Am getting the hang of this. This book on Scala is proving very useful, and I'll put a link to it in our docs too.

lintool commented 8 years ago

Looks reasonable.

ianmilligan1 commented 8 years ago

Great, incorporated here.