lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Adding keepContent to warcbase #228

Closed ianmilligan1 closed 8 years ago

ianmilligan1 commented 8 years ago

@jrwiebe has developed keepContent() and discardContent() methods for filtering records by their text content. I have tested keepContent() with the following script:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/path/to/warc", sc)
  .keepValidPages()
  .keepContent(Set("guestbooks".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("out-guestbooks/")

This feature is in high demand, so I'm opening this pull request to merge it all together.
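For anyone reading along, here is a rough sketch of the filtering semantics I'd expect from the two methods, written against plain Scala collections rather than Spark so it runs without a cluster. The `Page` case class and the `ContentFilters` object are hypothetical stand-ins, not warcbase code, and I'm assuming the patterns match anywhere in the content (substring-style) instead of requiring a full match; the actual implementation may differ.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for an archive record; not the real warcbase type.
final case class Page(url: String, content: String)

object ContentFilters {
  // True if any pattern is found somewhere in the text (assumed matching rule).
  private def matchesAny(text: String, patterns: Set[Regex]): Boolean =
    patterns.exists(_.findFirstIn(text).isDefined)

  // Assumed keepContent behaviour: retain pages whose content matches a pattern.
  def keepContent(pages: Seq[Page], patterns: Set[Regex]): Seq[Page] =
    pages.filter(p => matchesAny(p.content, patterns))

  // Assumed discardContent behaviour: drop pages whose content matches a pattern.
  def discardContent(pages: Seq[Page], patterns: Set[Regex]): Seq[Page] =
    pages.filterNot(p => matchesAny(p.content, patterns))
}

object Demo {
  // Made-up sample data for illustration only.
  val pages: Seq[Page] = Seq(
    Page("http://example.com/a", "Sign our guestbooks today"),
    Page("http://example.com/b", "Nothing relevant here")
  )
  val patterns: Set[Regex] = Set("guestbooks".r)

  def main(args: Array[String]): Unit = {
    println(ContentFilters.keepContent(pages, patterns).map(_.url))
    println(ContentFilters.discardContent(pages, patterns).map(_.url))
  }
}
```

Under these assumptions the two methods are complements: for a given set of patterns, every page ends up in exactly one of the two results.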

ianmilligan1 commented 8 years ago

Just running Travis CI checks.