Closed: ianmilligan1 closed this 8 years ago
@jrwiebe has developed keepContent() and discardContent() commands for filtering records by their text content. I have tested them with the following script:
    import org.warcbase.spark.matchbox._
    import org.warcbase.spark.rdd.RecordRDD._

    val r = RecordLoader.loadArchives("/path/to/warc", sc)
      .keepValidPages()
      .keepContent(Set("guestbooks".r))
      .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
      .saveAsTextFile("out-guestbooks/")
This is a much in-demand feature, so I'm opening this pull request to merge it all together.
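For anyone reading along, here is a rough, self-contained sketch of the filtering behaviour I would expect from the two commands. It runs on plain Scala collections, not on warcbase or Spark. The `Page` case class and both helper functions are hypothetical stand-ins, and the "keep if any pattern is found anywhere in the content" semantics is an assumption on my part, not a copy of @jrwiebe's implementation.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for an archive record; NOT warcbase's record type.
case class Page(url: String, content: String)

// Assumed semantics: a page matches if any pattern is found anywhere in its content.
def matchesAny(page: Page, patterns: Set[Regex]): Boolean =
  patterns.exists(_.findFirstIn(page.content).isDefined)

// Illustrative analogue of keepContent(): keep only the matching pages.
def keepContent(pages: Seq[Page], patterns: Set[Regex]): Seq[Page] =
  pages.filter(p => matchesAny(p, patterns))

// Illustrative analogue of discardContent(): drop the matching pages.
def discardContent(pages: Seq[Page], patterns: Set[Regex]): Seq[Page] =
  pages.filterNot(p => matchesAny(p, patterns))

// Made-up sample data, for illustration only.
val pages = Seq(
  Page("http://example.com/a", "Sign our guestbooks today"),
  Page("http://example.com/b", "Nothing to see here")
)

val kept = keepContent(pages, Set("guestbooks".r))
val dropped = discardContent(pages, Set("guestbooks".r))
```

With this sample data, `kept` holds only the first page and `dropped` holds only the second, so the two filters are complements of each other for a given pattern set.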
Just running Travis CI checks.