Selecting Pages that Contain Certain Keywords

ianmilligan1 commented 8 years ago

We've run into this research question twice now, both within our team and also from another researcher who e-mailed out of the blue.

With a corpus of WARC files, how can we find all pages containing a given keyword in order to generate a sub-corpus? For example, a plain text corpus of all pages containing the keyword "hamburgers", or pages containing any of the terms "cheese", "pickle", "tomato", etc.

Right now, we're doing it with grep after plain text extraction, but baking this into warcbase and enhancing our Extracting Plain Text functions would be good. It would be great to have some Spark scripts allowing for this.

jrwiebe commented 8 years ago

I've added this to the branch filter-content. There are two methods, keepContent() and discardContent(). Each takes a set of Regexes, in the style of keepUrlPatterns(). For instance, to use your example:

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val r = RecordLoader.loadArchives("/path/to/warcs/")
.keepValidPages()
.keepContent(Set("cheese".r, "pickle".r, "tomato".r))
.map(r => r.getUrl)
.take(3)

It works as you requested: a page is a match if it contains any of the search terms specified. discardContent() likewise discards a page if it matches any of the terms.
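As a rough illustration of that matching rule, here is a sketch in plain Scala collections rather than the warcbase implementation. The helper matchesAny, the example URLs, and the page texts are all made up for this example, and it assumes a pattern counts as a match when it occurs anywhere in the page text:

```scala
import scala.util.matching.Regex

// Hypothetical helper (not part of warcbase): true if any pattern
// occurs anywhere in the text.
def matchesAny(text: String, patterns: Set[Regex]): Boolean =
  patterns.exists(p => p.findFirstIn(text).isDefined)

val patterns = Set("cheese".r, "pickle".r, "tomato".r)

// Made-up (url, plain text) pairs standing in for archive records.
val pages = Seq(
  "http://example.com/a" -> "a burger with cheese",
  "http://example.com/b" -> "plain fries"
)

// keepContent-style filter: retain pages matching at least one pattern.
val afterKeep = pages.filter { case (_, text) => matchesAny(text, patterns) }

// discardContent-style filter: drop pages matching at least one pattern.
val afterDiscard = pages.filterNot { case (_, text) => matchesAny(text, patterns) }
```

With this input, the keep-style filter retains only the first page and the discard-style filter retains only the second.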

The search words are all regular expressions, hence the .r suffix, so they can be much more complex than the example.
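To give a sense of what "more complex" could look like, here is a small sketch using only the standard scala.util.matching.Regex class. The patterns and sample strings are illustrative assumptions, not something taken from the branch:

```scala
import scala.util.matching.Regex

// (?i) makes the match case-insensitive; \b anchors on word boundaries,
// so "hamburger" inside a longer word does not count.
val burger: Regex = """(?i)\bhamburgers?\b""".r

// A single alternation covering several terms and their plurals.
val toppings: Regex = """(?i)\b(cheese|pickles?|tomato(es)?)\b""".r

val hit = burger.findFirstIn("Best HAMBURGER in town").isDefined
val miss = burger.findFirstIn("hamburgerology").isDefined
val topping = toppings.findFirstIn("Extra Pickles please")
```

Patterns like these should be usable in the same way as the simple ones, e.g. Set(burger, toppings), assuming keepContent() accepts any Set of Regex values as described above.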

Before merging this we might want to consider the method names (I used keepContent as a partial parallel to getContentString and getContentBytes). Also, are you satisfied using regular expressions, or does this seem too complicated?

ianmilligan1 commented 8 years ago

My $0.02: I like keepContent as it fits with our current language. And I think regular expressions sound good. If people just want simple keyword matching, what you have above works, and if they want to do amazing regex-fu, we've got them covered.

I'm happy to see this included. Fantastic work!

ianmilligan1 commented 8 years ago

OK, testing this branch on camalon:

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val r = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/")
.keepValidPages()
.keepContent(Set("guestbooks".r))
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("guestbooks-text-geocities/")

ianmilligan1 commented 8 years ago

Merged with #228!

lintool / warcbase

Selecting Pages that Contain Certain Keywords #202