lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/
161 stars 47 forks

Make keepValidPages a bit smarter #163

Closed (lintool closed this issue 8 years ago)

lintool commented 8 years ago

keepValidPages currently just does this:

rdd.discardDate(null).keepMimeTypes(Set("text/html"))
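
To make the current behavior concrete, here is a minimal sketch of what that one-liner appears to do, written against a plain Scala `Seq` rather than a Spark RDD. The `Record` case class, its field names, and the standalone helper signatures are assumptions made for illustration, not warcbase's actual types. The sketch also assumes `discardDate(null)` drops records with a missing crawl date and `keepMimeTypes` keeps only records whose MIME type is in the given set.

```scala
// Hypothetical stand-in for an archive record; not warcbase's real record type.
final case class Record(url: String, crawlDate: String, mimeType: String)

// Drop every record whose crawl date equals `date`.
// Passing null therefore drops records with no crawl date (Scala's != is null-safe).
def discardDate(records: Seq[Record], date: String): Seq[Record] =
  records.filter(r => r.crawlDate != date)

// Keep only records whose MIME type is in the allowed set.
def keepMimeTypes(records: Seq[Record], mimeTypes: Set[String]): Seq[Record] =
  records.filter(r => mimeTypes.contains(r.mimeType))

// Mirrors the one-liner above: dated records served as text/html.
def keepValidPages(records: Seq[Record]): Seq[Record] =
  keepMimeTypes(discardDate(records, null), Set("text/html"))

// Illustrative sample data (made up for this sketch).
val sample: Seq[Record] = Seq(
  Record("http://example.org/", "20150101", "text/html"),
  Record("http://example.org/logo.png", "20150101", "image/png"),
  Record("http://example.org/undated", null, "text/html")
)
```

On this sample, `keepValidPages(sample)` keeps only the first record: the image is rejected by the MIME type check and the undated page by the date check. Nothing else about the record (its URL or status, for instance) is examined, which is the limitation this issue is about.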

Can we make it a bit smarter? For example:

aliceranzhou commented 8 years ago

marking as done