lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Add Spark RDD keepValidPages transformation #156

Closed lintool closed 8 years ago

lintool commented 8 years ago

We start off most of our scripts like this:

val r = RecordLoader.loadArc("/path/to/files", sc)
  .keepMimeTypes(Set("text/html"))
  .discardDate(null)

This has basically become an idiom. We should write a keepValidPages transformation that combines keepMimeTypes and discardDate. We can then make it a little smarter:
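Before any of the smarter behavior, the basic combination might look roughly like the sketch below. It is not the warcbase implementation: `Record`, `ValidPages`, and `RecordOps` are hypothetical stand-ins, and a plain `Seq` is used in place of a Spark RDD so that the example runs without a cluster. It assumes that `discardDate(null)` means "drop records whose crawl date is null" and that `keepMimeTypes(Set("text/html"))` means "keep only records with that MIME type".

```scala
// Hypothetical stand-in for an archive record; only the two fields
// the idiom above looks at are modeled here.
final case class Record(mimeType: String, crawlDate: String)

object ValidPages {
  // Adds keepValidPages to a Seq[Record]. An RDD-based version would
  // presumably apply the same predicate through rdd.filter (assumption).
  implicit class RecordOps(val records: Seq[Record]) extends AnyVal {
    def keepValidPages(): Seq[Record] =
      records.filter { r =>
        r.mimeType == "text/html" && // same effect as keepMimeTypes(Set("text/html"))
        r.crawlDate != null          // same effect as discardDate(null)
      }
  }
}

object Demo {
  import ValidPages._

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Record("text/html", "20080430"), // kept
      Record("image/png", "20080430"), // dropped: wrong MIME type
      Record("text/html", null)        // dropped: no crawl date
    )
    println(records.keepValidPages())
  }
}
```

With something like this in place, the opening of a script would shrink to a load followed by a single `.keepValidPages()` call, and any extra checks could be added to the one predicate.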