We start off our scripts like:

    val r = RecordLoader.loadArc("/path/to/files", sc)
      .keepMimeTypes(Set("text/html"))
      .discardDate(null)

Which has basically become an idiom. We should write a keepValidPages transformation that combines keepMimeTypes and discardDate. We can then make it a little smarter:
- throw out robots.txt
- keep page if it ends in .htm or .html even if the MIME type isn't correct
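
To make the proposal concrete, here is a rough sketch of the filtering rule on plain Scala collections. Everything in it is an assumption for illustration: `Page` is a hypothetical stand-in for the real record type, `ValidPages`, `isValidPage`, `hasHtmlExtension` and `isRobotsTxt` are made-up names, and treating "ends in .htm or .html" as a check on the URL with any query string or fragment stripped is one possible reading. It also assumes URLs are non-null.

```scala
// Hypothetical stand-in for an archive record; not the project's actual type.
final case class Page(url: String, mimeType: String, crawlDate: String)

object ValidPages {
  // URL without query string or fragment, lower-cased for comparison.
  private def pathOf(url: String): String =
    url.takeWhile(c => c != '?' && c != '#').toLowerCase

  // True when the URL ends in .htm or .html.
  private def hasHtmlExtension(url: String): Boolean = {
    val path = pathOf(url)
    path.endsWith(".htm") || path.endsWith(".html")
  }

  // True when the URL points at a robots.txt file.
  private def isRobotsTxt(url: String): Boolean =
    pathOf(url).endsWith("/robots.txt")

  // Combines the existing idiom (HTML MIME type, non-null crawl date)
  // with the two proposed refinements.
  def isValidPage(p: Page): Boolean =
    p.crawlDate != null &&
      !isRobotsTxt(p.url) &&
      (p.mimeType == "text/html" || hasHtmlExtension(p.url))

  def keepValidPages(pages: Seq[Page]): Seq[Page] =
    pages.filter(isValidPage)
}
```

The same predicate could presumably be handed to an RDD's `filter` so that `keepValidPages` chains after `loadArc` like the existing transformations, but that wiring depends on the real record class and is not shown here.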