keepValidPages discards XHTML

lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives

http://warcbase.org/


keepValidPages discards XHTML #252

Closed by anjackson 7 years ago

anjackson commented 7 years ago

Just noticed that keepValidPages only checks for 'text/html' and not for 'application/xhtml+xml'. We found there's quite a lot of XHTML in our archives, so discarding the XHTML records whose URLs don't happen to end in '.html' would have a significant impact on the results.
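For illustration, here is a rough sketch of the kind of check being asked for. It is not warcbase's actual code (warcbase is written in Java/Scala); `is_valid_page` and `HTML_MIME_TYPES` are hypothetical names, and the '.html' URL fallback is only inferred from the behaviour described above.

```python
from typing import Optional

# Assumed set of MIME types to treat as HTML pages; this is an
# illustration, not the list warcbase itself uses.
HTML_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def is_valid_page(mime_type: Optional[str], url: str) -> bool:
    """Hypothetical predicate: True if a record looks like an HTML or XHTML page."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    base_type = (mime_type or "").split(";", 1)[0].strip().lower()
    # Ignore any query string or fragment before checking the URL suffix.
    path = url.split("?", 1)[0].split("#", 1)[0].lower()
    return base_type in HTML_MIME_TYPES or path.endswith(".html")
```

With a check along these lines, a record served as 'application/xhtml+xml' is kept even when its URL has no '.html' suffix.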

lintool commented 7 years ago

Good catch, thanks!