Just noticed that keepValidPages only checks for 'text/html' and not for 'application/xhtml+xml'. We've found quite a lot of XHTML in our archives, so discarding the XHTML pages whose URLs don't happen to end in '.html' would have a significant impact on the results.
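
To illustrate the kind of check I have in mind, here's a rough sketch in Python. It is not the project's actual code: `is_html_mime_type` and `HTML_MIME_TYPES` are made-up names, and I'm assuming the MIME type arrives as a raw Content-Type string that may carry parameters like a charset.

```python
# Hypothetical helper for illustration only; not the real keepValidPages logic.
# Media types that should count as "HTML pages" for filtering purposes.
HTML_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_mime_type(content_type: str) -> bool:
    """Return True if a Content-Type value names an HTML or XHTML media type."""
    # Drop any parameters (e.g. "; charset=utf-8") and normalise case,
    # since media type names are case-insensitive.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MIME_TYPES
```

With something like this, a record served as 'application/xhtml+xml' would be kept regardless of what its URL ends with.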