Support mildly malformed and compressed Sitemaps

anjackson commented 5 years ago

Running in stage/pre-prod and seeing some sitemaps that we can't parse.

- Compressed ones like https://www.ebay.co.uk/lst/GTC-3-06-04-2019_6.xml.gz
- "Content not allowed in prolog": http://www.bbc.co.uk/mobile_sitemap.xml
- "White spaces are required between publicId and systemId": http://www.bbc.co.uk/ukchina/simp/sitemap.xml (see Stack Overflow)

The GZip one will require content sniffing, which may already be supported by crawler-commons. The XML parser seems to be a bit overzealous here. Maybe it can be made more forgiving?
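For illustration, content sniffing for GZip can be as simple as checking the two-byte magic number at the start of the payload rather than trusting the URL suffix or headers. This is only a rough Python sketch of the idea, not the crawler's code or the crawler-commons API; `maybe_decompress` is a hypothetical helper name.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_decompress(payload: bytes) -> bytes:
    """Return the decompressed bytes if the payload looks like gzip,
    otherwise return the payload unchanged.

    Hypothetical helper: sniffs the content itself instead of relying
    on a .gz suffix or a Content-Type header.
    """
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload
```

Sniffing the bytes means a `.xml.gz` sitemap and a plain `.xml` one can go through the same code path before being handed to the XML parser.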

anjackson commented 5 years ago

Hmm, not convinced these XML errors are genuine. Need to work elsewhere in the stack to make sure we can inspect what the crawler downloaded. Maybe we got a BLOCKED response.

anjackson commented 5 years ago

Which reminds me, maybe:

- [x] Check that we only run the sitemap parser if the FetchStatus is 200
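Roughly what that check amounts to, as a Python sketch rather than the actual Heritrix code: skip parsing entirely unless the fetch returned 200, so redirect or block pages are never fed to the XML parser. The function name `extract_sitemap_urls` and its arguments are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(status_code: int, body: bytes) -> List[str]:
    """Return the <loc> URLs from a sitemap, or an empty list if the
    fetch did not return 200 (hypothetical helper for illustration)."""
    if status_code != 200:
        # A redirect, error, or block page is not a sitemap; don't parse it.
        return []
    root = ET.fromstring(body)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

With a guard like this, an HTML body returned alongside a 3xx status would be skipped instead of surfacing as a confusing XML parse error.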

anjackson commented 5 years ago

Yes, suspect those errors were mostly because of redirects. Currently simplifying the code to rely on Crawler Commons to do most of the heavy lifting.

anjackson commented 5 years ago

Rolling a 2.6.2 release and will run this new version in to check all is well.

ukwa / ukwa-heritrix

Support mildly malformed and compressed Sitemaps #44