ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Support mildly malformed and compressed Sitemaps #44

anjackson closed this issue 5 years ago

anjackson commented 5 years ago

Running in stage/pre-prod and seeing some sitemaps that we can't parse.

The GZip one will require content sniffing, which may already be supported by crawler-commons. The XML parser seems to be a bit overzealous here. Maybe it can be made more forgiving?
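For illustration, here is a minimal sketch of what content sniffing for a compressed sitemap could look like using only the Java standard library. It is not the crawler-commons or ukwa-heritrix implementation; the class name `SitemapSniffer` and its methods are hypothetical. It peeks at the first two bytes of the payload and, if they match the gzip magic number (0x1f 0x8b), wraps the stream in a `GZIPInputStream` before it would be handed to an XML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of crawler-commons or Heritrix.
public final class SitemapSniffer {

    // The two-byte gzip magic number.
    private static final int GZIP_MAGIC_0 = 0x1f;
    private static final int GZIP_MAGIC_1 = 0x8b;

    /** Returns a decompressing stream if the payload starts with the gzip magic, else the payload as-is. */
    public static InputStream maybeGunzip(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];
        int n = 0;
        // Read up to two bytes; short or empty payloads are allowed.
        while (n < 2) {
            int r = pb.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        // Push the peeked bytes back so the consumer sees the full payload.
        if (n > 0) {
            pb.unread(head, 0, n);
        }
        boolean gzipped = n == 2
                && (head[0] & 0xff) == GZIP_MAGIC_0
                && (head[1] & 0xff) == GZIP_MAGIC_1;
        return gzipped ? new GZIPInputStream(pb) : pb;
    }

    /** Convenience wrapper: sniff, decompress if needed, and decode as UTF-8. */
    public static String decodeToString(byte[] payload) {
        try (InputStream in = maybeGunzip(new ByteArrayInputStream(payload))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int r;
            while ((r = in.read(buf)) != -1) {
                out.write(buf, 0, r);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: gzip-compress a byte array. */
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = "<urlset/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeToString(plain));       // prints "<urlset/>"
        System.out.println(decodeToString(gzip(plain))); // prints "<urlset/>"
    }
}
```

Sniffing the bytes rather than trusting the URL suffix or `Content-Type` header means a `.xml` URL that actually serves gzip data, or the reverse, would still be handled. Whether crawler-commons already does something equivalent is the open question raised above.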

anjackson commented 5 years ago

Hmm, not convinced these XML errors are genuine. Need to work elsewhere in the stack to make sure we can inspect what the crawler downloaded. Maybe we got a BLOCKED response.

anjackson commented 5 years ago

Which reminds me, maybe:

anjackson commented 5 years ago

Yes, suspect those errors were mostly because of a redirect. Currently simplifying the code to rely on Crawler Commons to do most of the heavy lifting.

anjackson commented 5 years ago

Rolling a 2.6.2 release and will run this new version in to check all is well.