Closed anjackson closed 5 years ago
Hmm, not convinced these XML errors are genuine. Need to work elsewhere in the stack to make sure we can inspect what the crawler downloaded. Maybe we got a BLOCKED response.
Which reminds me, maybe:
Yes, I suspect those errors were mostly caused by redirects. Currently simplifying the code to rely on Crawler Commons to do most of the heavy lifting.
Rolling a 2.6.2 release, and will run this new version in to check all is well.
Running in stage/pre-prod and seeing some sitemaps that we can't parse:
The GZip one will require content sniffing, which may already be supported by crawler-commons. The XML parser seems to be a bit overzealous here; maybe it can be made more forgiving?
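For the sniffing part, a sketch of what I have in mind (this is a hypothetical helper, not the crawler-commons API; gzip streams always start with the two magic bytes `0x1F 0x8B` per RFC 1952, so we can detect them regardless of what `Content-Type` the server claims):

```java
public class ContentSniffer {

    // Returns true if the payload starts with the gzip magic bytes
    // (0x1F 0x8B), per RFC 1952, regardless of the declared MIME type.
    public static boolean looksLikeGzip(byte[] content) {
        return content != null
                && content.length >= 2
                && (content[0] & 0xFF) == 0x1F
                && (content[1] & 0xFF) == 0x8B;
    }

    public static void main(String[] args) {
        byte[] gz = { (byte) 0x1F, (byte) 0x8B, 0x08 };
        byte[] xml = "<?xml version=\"1.0\"?>".getBytes();
        System.out.println(looksLikeGzip(gz));   // true
        System.out.println(looksLikeGzip(xml));  // false
    }
}
```

If the sniff says gzip, we'd decompress before handing the body to the sitemap parser (or let crawler-commons do so, if it supports that). The `& 0xFF` masking is needed because Java's `byte` is signed, so `0x8B` would otherwise compare as a negative value.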