Open seagatesoft opened 1 month ago
It might be worth it to find out why the earlier `if isinstance(response, XmlResponse):` check did not work for those, though. I suspect https://github.com/scrapy/scrapy/pull/5204 might help here.
I am not able to reproduce this locally on plain Scrapy v2.11.0.
In this case, the response objects from all the mentioned URLs that reached the `_get_sitemap_body` method were identified as `scrapy.http.response.xml.XmlResponse`,
which means the original `_get_sitemap_body` method from `SitemapSpider` should identify the responses as valid sitemaps via the condition `if isinstance(response, XmlResponse):`
https://github.com/scrapy/scrapy/blob/2f1d345e74d19e33016f9e69fcda0bda9afb568d/scrapy/spiders/sitemap.py#L88-L93
before ever reaching the URL-suffix check at https://github.com/scrapy/scrapy/blob/2f1d345e74d19e33016f9e69fcda0bda9afb568d/scrapy/spiders/sitemap.py#L117-L118
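To make the ordering concrete, here is a simplified sketch of that logic, paraphrased from the linked source (the stand-in `Response`/`XmlResponse` classes are hypothetical minimal versions so the flow can be demonstrated without Scrapy installed; the real method lives on `SitemapSpider` and uses `scrapy.utils.gz` helpers):

```python
import gzip

# Hypothetical minimal stand-ins for scrapy.http.Response / XmlResponse:
class Response:
    def __init__(self, url, body=b""):
        self.url = url
        self.body = body

class XmlResponse(Response):
    pass

def get_sitemap_body(response):
    # 1. Type-based check runs first: responses decompressed by
    #    HttpCompressionMiddleware arrive here already as XmlResponse.
    if isinstance(response, XmlResponse):
        return response.body
    # 2. Raw gzipped body (the real method uses scrapy.utils.gz helpers).
    if response.body[:2] == b"\x1f\x8b":
        return gzip.decompress(response.body)
    # 3. URL-suffix fallback, which misses URLs carrying query parameters.
    if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):
        return response.body
    return None
```

So a response that is already an `XmlResponse` is accepted at step 1, regardless of what its URL looks like.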
> It might be worth it to find out why the earlier `if isinstance(response, XmlResponse):` did not work for those, though. I suspect https://github.com/scrapy/scrapy/pull/5204 might help here.
Originally, Scrapy creates a plain `Response` object, as the body contains binary compressed data. Later, in `HttpCompressionMiddleware.process_response`, after decompression the response object is recreated as an `XmlResponse` instance compatible with `SitemapSpider`:
https://github.com/scrapy/scrapy/blob/02b97f98e74a994ad3e4d74e7ed55207e508a576/scrapy/downloadermiddlewares/httpcompression.py#L138-L150
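A rough sketch of that recreation step, paraphrasing the linked middleware code (the classes and the `sniff_response_class` helper below are hypothetical stand-ins, not Scrapy's real implementation, which uses `responsetypes.from_args()` and `Response.replace(cls=...)`):

```python
import gzip

# Hypothetical minimal stand-ins to sketch the class swap:
class Response:
    def __init__(self, url, body=b""):
        self.url = url
        self.body = body

    def replace(self, cls=None, body=None):
        # mimics scrapy.http.Response.replace(cls=..., body=...)
        cls = cls or type(self)
        return cls(self.url, self.body if body is None else body)

class XmlResponse(Response):
    pass

def sniff_response_class(body):
    # crude stand-in for responsetypes.from_args(), which picks a
    # Response subclass by inspecting headers/body
    return XmlResponse if body.lstrip().startswith(b"<") else Response

def process_response(response):
    # relevant part of HttpCompressionMiddleware.process_response,
    # paraphrased: decompress, re-sniff the type, rebuild the response
    if response.body[:2] == b"\x1f\x8b":  # gzip magic number
        decoded = gzip.decompress(response.body)
        respcls = sniff_response_class(decoded)
        response = response.replace(cls=respcls, body=decoded)
    return response
```

This is why, by the time `_get_sitemap_body` runs, a gzipped sitemap has normally already become an `XmlResponse`.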
Is it possible that the original problem happens on an older Scrapy version or with some `SitemapSpider` methods overridden? @seagatesoft
Description
Some sitemaps have URLs with query parameters; examples:
The current implementation of `_get_sitemap_body` will fail to detect those URLs as sitemaps because it performs the following check: `if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):`
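For illustration (with a hypothetical URL), `str.endswith` fails as soon as the sitemap URL carries a query string:

```python
url = "https://example.com/sitemap.xml?page=2"  # hypothetical example URL

# The suffix check used by _get_sitemap_body:
matches = url.endswith(".xml") or url.endswith(".xml.gz")
print(matches)  # False: the URL ends with "?page=2", not ".xml"
```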
So far I have fixed the issue by overriding `_get_sitemap_body` to:
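The reporter's actual override is not shown above. One plausible approach (a sketch under my own assumptions, not the original code) is to run the suffix check against the URL path only, so query parameters no longer defeat it:

```python
from urllib.parse import urlparse

def url_path_looks_like_sitemap(url):
    # Strip the query string (and fragment) before the suffix check,
    # so "https://example.com/sitemap.xml?page=2" still matches.
    path = urlparse(url).path
    return path.endswith(".xml") or path.endswith(".xml.gz")
```

Inside an overridden `_get_sitemap_body`, a helper like this would replace the bare `response.url.endswith(...)` comparison.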