scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

SitemapSpider will ignore sitemap with URLs like https://website.com/filename.xml?from=7155352010944&to=7482320519360 #6293

Open seagatesoft opened 1 month ago

seagatesoft commented 1 month ago

Description

Some sitemaps have URLs with query parameters, for example:

  1. https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360
  2. https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203
  3. https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ

The current implementation of _get_sitemap_body fails to detect those URLs as sitemaps, because it performs the following check:

if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):
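For example, the first URL above fails that suffix check even though the response is a sitemap, while the same URL with the query string stripped passes it (a quick illustration, not Scrapy code):

```python
>>> url = "https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360"
>>> url.endswith(".xml") or url.endswith(".xml.gz")
False
>>> url.split("?")[0].endswith(".xml")
True
```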

So far I have worked around the issue by overriding _get_sitemap_body as follows:

def _get_sitemap_body(self, response):
    if response.url.split("?")[0].endswith(".xml"):
        return response.body
    return super()._get_sitemap_body(response)
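
A slightly more defensive variant of the same workaround (only a sketch of the idea, not anything that ships with Scrapy) would parse the URL with urllib.parse.urlsplit, so that the query string and any fragment are ignored and gzipped sitemap URLs are also covered:

```python
from urllib.parse import urlsplit


def _get_sitemap_body(self, response):
    # Compare only the URL path, ignoring any ?query or #fragment part.
    path = urlsplit(response.url).path
    if path.endswith(".xml") or path.endswith(".xml.gz"):
        return response.body
    return super()._get_sitemap_body(response)
```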
Gallaecio commented 1 month ago

It might be worth it to find out why the earlier if isinstance(response, XmlResponse): did not work for those, though. I suspect https://github.com/scrapy/scrapy/pull/5204 might help here.

GeorgeA92 commented 1 month ago

I am not able to reproduce this locally with plain Scrapy 2.11.0.

script.py

```python
import scrapy
from scrapy.crawler import CrawlerProcess as Cp


class SitemapTestSpider(scrapy.spiders.sitemap.SitemapSpider):
    name = "quotes"
    custom_settings = {"DOWNLOAD_DELAY": 1}
    sitemap_urls = [
        'https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360',
        'https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203',
        'https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ'
    ]

    def _get_sitemap_body(self, response):
        # self.logger.info(f"data for {response.url}")
        # headers = '\n\t\t'.join([f"{k}:{v}" for k,v in response.headers.items()])
        # self.logger.info(f"{headers}")
        self.logger.info(
            f"{'!!!' if isinstance(response, scrapy.http.XmlResponse) else ''}"
            f"{response.url} \n identified as {response.__class__} ")


if __name__ == "__main__":
    proc = Cp(); proc.crawl(SitemapTestSpider); proc.start()
```
log output

```
2024-03-19 14:07:28 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-03-19 14:07:28 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w 11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22631-SP0
2024-03-19 14:07:28 [scrapy.addons] INFO: Enabled addons: []
2024-03-19 14:07:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-03-19 14:07:28 [scrapy.extensions.telnet] INFO: Telnet Password: e7ff9d2a81697957
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2024-03-19 14:07:28 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 1}
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-19 14:07:28 [scrapy.middleware] INFO: Enabled item pipelines: []
2024-03-19 14:07:28 [scrapy.core.engine] INFO: Spider opened
2024-03-19 14:07:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-19 14:07:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-19 14:07:29 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2024-03-19 14:07:29 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2024-03-19 14:07:29 [quotes] INFO: !!!https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203
 identified as
2024-03-19 14:07:29 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203>
2024-03-19 14:07:29 [quotes] INFO: !!!https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360
 identified as
2024-03-19 14:07:29 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360>
2024-03-19 14:07:30 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2024-03-19 14:07:30 [quotes] INFO: !!!https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ
 identified as
2024-03-19 14:07:30 [scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ>
2024-03-19 14:07:30 [scrapy.core.engine] INFO: Closing spider (finished)
2024-03-19 14:07:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1082,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 456034,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 1.430959,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 3, 19, 13, 7, 30, 256888, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 7681561,
 'httpcompression/response_count': 3,
 'log_count/DEBUG': 4,
 'log_count/INFO': 13,
 'log_count/WARNING': 3,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2024, 3, 19, 13, 7, 28, 825929, tzinfo=datetime.timezone.utc)}
2024-03-19 14:07:30 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0
```

In this case, the response objects for all of the mentioned URLs that reached the _get_sitemap_body method were identified as scrapy.http.response.xml.XmlResponse, which means the original _get_sitemap_body method of the sitemap spider should already identify these responses as valid sitemaps through the condition if isinstance(response, XmlResponse): at https://github.com/scrapy/scrapy/blob/2f1d345e74d19e33016f9e69fcda0bda9afb568d/scrapy/spiders/sitemap.py#L88-L93, before ever reaching the URL-suffix check at https://github.com/scrapy/scrapy/blob/2f1d345e74d19e33016f9e69fcda0bda9afb568d/scrapy/spiders/sitemap.py#L117-L118.
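
In other words, the method roughly has this shape (paraphrased from the lines linked above, not a verbatim copy), and the isinstance branch should return the body long before the URL-suffix fallback is evaluated:

```python
# Paraphrased structure of SitemapSpider._get_sitemap_body (not verbatim):
def _get_sitemap_body(self, response):
    if isinstance(response, XmlResponse):  # sitemap.py#L88-L93
        return response.body
    # ... further checks (e.g. gzip-compressed bodies) ...
    if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):  # sitemap.py#L117-L118
        return response.body
```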

> It might be worth it to find out why the earlier if isinstance(response, XmlResponse): did not work for those, though. I suspect https://github.com/scrapy/scrapy/pull/5204 might help here.

Originally, Scrapy creates a plain Response object, since the body contains binary compressed data. Later, in HttpCompressionMiddleware.process_response, after decompression the response object is recreated as an XmlResponse instance that is compatible with SitemapSpider: https://github.com/scrapy/scrapy/blob/02b97f98e74a994ad3e4d74e7ed55207e508a576/scrapy/downloadermiddlewares/httpcompression.py#L138-L150
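
A simplified sketch of that recreation step, assuming a gzip-encoded body and using a hypothetical helper name (the real middleware handles several encodings and edge cases):

```python
import gzip

from scrapy.responsetypes import responsetypes


def rebuild_decompressed_response(response):
    # Assumption: the body was served with Content-Encoding: gzip.
    decoded_body = gzip.decompress(response.body)
    # Re-detect the response class from headers, URL and the *decoded* body,
    # so a compressed sitemap comes out of the middleware as an XmlResponse.
    respcls = responsetypes.from_args(
        headers=response.headers, url=response.url, body=decoded_body
    )
    return response.replace(cls=respcls, body=decoded_body)
```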

wRAR commented 1 month ago

Is it possible that the original problem happens on an older Scrapy version or with some SitemapSpider methods overridden? @seagatesoft