scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy
BSD 3-Clause "New" or "Revised" License

Windows: crawl freezes when multiple schemas are set in DOWNLOAD_HANDLERS #290

Closed FuturaTino closed 2 months ago

FuturaTino commented 2 months ago

Please forgive me if anything reads as impolite; I am using a translator. My basic situation is the same as the one described in this issue: Spiders hangs. I have traced the problem to this setting:

"DOWNLOAD_HANDLERS": {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

Code

When I comment out DOWNLOAD_HANDLERS, the spider runs normally. Here is my code; you can run it directly and try.

from pathlib import Path
from scrapy.crawler import CrawlerProcess
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                  meta={"playwright": True},)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")

if __name__=="__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
        }
    )
    process.crawl(QuotesSpider)
    process.start()

Result

(env4gpt) D:\Repo\crawl>D:/anaconda3/envs/env4gpt/python.exe d:/Repo/crawl/main.py
2024-07-05 19:36:00 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-07-05 19:36:00 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.0, Twisted 24.3.0, Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.19045-SP0
2024-07-05 19:36:00 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-05 19:36:00 [py.warnings] WARNING: D:\anaconda3\envs\env4gpt\lib\site-packages\scrapy\utils\request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-07-05 19:36:00 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-05 19:36:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-05 19:36:00 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-05 19:36:00 [scrapy.extensions.telnet] INFO: Telnet Password: 6b77837564f0d1cc
2024-07-05 19:36:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-07-05 19:36:00 [scrapy.crawler] INFO: Overridden settings:
{'CONCURRENT_REQUESTS': 32,
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-05 19:36:01 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-05 19:36:01 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-05 19:36:01 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-05 19:36:01 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-05 19:36:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-05 19:36:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-05 19:36:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-05 19:36:01 [scrapy.core.engine] INFO: Spider opened
2024-07-05 19:36:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 19:36:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023   
2024-07-05 19:37:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 19:38:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-05 19:39:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

The log continues to repeat "[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)".

Conda Environment

openpyxl==3.1.4
scrapy==2.11.2
lxml==5.2.2
python-dotenv==1.0.1
scrapy-playwright==0.0.37

System

Windows 10

Guess

Could this be related to the operating system? I do not currently have a Linux machine, so I cannot verify that. Also, I am in China; could it be related to the IP address? However, I checked the ScrapyPlaywrightDownloadHandler class and did not spot any bugs. Thank you for your help.

DJousto commented 2 months ago

Same issue for me on Windows 10, with the same component versions. It seems the browser is not launched at all: even if I add PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False}, no browser window appears.

FuturaTino commented 2 months ago

Same issue for me on Windows 10, with the same component versions. It seems the browser is not launched at all: even if I add PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False}, no browser window appears.

The bug is not fixed yet; you can try the plain playwright package for Python as an alternative way to develop your spider.
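For anyone who wants to try the plain-Playwright alternative mentioned above, here is a minimal sketch (not code from this thread) that fetches the same pages with the playwright package's sync API. The helper names output_filename and save_pages are hypothetical; output_filename mirrors the spider's quotes-{page}.html naming.

```python
from pathlib import Path


def output_filename(url: str) -> str:
    """Mirror the spider's naming: quotes-<page>.html from the last URL path segment."""
    page = url.rstrip("/").split("/")[-1]
    return f"quotes-{page}.html"


def save_pages(urls):
    """Fetch each URL with plain Playwright (sync API) and save the rendered HTML."""
    # Imported lazily so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for url in urls:
            page.goto(url)
            Path(output_filename(url)).write_text(page.content(), encoding="utf-8")
        browser.close()
```

Usage would be, e.g., save_pages(["https://quotes.toscrape.com/page/1/", "https://quotes.toscrape.com/page/2/"]); this sidesteps Scrapy's download handlers entirely, so it is only a stopgap while the freeze is unexplained.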

elacuesta commented 2 months ago

I have been developing the package and testing crawls on Windows 11, installing packages with pip in a virtual environment (without Conda). I do not currently have access to a Windows 10 system; however, the CI Windows build runs on windows-latest, which according to the most recent docs on the images is the same as windows-2022 and includes Windows 10.0.20348 Build 2529. There was an earlier issue that involved Conda as well: #244. I could not reproduce it without Conda; given that and the above, I'm inclined to believe that Conda might be the issue here. Please try again without Conda to see if the issue is resolved.

FuturaTino commented 2 months ago

I have been developing the package and testing crawls on Windows 11, installing packages with pip in a virtual environment (without Conda). I do not currently have access to a Windows 10 system; however, the CI Windows build runs on windows-latest, which according to the most recent docs on the images is the same as windows-2022 and includes Windows 10.0.20348 Build 2529. There was an earlier issue that involved Conda as well: #244. I could not reproduce it without Conda; given that and the above, I'm inclined to believe that Conda might be the issue here. Please try again without Conda to see if the issue is resolved.

Thank you for your help. It seems Conda is not the problem in my situation. I deactivated the Conda environment, used the default python.exe on my computer, and installed the scrapy-playwright library there. Even after ensuring that I am running the program with the default python.exe rather than the one from the Conda environment, the issue persists.

D:\Repo\crawl>C:/Python312/python.exe d:/Repo/crawl/main.py
2024-07-09 01:59:19 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-07-09 01:59:19 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.19045-SP0
2024-07-09 01:59:19 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-09 01:59:19 [py.warnings] WARNING: C:\Python312\Lib\site-packages\scrapy\utils\request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in 
a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-07-09 01:59:19 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-09 01:59:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor   
2024-07-09 01:59:19 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-09 01:59:19 [scrapy.extensions.telnet] INFO: Telnet Password: e6ffaca82a959077
2024-07-09 01:59:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-07-09 01:59:19 [scrapy.crawler] INFO: Overridden settings:
{'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-09 01:59:19 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-09 01:59:19 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-09 01:59:19 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-09 01:59:19 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-09 01:59:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-09 01:59:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-09 01:59:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-09 01:59:20 [scrapy.core.engine] INFO: Spider opened
2024-07-09 01:59:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 01:59:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-09 02:00:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 02:01:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 02:02:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 02:03:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

I also created a venv environment and ran the code; the problem still persists.

elacuesta commented 2 months ago

I was wrong about Conda; I was able to reproduce the issue on Windows 11 as well. What I found is that the spider works correctly if only one of the schemas is set in the DOWNLOAD_HANDLERS setting, e.g.:

"DOWNLOAD_HANDLERS": {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

Fortunately plaintext http is rare nowadays, so only enabling the handler for the https schema should be enough for most sites. I do not know why this is happening, I will need to check in more depth.
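As an illustration of this workaround (a sketch, not maintainer-provided code), the DOWNLOAD_HANDLERS dict can be built conditionally so the plain-http entry is dropped only on Windows. The helper name playwright_download_handlers is hypothetical.

```python
import sys

# Dotted path of the scrapy-playwright download handler, as used in this thread.
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"


def playwright_download_handlers(skip_http: bool = (sys.platform == "win32")) -> dict:
    """Build a DOWNLOAD_HANDLERS dict, omitting the plain-http entry on Windows
    to avoid the freeze described in this issue."""
    handlers = {"https": PLAYWRIGHT_HANDLER}
    if not skip_http:
        handlers["http"] = PLAYWRIGHT_HANDLER
    return handlers
```

In the reproduction script above, this would be passed as "DOWNLOAD_HANDLERS": playwright_download_handlers() in the CrawlerProcess settings, keeping both schemas on platforms where the bug does not occur.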

FuturaTino commented 2 months ago

I was wrong about Conda; I was able to reproduce the issue on Windows 11 as well. What I found is that the spider works correctly if only one of the schemas is set in the DOWNLOAD_HANDLERS setting, e.g.:

"DOWNLOAD_HANDLERS": {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},

Fortunately plaintext http is rare nowadays, so only enabling the handler for the https schema should be enough for most sites. I do not know why this is happening, I will need to check in more depth.

Use Conda and set a single https schema: ❌
Don't use Conda and set a single https schema: ✅

Yes, you are right. I used my default python.exe (venv also works), deleted the http schema, and the spider ran successfully. The bug has been circumvented, but the cause remains unknown.

elacuesta commented 2 months ago

Fixed as of v0.0.40.