Closed — FuturaTino closed this issue 4 months ago.
Same issue for me on Windows 10, with the same component versions. It seems the browser is not launched at all: even with PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False} set, I see no browser window.
The bug is not fixed yet; you can try python-playwright as an alternative way to develop your spider program.
I have been developing the package and testing crawls on Windows 11, installing packages with pip in a virtual environment (without Conda). I do not currently have access to a Windows 10 system; however, the CI Windows build runs on windows-latest, which according to the most recent docs on the images is the same as windows-2022 and includes Windows 10.0.20348 Build 2529.
There was an earlier issue that also involved Conda: #244. I could not reproduce it without Conda; given that and the above, I'm inclined to believe that Conda might be the issue here. Please try again without Conda to see if the issue is resolved.
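A plain-venv setup without Conda might look like this (a sketch: package names are the usual PyPI ones and the spider name is a placeholder):

```shell
# Reproduce outside Conda: a plain venv and the standard PyPI packages.
# Shown for a POSIX shell; on Windows, activate with: venv\Scripts\activate
python3 -m venv venv
. venv/bin/activate
pip install scrapy scrapy-playwright
playwright install chromium          # download the Chromium binaries
scrapy crawl myspider                # "myspider" is a placeholder name
```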
Thank you for your help. In my case, though, it does not seem to be a Conda problem. I deactivated the Conda environment, used the default python.exe on my computer, and installed the scrapy-playwright library there. Even after making sure I am running the program with the default python.exe rather than the one from the Conda environment, the issue persists.
D:\Repo\crawl>C:/Python312/python.exe d:/Repo/crawl/main.py
2024-07-09 01:59:19 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-07-09 01:59:19 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.12.0 (tags/v3.12.0:0fb18b0, Oct 2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.19045-SP0
2024-07-09 01:59:19 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-09 01:59:19 [py.warnings] WARNING: C:\Python312\Lib\site-packages\scrapy\utils\request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in
a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2024-07-09 01:59:19 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-09 01:59:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-09 01:59:19 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-09 01:59:19 [scrapy.extensions.telnet] INFO: Telnet Password: e6ffaca82a959077
2024-07-09 01:59:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2024-07-09 01:59:19 [scrapy.crawler] INFO: Overridden settings:
{'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-09 01:59:19 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-09 01:59:19 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-09 01:59:19 [asyncio] DEBUG: Using proactor: IocpProactor
2024-07-09 01:59:19 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2024-07-09 01:59:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-09 01:59:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-09 01:59:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-09 01:59:20 [scrapy.core.engine] INFO: Spider opened
2024-07-09 01:59:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 01:59:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-09 02:00:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 02:01:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 02:02:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 02:03:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
I also created a venv environment and ran the code; the problem still persists.
I was wrong about Conda: I was able to reproduce the issue on Windows 11 as well. What I found is that the spider works correctly if only one of the schemes is set in the DOWNLOAD_HANDLERS setting, e.g.:
"DOWNLOAD_HANDLERS": {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    # "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
},
Fortunately, plaintext http is rare nowadays, so enabling the handler only for the https scheme should be enough for most sites.
I do not know why this is happening; I will need to look into it in more depth.
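Spelled out as a settings fragment, the workaround looks like this (a sketch; the reactor line is the asyncio reactor that scrapy-playwright requires):

```python
# settings.py fragment (sketch of the workaround described above):
# register the Playwright handler for https only and keep Scrapy's
# default handler for plain http.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    # "http" intentionally omitted: registering both schemes is what
    # triggered the hang on Windows in this report.
}
```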
Use Conda and set only the https scheme ❌
Don't use Conda and set only the https scheme ✅
Yes, you are right. I used my default python.exe (a venv also works), deleted the http scheme, and the spider ran successfully.
The bug has been circumvented, but the cause remains unknown.
If I have said anything impolite, please forgive me, as I am using a translator. My situation is basically the same as in the issue "Spiders hangs". I have traced the problem to this setting.
Code
When I comment out the DOWNLOAD_HANDLERS setting, the spider runs normally. Here is my code; you can run it directly and try.
Result
The log continues to repeat "[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)".
Conda Environment
System
Windows 10
Guess
Is it possible that this is related to the operating system? I do not currently have a Linux machine, so I cannot verify that. Also, I am in China; could it be related to the IP address? However, I checked the ScrapyPlaywrightDownloadHandler class and did not see any bugs. Thank you for your help.