Closed — rubmz closed this issue 2 months ago
I cannot reproduce, the spider works just fine for me:
(...)
2024-09-10 15:48:05 [scrapy.core.engine] INFO: Spider opened
2024-09-10 15:48:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Starting download handler
2024-09-10 15:48:05 [scrapy-playwright] INFO: Starting download handler
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching 2 startup context(s)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching browser webkit
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching 2 startup context(s)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching browser webkit
2024-09-10 15:48:05 [scrapy-playwright] INFO: Browser webkit launched
2024-09-10 15:48:05 [scrapy-playwright] INFO: Browser webkit launched
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'And' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Startup context(s) launched
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'And' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Startup context(s) launched
2024-09-10 15:48:10 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-10 15:48:10 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-10 15:48:10 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-09-10 15:48:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None) ['playwright']
response: https://example.org/
2024-09-10 15:48:11 [scrapy.core.engine] INFO: Closing spider (finished)
2024-09-10 15:48:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 212,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1602,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 5.874917,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 9, 10, 18, 48, 11, 412118, tzinfo=datetime.timezone.utc),
'log_count/DEBUG': 12,
'log_count/INFO': 18,
'memusage/max': 70029312,
'memusage/startup': 70029312,
'playwright/browser_count': 2,
'playwright/context_count': 5,
'playwright/context_count/max_concurrent': 3,
'playwright/context_count/persistent/False': 5,
'playwright/context_count/remote/False': 5,
'playwright/page_count': 1,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 1,
'playwright/request_count/method/GET': 1,
'playwright/request_count/navigation': 1,
'playwright/request_count/resource_type/document': 1,
'playwright/response_count': 1,
'playwright/response_count/method/GET': 1,
'playwright/response_count/resource_type/document': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 9, 10, 18, 48, 5, 537201, tzinfo=datetime.timezone.utc)}
2024-09-10 15:48:11 [scrapy.core.engine] INFO: Spider closed (finished)
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing download handler
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'And' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing browser
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser disconnected
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing download handler
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'And' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing browser
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser disconnected
$ scrapy version -v
2024-09-10 15:50:22 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scraper)
2024-09-10 15:50:22 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.1 4 Jun 2024), cryptography 43.0.0, Platform Linux-6.5.0-45-generic-x86_64-with-glibc2.35
Scrapy : 2.11.2
lxml : 5.2.2.0
libxml2 : 2.12.6
cssselect : 1.2.0
parsel : 1.9.1
w3lib : 2.2.1
Twisted : 24.3.0
Python : 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
pyOpenSSL : 24.2.1 (OpenSSL 3.3.1 4 Jun 2024)
cryptography : 43.0.0
Platform : Linux-6.5.0-45-generic-x86_64-with-glibc2.35
$ pip freeze | grep playwright
playwright==1.46.0
scrapy-playwright==0.0.41
The provided example is not self-contained at all; I had to make several adjustments to get it to work (dead code, irrelevant settings, missing item classes, pipelines, env variables, etc.).
Could it be that, because the spider is stopped externally by the spawner (or debugger), it leaves something running in the background? It is very much reproducible with my configuration for some reason... After waiting for 5 minutes I can re-run the spider without a problem.
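To check that hypothesis, here is a quick Linux-only sketch (a hypothetical helper, not part of scrapy-playwright) that lists any processes from a previous run that still mention playwright. If it prints anything right after the spawner kills the spider, the old browser is what the next launch trips over:

```python
from pathlib import Path


def leftover_playwright_processes() -> list[str]:
    """Return command lines of running processes that mention 'playwright'.

    Linux-only (reads /proc directly), matching the Ubuntu setup in this
    thread; a hypothetical helper, not part of scrapy-playwright.
    """
    leftovers: list[str] = []
    proc_root = Path("/proc")
    if not proc_root.exists():  # not on Linux, nothing to scan
        return leftovers
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():  # only numeric entries are PIDs
            continue
        try:
            # /proc/<pid>/cmdline is NUL-separated; join with spaces
            cmdline = (entry / "cmdline").read_bytes().replace(b"\0", b" ")
        except OSError:
            continue  # the process exited while we were scanning
        if b"playwright" in cmdline:
            leftovers.append(cmdline.decode(errors="replace").strip())
    return leftovers
```

Running it once right after the external kill, and again five minutes later, would show whether the "works again after 5 minutes" window lines up with old browser processes dying off.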
Hi,
Below is a minimal example of the code I use in my spider (spider.py, settings.py, …). The problem is that on the first call, and on subsequent calls until a few seconds have passed, 'playwright_page' is undefined in meta inside the parse() function, causing the call
page = response.meta['playwright_page']
to raise an exception. Why does this happen? Is there a service I need to initialize and wait for before starting? Currently I am using 'webkit' in my browser context, but I suspect that is not the cause. Or is it?

OS: Ubuntu 22.04, Python 3.12. Packages:
### site_scraper_spider.py:
### settings.py:
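The actual file contents were cut off above; minimally, scrapy-playwright needs settings along these lines (a generic sketch, not the reporter's real settings.py — the context names and browser type are taken from the log output earlier in the thread):

```python
# Generic scrapy-playwright settings sketch (assumed, not the reporter's file).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright only works with the asyncio reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "webkit"  # matches the browser in the log above
PLAYWRIGHT_CONTEXTS = {
    # Context names match the log above; their actual options are not known here.
    "IPHONE_12_MINI": {},
    "And": {},
}
```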
### spider_runner.py