alosultan opened this issue 1 year ago
Hi, could you provide a minimal, reproducible example? I'm able to run a spider using the CrawlerRunner as described in the Scrapy docs:
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet.asyncioreactor import install as install_asyncio_reactor


class TestSpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(url="https://example.org", meta={"playwright": True})

    def parse(self, response):
        yield {"url": response.url}


if __name__ == "__main__":
    install_asyncio_reactor()
    from twisted.internet import reactor

    configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
    runner = CrawlerRunner()
    d = runner.crawl(TestSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
$ python examples/reactor.py
INFO: Overridden settings:
{}
2022-10-17 18:36:37 [scrapy.extensions.telnet] INFO: Telnet Password: c1f8e1c8505cbd6f
2022-10-17 18:36:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-10-17 18:36:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-10-17 18:36:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-10-17 18:36:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-10-17 18:36:37 [scrapy.core.engine] INFO: Spider opened
2022-10-17 18:36:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-10-17 18:36:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-10-17 18:36:38 [scrapy-playwright] INFO: Starting download handler
2022-10-17 18:36:43 [scrapy-playwright] INFO: Launching browser chromium
2022-10-17 18:36:43 [scrapy-playwright] INFO: Browser chromium launched
2022-10-17 18:36:43 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False)
2022-10-17 18:36:43 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2022-10-17 18:36:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document, referrer: None)
2022-10-17 18:36:44 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/> (referrer: None)
2022-10-17 18:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None) ['playwright']
2022-10-17 18:36:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org/>
{'url': 'https://example.org/'}
2022-10-17 18:36:44 [scrapy.core.engine] INFO: Closing spider (finished)
2022-10-17 18:36:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1600,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 6.194073,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 10, 17, 21, 36, 44, 277270),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 13,
'memusage/max': 58142720,
'memusage/startup': 58142720,
'playwright/context_count': 1,
'playwright/context_count/max_concurrent': 1,
'playwright/context_count/non-persistent': 1,
'playwright/page_count': 1,
'playwright/page_count/closed': 1,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 1,
'playwright/request_count/method/GET': 1,
'playwright/request_count/navigation': 1,
'playwright/request_count/resource_type/document': 1,
'playwright/response_count': 1,
'playwright/response_count/method/GET': 1,
'playwright/response_count/resource_type/document': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 10, 17, 21, 36, 38, 83197)}
2022-10-17 18:36:44 [scrapy.core.engine] INFO: Spider closed (finished)
2022-10-17 18:36:44 [scrapy-playwright] INFO: Closing download handler
2022-10-17 18:36:44 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False)
2022-10-17 18:36:44 [scrapy-playwright] INFO: Closing browser
I created a Django project "channels-scrapy" with two applications:
myapp/management/commands/crawl.py:
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'Runs the specified spider'

    def add_arguments(self, parser):
        parser.add_argument('spider', type=str, help="The name of the spider to be located, instantiated, and crawled.")

    def handle(self, *args, **options):
        # An asyncio Twisted reactor has already been installed (an AsyncioSelectorReactor object)
        from twisted.internet import reactor
        configure_logging()
        runner = CrawlerRunner(settings=get_project_settings())
        d = runner.crawl(options['spider'])
        d.addBoth(lambda _: reactor.stop())
        reactor.run()  # the script will block here until the crawling is finished
scrapy_app/spiders.py:
import scrapy


class TestSpider(scrapy.Spider):
    name = "example"
    # If you comment out these settings, no problem appears.
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(url="https://example.org", meta={"playwright": True})

    def parse(self, response, **kwargs):
        yield {"url": response.url}
scrapy_app/settings.py:
BOT_NAME = 'scrapy_app'
SPIDER_MODULES = ['scrapy_app.spiders']
NEWSPIDER_MODULE = 'scrapy_app.spiders'
ROBOTSTXT_OBEY = True

# No need for this setting; the reactor will already be installed from outside.
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
The INSTALLED_APPS list includes the "daphne" app, as mentioned in the channels documentation, and "myapp":
INSTALLED_APPS = [
    "daphne",
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "myapp",
]
In general Django "channels-scrapy" project looks like this:
./channels-scrapy
├── config
│   ├── __init__.py
│   ├── asgi.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
├── manage.py
├── myapp
│   ├── __init__.py
│   ├── apps.py
│   ├── management
│   │   ├── __init__.py
│   │   └── commands
│   │       ├── __init__.py
│   │       └── crawl.py
│   ├── migrations
│   │   └── __init__.py
│   └── views.py
├── scrapy.cfg
└── scrapy_app
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders.py
Now when I run the spider example:
python manage.py crawl example
the application freezes and does not continue to work (note the line [asyncio] DEBUG: Using selector: KqueueSelector):
2022-10-31 14:33:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_app',
'NEWSPIDER_MODULE': 'scrapy_app.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_app.spiders']}
2022-10-31 14:33:10 [scrapy.extensions.telnet] INFO: Telnet Password: 394a0b2b4debf964
2022-10-31 14:33:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-10-31 14:33:10 [asyncio] DEBUG: Using selector: KqueueSelector <-------- it's strange here
2022-10-31 14:33:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-10-31 14:33:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-10-31 14:33:10 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-10-31 14:33:10 [scrapy.core.engine] INFO: Spider opened
2022-10-31 14:33:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-10-31 14:33:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
But if I turn off playwright, everything works fine and the line [asyncio] DEBUG: Using selector: KqueueSelector disappears:
2022-10-31 14:45:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_app',
'NEWSPIDER_MODULE': 'scrapy_app.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_app.spiders']}
2022-10-31 14:45:15 [scrapy.extensions.telnet] INFO: Telnet Password: 0cb868371d556578
2022-10-31 14:45:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-10-31 14:45:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-10-31 14:45:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-10-31 14:45:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-10-31 14:45:15 [scrapy.core.engine] INFO: Spider opened
2022-10-31 14:45:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-10-31 14:45:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-10-31 14:45:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://example.org/robots.txt> (referer: None)
2022-10-31 14:45:16 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
.........
2022-10-31 14:45:16 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2022-10-31 14:45:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2022-10-31 14:45:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org>
{'url': 'https://example.org'}
2022-10-31 14:45:16 [scrapy.core.engine] INFO: Closing spider (finished)
2022-10-31 14:45:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 432,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2034,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 0.965107,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 10, 31, 14, 45, 16, 728119),
'httpcompression/response_bytes': 2512,
'httpcompression/response_count': 2,
'item_scraped_count': 1,
'log_count/DEBUG': 17,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 69152768,
'memusage/startup': 69152768,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 10, 31, 14, 45, 15, 763012)}
2022-10-31 14:45:16 [scrapy.core.engine] INFO: Spider closed (finished)
No need for this setting.
TWISTED_REACTOR is still needed, I think. Scrapy checks whether the installed reactor matches the setting and complains otherwise.
Yes, unlike CrawlerProcess (which installs and verifies the reactor), CrawlerRunner only checks whether the installed reactor matches the TWISTED_REACTOR setting. So we can uncomment this setting just to check the installed reactor. But that still won't solve the problem.
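That reactor check can be pictured with a small, stdlib-only model. This is an illustration, not Scrapy's actual code (internally Scrapy uses scrapy.utils.misc.load_object and scrapy.utils.reactor.verify_installed_reactor): take the dotted path from TWISTED_REACTOR, import the class, and compare it against the installed reactor object.

```python
import importlib
from collections import OrderedDict  # stands in for a reactor class below


def reactor_matches(installed_reactor, reactor_path: str) -> bool:
    """Toy version of the check: does the installed reactor object match
    the class named by the dotted path in TWISTED_REACTOR?"""
    module_path, _, class_name = reactor_path.rpartition(".")
    reactor_cls = getattr(importlib.import_module(module_path), class_name)
    return isinstance(installed_reactor, reactor_cls)


# Demonstration with stdlib classes instead of real Twisted reactors:
print(reactor_matches(OrderedDict(), "collections.OrderedDict"))  # True
print(reactor_matches(object(), "collections.OrderedDict"))       # False
```

If the types don't match, CrawlerRunner raises an error instead of quietly installing the right reactor, which is the difference from CrawlerProcess being discussed here.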
Are you filtering some logs out? I see some DEBUG messages in your post, but Scrapy also logs the reactor (and event loop, if present) at the beginning of the crawl, like:
2022-10-31 13:16:39 [scrapy.crawler] INFO: Overridden settings:
{'EDITOR': 'nano',
'SPIDER_LOADER_WARN_ONLY': True,
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-10-31 13:16:39 [asyncio] DEBUG: Using selector: EpollSelector
2022-10-31 13:16:39 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-10-31 13:16:39 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
or
2022-10-31 13:17:19 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'EDITOR': 'nano',
'LOGSTATS_INTERVAL': 0}
2022-10-31 13:17:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
Scrapy logs the reactor if the setting TWISTED_REACTOR is given.
I only filtered out [py.warnings]:
2022-10-31 16:22:51 [py.warnings] WARNING: /Users/alosultan/Development/Python/Django/channels-scrapy/venv/.envs/lib/python3.9/site-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2022-10-31 14:33:10 [asyncio] DEBUG: Using selector: KqueueSelector <-------- it's strange here
What do you think about this DEBUG message? If I disable playwright, this message disappears.
Scrapy logs the reactor if the setting TWISTED_REACTOR is given.
That's from the "Overridden settings" line, not the one from scrapy.utils.log, which shows the actual reactor being used (https://github.com/scrapy/scrapy/blob/2.7.0/scrapy/utils/log.py#L157).
def handle(self, *args, **options):
    # An asyncio Twisted reactor has already been installed (an AsyncioSelectorReactor object)
    from twisted.internet import reactor
I don't understand where this is installed. I'm not that familiar with channels, but I suppose it might give you a running asyncio loop. The Twisted reactor works on top of that; are you sure it's also being installed?
I don't understand where this is installed. I'm not that familiar with channels, but I suppose it might give you a running asyncio loop. The Twisted reactor works on top of that; are you sure it's also being installed?
I checked as follows:
import sys

from twisted.internet import asyncioreactor


def handle(self, *args, **options):
    current_reactor = sys.modules.get("twisted.internet.reactor", None)
    print(isinstance(current_reactor, asyncioreactor.AsyncioSelectorReactor))  # True
    print(current_reactor.running)  # False
    # An asyncio Twisted reactor has already been installed (an AsyncioSelectorReactor object)
    from twisted.internet import reactor
    configure_logging()
    runner = CrawlerRunner(settings=get_project_settings())
    d = runner.crawl(options['spider'])
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
How is it being installed? Where in the code is there something like the following?
from twisted.internet.asyncioreactor import install
install()
How is it being installed? Where in the code is there something like the following?
from twisted.internet.asyncioreactor import install
install()
It is installed in the daphne/server.py module, which is imported in daphne/apps.py (the Django app configuration module).
daphne/server.py:
# This has to be done first as Twisted is import-order-sensitive with reactors
import asyncio  # isort:skip
import os  # isort:skip
import sys  # isort:skip
import warnings  # isort:skip
from concurrent.futures import ThreadPoolExecutor  # isort:skip
from twisted.internet import asyncioreactor  # isort:skip

twisted_loop = asyncio.new_event_loop()
if "ASGI_THREADS" in os.environ:
    twisted_loop.set_default_executor(
        ThreadPoolExecutor(max_workers=int(os.environ["ASGI_THREADS"]))
    )

current_reactor = sys.modules.get("twisted.internet.reactor", None)
if current_reactor is not None:
    if not isinstance(current_reactor, asyncioreactor.AsyncioSelectorReactor):
        warnings.warn(
            "Something has already installed a non-asyncio Twisted reactor. Attempting to uninstall it; "
            + "you can fix this warning by importing daphne.server early in your codebase or "
            + "finding the package that imports Twisted and importing it later on.",
            UserWarning,
        )
        del sys.modules["twisted.internet.reactor"]
        asyncioreactor.install(twisted_loop)
else:
    asyncioreactor.install(twisted_loop)
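The key pattern in that snippet: Twisted records the installed reactor as the object stored at sys.modules["twisted.internet.reactor"], and daphne evicts it if it has the wrong type before installing its own. That eviction logic can be modeled in isolation; the slot name and reactor classes below are toys for illustration, not real Twisted objects.

```python
import sys

# Toy stand-ins; real code deals with twisted.internet reactor classes.
class AsyncioReactor: ...
class SelectReactor: ...

SLOT = "toy.internet.reactor"  # real Twisted uses "twisted.internet.reactor"


def install(reactor_obj):
    # Twisted's install stores the reactor object itself in sys.modules,
    # which is why daphne can isinstance-check the entry directly.
    sys.modules[SLOT] = reactor_obj


def ensure_asyncio_reactor():
    current = sys.modules.get(SLOT, None)
    if current is not None and not isinstance(current, AsyncioReactor):
        del sys.modules[SLOT]  # evict the wrong reactor, as daphne does
        current = None
    if current is None:
        install(AsyncioReactor())
    return sys.modules[SLOT]
```

For example, installing a SelectReactor first and then calling ensure_asyncio_reactor() replaces it with an AsyncioReactor, mirroring daphne's warning-and-reinstall path.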
My settings:
My Scrapy app is running under another app (django-channels) that runs a twisted.internet.asyncioreactor.AsyncioSelectorReactor reactor in the process. Therefore, to run spiders from my custom Django management command, I use CrawlerRunner so as not to install a reactor on top of the one that is already installed.
But in this case, scrapy-playwright cannot start working. There is no line in the logs like:
In order for scrapy-playwright to start working properly, I have to:
Is there any idea how to continue using the already installed reactor?
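One direction worth exploring (a sketch under assumptions, not a confirmed fix): since the reactor daphne installed was not yet running inside the management command (current_reactor.running printed False above), calling reactor.run() there is fine; but if the crawl is ever triggered from a context where daphne's reactor is already spinning, the command must not call reactor.run() or reactor.stop() itself and should only let the crawl Deferred resolve in the running loop. The helper below, drive_crawl, is hypothetical and not part of Scrapy or daphne; it just captures that split.

```python
def drive_crawl(crawl_deferred, reactor):
    """Run the reactor for the crawl only if nobody else is running it.

    crawl_deferred: the Deferred returned by CrawlerRunner.crawl().
    reactor: the installed Twisted reactor (or anything with the same shape).
    """
    if getattr(reactor, "running", False):
        # Reactor already spinning (e.g. inside daphne): the crawl will
        # complete in that loop; do not call run()/stop() here.
        return "attached"
    # Nobody is driving the reactor yet: run it for the duration of the crawl.
    crawl_deferred.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until the crawl finishes and stop() fires
    return "ran"
```

In the "attached" case the caller would await the result asynchronously instead of blocking, which is a separate design question for the management command.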