scrapy-plugins / scrapy-playwright

🎭 Playwright integration for Scrapy

meta['playwright_page'] is None on first few attempts at using scrapy playwright #317

Closed: rubmz closed this issue 2 months ago

rubmz commented 2 months ago

Hi,

Below is a minimal example of the code I use in my spider (site_scraper_spider.py, settings.py, and spider_runner.py). The problem is that for the first request, and for the ones that follow until a few seconds have passed, 'playwright_page' is missing from meta in the parse() function, so the call page = response.meta['playwright_page'] raises an exception. Why does this happen? Is there a service I need to initialize and wait for before starting? I am currently using 'webkit' for my browser contexts, but I suspect that is not the cause, or is it?

OS: Ubuntu 22.04, Python 3.12. Packages:

### site_scraper_spider.py:

import json
import os

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from spiders.scraper.page_item import PageItem

class SiteScraperSpider(scrapy.Spider):
    allowed_domains = json.loads(os.getenv('SPIDER_ALLOWED_DOMAINS'))

    @staticmethod
    def set_playwright_meta(request, response):
        request.meta['playwright'] = True
        request.meta['playwright_include_page'] = True
        return request
    rules = (
        Rule(LinkExtractor(canonicalize=True, allow_domains=allowed_domains), callback='parse', follow=False, process_request=set_playwright_meta),
    )
    start_urls = json.loads(os.getenv('SPIDER_START_URLS'))
    max_depth = int(os.getenv('SPIDER_MAX_DEPTH'))

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.link_set = set()
        self.retry_dict = dict()

    def start_requests(self):
        for url in SiteScraperSpider.start_urls:
            request = scrapy.Request(url, meta=dict(errback=self.errback))
            request = SiteScraperSpider.set_playwright_meta(request, None)
            yield request

    async def parse(self, response, **kwargs):
        page = response.meta['playwright_page']
        # ...

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
        print(f'Error parsing page {page.url} - {failure}')
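
A more defensive version of parse() along these lines only turns the crash into a warning, so the page is still missing for those first responses (just a sketch of the guard, not a fix):

    async def parse(self, response, **kwargs):
        # The page is only present when the request was downloaded by the
        # scrapy-playwright handler with playwright_include_page=True.
        page = response.meta.get('playwright_page')
        if page is None:
            self.logger.warning('No playwright_page in meta for %s', response.url)
            return
        try:
            title = await page.title()
            self.logger.info('parsed %s (%s)', response.url, title)
        finally:
            # Close pages obtained via playwright_include_page, otherwise they
            # accumulate and keep the browser context busy.
            await page.close()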

### settings.py:

import os
import sys
from importlib import import_module
from pathlib import Path

ADDONS = {}

AJAXCRAWL_ENABLED = False

ASYNCIO_EVENT_LOOP = None

AUTOTHROTTLE_ENABLED = False
AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

BOT_NAME = 'scraper'
SPIDER_MODULES = ['scraper']

CLOSESPIDER_TIMEOUT = 0
CLOSESPIDER_PAGECOUNT = 0
CLOSESPIDER_ITEMCOUNT = 0
CLOSESPIDER_ERRORCOUNT = 0

COMMANDS_MODULE = ""

COMPRESSION_ENABLED = True

CONCURRENT_ITEMS = 100

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0

COOKIES_ENABLED = True
COOKIES_DEBUG = False

DEFAULT_ITEM_CLASS = "scrapy.item.Item"

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

DEPTH_LIMIT = 0
DEPTH_STATS_VERBOSE = False
DEPTH_PRIORITY = 0

DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
DNS_RESOLVER = "scrapy.resolver.CachingThreadedResolver"
DNS_TIMEOUT = 60

DOWNLOAD_DELAY = 0

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
DOWNLOAD_HANDLERS_BASE = {
    "data": "scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler",
    "file": "scrapy.core.downloader.handlers.file.FileDownloadHandler",
    "http": "scrapy.core.downloader.handlers.http.HTTPDownloadHandler",
    "https": "scrapy.core.downloader.handlers.http.HTTPDownloadHandler",
    "s3": "scrapy.core.downloader.handlers.s3.S3DownloadHandler",
    "ftp": "scrapy.core.downloader.handlers.ftp.FTPDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = (
    30 * 1000
)

PLAYWRIGHT_BROWSER_TYPE = os.getenv('SPIDER_BROWSER', 'webkit')
PLAYWRIGHT_CONTEXTS = {
    'IPHONE_12_MINI': {
        'is_mobile': True,
        'has_touch': True,
        'screen': {
            'width': 375,
            'height': 812
        },
        'viewport': {
            'width': 375,
            'height': 629
        },
        'device_scale_factor': 3
    },
    'And': {
        'is_mobile': True,
        'has_touch': True,
        'screen': {
            'width': 375,
            'height': 812
        },
        'viewport': {
            'width': 375,
            'height': 629
        },
        'device_scale_factor': 3
    },
}

DOWNLOAD_TIMEOUT = 180  # 3mins

DOWNLOAD_MAXSIZE = 1024 * 1024 * 1024  # 1024m
DOWNLOAD_WARNSIZE = 32 * 1024 * 1024  # 32m

DOWNLOAD_FAIL_ON_DATALOSS = True

DOWNLOADER = "scrapy.core.downloader.Downloader"

DOWNLOADER_HTTPCLIENTFACTORY = (
    "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
)
DOWNLOADER_CLIENTCONTEXTFACTORY = (
    "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
)
DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT"
# Use highest TLS/SSL protocol version supported by the platform, also allowing negotiation:
DOWNLOADER_CLIENT_TLS_METHOD = "TLS"
DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = False

DOWNLOADER_MIDDLEWARES = {}

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
    "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
    "scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
    "scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
    "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware": 560,
    "scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": 580,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600,
    "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
    "scrapy.downloadermiddlewares.stats.DownloaderStats": 850,
    "scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware": 900,
    # Downloader side
}

DOWNLOADER_STATS = True

DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

EDITOR = "vi"
if sys.platform == "win32":
    EDITOR = "%s -m idlelib.idle"

EXTENSIONS = {}

EXTENSIONS_BASE = {
    "scrapy.extensions.corestats.CoreStats": 0,
    "scrapy.extensions.telnet.TelnetConsole": 0,
    "scrapy.extensions.memusage.MemoryUsage": 0,
    "scrapy.extensions.memdebug.MemoryDebugger": 0,
    "scrapy.extensions.closespider.CloseSpider": 0,
    "scrapy.extensions.feedexport.FeedExporter": 0,
    "scrapy.extensions.logstats.LogStats": 0,
    "scrapy.extensions.spiderstate.SpiderState": 0,
    "scrapy.extensions.throttle.AutoThrottle": 0,
}

FEED_TEMPDIR = None
FEEDS = {}
FEED_URI_PARAMS = None  # a function to extend uri arguments
FEED_STORE_EMPTY = True
FEED_EXPORT_ENCODING = None
FEED_EXPORT_FIELDS = None
FEED_STORAGES = {}
FEED_STORAGES_BASE = {
    "": "scrapy.extensions.feedexport.FileFeedStorage",
    "file": "scrapy.extensions.feedexport.FileFeedStorage",
    "ftp": "scrapy.extensions.feedexport.FTPFeedStorage",
    "gs": "scrapy.extensions.feedexport.GCSFeedStorage",
    "s3": "scrapy.extensions.feedexport.S3FeedStorage",
    "stdout": "scrapy.extensions.feedexport.StdoutFeedStorage",
}
FEED_EXPORT_BATCH_ITEM_COUNT = 0
FEED_EXPORTERS = {}
FEED_EXPORTERS_BASE = {
    "json": "scrapy.exporters.JsonItemExporter",
    "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
    "jsonl": "scrapy.exporters.JsonLinesItemExporter",
    "jl": "scrapy.exporters.JsonLinesItemExporter",
    "csv": "scrapy.exporters.CsvItemExporter",
    "xml": "scrapy.exporters.XmlItemExporter",
    "marshal": "scrapy.exporters.MarshalItemExporter",
    "pickle": "scrapy.exporters.PickleItemExporter",
}
FEED_EXPORT_INDENT = 0

FEED_STORAGE_FTP_ACTIVE = False
FEED_STORAGE_GCS_ACL = ""
FEED_STORAGE_S3_ACL = ""

FILES_STORE_S3_ACL = "private"
FILES_STORE_GCS_ACL = ""

FTP_USER = "anonymous"
FTP_PASSWORD = "guest"  # nosec
FTP_PASSIVE_MODE = True

GCS_PROJECT_ID = None

HTTPCACHE_ENABLED = False
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_MISSING = False
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_ALWAYS_STORE = False
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_IGNORE_SCHEMES = ["file"]
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = []
HTTPCACHE_DBM_MODULE = "dbm"
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
HTTPCACHE_GZIP = False

HTTPPROXY_ENABLED = True
HTTPPROXY_AUTH_ENCODING = "latin-1"

IMAGES_STORE_S3_ACL = "private"
IMAGES_STORE_GCS_ACL = ""

ITEM_PROCESSOR = "scrapy.pipelines.ItemPipelineManager"
ITEM_PIPELINES = {
   'spiders.scraper.scraper_data_pipeline.ScraperDataPipeline': 300,
}
ITEM_PIPELINES_BASE = {}

JOBDIR = None

LOG_ENABLED = True
LOG_ENCODING = "utf-8"
LOG_FORMATTER = "scrapy.logformatter.LogFormatter"
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
LOG_DATEFORMAT = "%Y-%m-%d %H:%M:%S"
LOG_STDOUT = False
LOG_LEVEL = "DEBUG"
LOG_FILE = None
LOG_FILE_APPEND = True
LOG_SHORT_NAMES = False

SCHEDULER_DEBUG = False

LOGSTATS_INTERVAL = 60.0

MAIL_HOST = "localhost"
MAIL_PORT = 25
MAIL_FROM = "scrapy@localhost"
MAIL_PASS = None
MAIL_USER = None

MEMDEBUG_ENABLED = False  # enable memory debugging
MEMDEBUG_NOTIFY = []  # send memory debugging report by mail at engine shutdown

MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 0
MEMUSAGE_NOTIFY_MAIL = []
MEMUSAGE_WARNING_MB = 0

METAREFRESH_ENABLED = True
METAREFRESH_IGNORE_TAGS = ["noscript"]
METAREFRESH_MAXDELAY = 100

NEWSPIDER_MODULE = ""

PERIODIC_LOG_DELTA = None
PERIODIC_LOG_STATS = None
PERIODIC_LOG_TIMING_ENABLED = False

RANDOMIZE_DOWNLOAD_DELAY = True

REACTOR_THREADPOOL_MAXSIZE = 10

REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 20  # uses Firefox default setting
REDIRECT_PRIORITY_ADJUST = +2

REFERER_ENABLED = True
REFERRER_POLICY = "scrapy.spidermiddlewares.referer.DefaultReferrerPolicy"

REQUEST_FINGERPRINTER_CLASS = "scrapy.utils.request.RequestFingerprinter"
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"

RETRY_ENABLED = True
RETRY_TIMES = 2  # initial response + 2 retries = 3 requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
RETRY_PRIORITY_ADJUST = -1
RETRY_EXCEPTIONS = [
    "twisted.internet.defer.TimeoutError",
    "twisted.internet.error.TimeoutError",
    "twisted.internet.error.DNSLookupError",
    "twisted.internet.error.ConnectionRefusedError",
    "twisted.internet.error.ConnectionDone",
    "twisted.internet.error.ConnectError",
    "twisted.internet.error.ConnectionLost",
    "twisted.internet.error.TCPTimedOutError",
    "twisted.web.client.ResponseFailed",
    # OSError is raised by the HttpCompression middleware when trying to
    # decompress an empty response
    OSError,
    "scrapy.core.downloader.handlers.http11.TunnelError",
]

ROBOTSTXT_OBEY = False
ROBOTSTXT_PARSER = "scrapy.robotstxt.ProtegoRobotParser"
ROBOTSTXT_USER_AGENT = None

SCHEDULER = "scrapy.core.scheduler.Scheduler"
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleLifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.LifoMemoryQueue"
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.ScrapyPriorityQueue"

SCRAPER_SLOT_MAX_ACTIVE_SIZE = 5000000

SPIDER_LOADER_CLASS = "scrapy.spiderloader.SpiderLoader"
SPIDER_LOADER_WARN_ONLY = False

SPIDER_MIDDLEWARES = {}

SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
    "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
    # Spider side
}

SPIDER_MODULES = []

STATS_CLASS = "scrapy.statscollectors.MemoryStatsCollector"
STATS_DUMP = True

STATSMAILER_RCPTS = []

TEMPLATES_DIR = str((Path(__file__).parent / ".." / "templates").resolve())

URLLENGTH_LIMIT = 2083

USER_AGENT = f'Scrapy/{import_module("scrapy").__version__} (+https://scrapy.org)'

TELNETCONSOLE_ENABLED = False

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

SPIDER_CONTRACTS = {}
SPIDER_CONTRACTS_BASE = {
    "scrapy.contracts.default.UrlContract": 1,
    "scrapy.contracts.default.CallbackKeywordArgumentsContract": 1,
    "scrapy.contracts.default.ReturnsContract": 2,
    "scrapy.contracts.default.ScrapesContract": 3,
}
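
For readability, the scrapy-playwright-related overrides in the file above boil down to the following; the remaining entries are mostly stock Scrapy defaults:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
PLAYWRIGHT_BROWSER_TYPE = os.getenv('SPIDER_BROWSER', 'webkit')  # 'webkit' by default
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000
PLAYWRIGHT_CONTEXTS = {...}  # the two mobile contexts defined above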

### spider_runner.py

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from site_scraper_spider import SiteScraperSpider

c = CrawlerProcess(get_project_settings())
c.crawl(SiteScraperSpider)
c.start()

elacuesta commented 2 months ago

I cannot reproduce, the spider works just fine for me:

(...)
2024-09-10 15:48:05 [scrapy.core.engine] INFO: Spider opened
2024-09-10 15:48:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Starting download handler
2024-09-10 15:48:05 [scrapy-playwright] INFO: Starting download handler
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching 2 startup context(s)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching browser webkit
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching 2 startup context(s)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Launching browser webkit
2024-09-10 15:48:05 [scrapy-playwright] INFO: Browser webkit launched
2024-09-10 15:48:05 [scrapy-playwright] INFO: Browser webkit launched
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'And' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Startup context(s) launched
2024-09-10 15:48:05 [scrapy-playwright] DEBUG: Browser context started: 'And' (persistent=False, remote=False)
2024-09-10 15:48:05 [scrapy-playwright] INFO: Startup context(s) launched
2024-09-10 15:48:10 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-10 15:48:10 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-10 15:48:10 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://example.org/> (resource type: document)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://example.org/>
2024-09-10 15:48:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None) ['playwright']
response: https://example.org/
2024-09-10 15:48:11 [scrapy.core.engine] INFO: Closing spider (finished)
2024-09-10 15:48:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 212,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1602,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 5.874917,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 9, 10, 18, 48, 11, 412118, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 12,
 'log_count/INFO': 18,
 'memusage/max': 70029312,
 'memusage/startup': 70029312,
 'playwright/browser_count': 2,
 'playwright/context_count': 5,
 'playwright/context_count/max_concurrent': 3,
 'playwright/context_count/persistent/False': 5,
 'playwright/context_count/remote/False': 5,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 1,
 'playwright/request_count/method/GET': 1,
 'playwright/request_count/navigation': 1,
 'playwright/request_count/resource_type/document': 1,
 'playwright/response_count': 1,
 'playwright/response_count/method/GET': 1,
 'playwright/response_count/resource_type/document': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 9, 10, 18, 48, 5, 537201, tzinfo=datetime.timezone.utc)}
2024-09-10 15:48:11 [scrapy.core.engine] INFO: Spider closed (finished)
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing download handler
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'And' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing browser
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser disconnected
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing download handler
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'IPHONE_12_MINI' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'And' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2024-09-10 15:48:11 [scrapy-playwright] INFO: Closing browser
2024-09-10 15:48:11 [scrapy-playwright] DEBUG: Browser disconnected
$ scrapy version -v                   
2024-09-10 15:50:22 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scraper)
2024-09-10 15:50:22 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.1 4 Jun 2024), cryptography 43.0.0, Platform Linux-6.5.0-45-generic-x86_64-with-glibc2.35
Scrapy       : 2.11.2
lxml         : 5.2.2.0
libxml2      : 2.12.6
cssselect    : 1.2.0
parsel       : 1.9.1
w3lib        : 2.2.1
Twisted      : 24.3.0
Python       : 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
pyOpenSSL    : 24.2.1 (OpenSSL 3.3.1 4 Jun 2024)
cryptography : 43.0.0
Platform     : Linux-6.5.0-45-generic-x86_64-with-glibc2.35

$ pip freeze | grep playwright
playwright==1.46.0
scrapy-playwright==0.0.41

The provided example is not self-contained at all; I had to make several adjustments to get it to run (dead code, irrelevant settings, missing item classes, pipelines and environment variables, etc.).
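
For reference, the adjustments amount to something like this stripped-down, self-contained spider (hard-coded start URL instead of environment variables, no rules, items or pipelines; the scrapy-playwright settings from your settings.py stay in place):

import scrapy


class MinimalSpider(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
            errback=self.errback,
        )

    async def parse(self, response, **kwargs):
        page = response.meta["playwright_page"]
        print(f"response: {response.url}")
        await page.close()

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()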

rubmz commented 1 week ago

Could it be that, because the spider is stopped externally by the spawner (or debugger), it leaves something running in the background? For some reason it is very reproducible with my configuration... After waiting for about 5 minutes I can re-run the spider without a problem.
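
One way to check whether a previous run leaves browser processes behind would be something like this between runs (just a guess at the culprit; psutil is an extra dependency, and matching on "playwright" assumes the browsers run from the default ms-playwright install path):

import psutil

# list processes whose command line points into the Playwright browser install
for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "playwright" in cmdline.lower():
        print(proc.info["pid"], proc.info["name"])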