scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration

question about SplashMiddleware #315

Open biaobro opened 6 days ago

biaobro commented 6 days ago

I just noticed different behavior with and without `'scrapy_splash.SplashMiddleware': 725` in `DOWNLOADER_MIDDLEWARES`. Can someone give me some advice?

With the setting enabled, nothing gets crawled and the log shows:

2024-10-20 15:45:00 [scrapy.downloadermiddlewares.offsite] DEBUG: Filtered offsite request to 'localhost': <GET https://www.adamchoi.co.uk/overs/detailed via http://localhost:8050/execute>
2024-10-20 15:45:00 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://www.adamchoi.co.uk/overs/detailed via http://localhost:8050/execute> before it reached the scheduler.

With the setting disabled, I get the HTML source, but none of the JavaScript has been rendered:

2024-10-20 15:39:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adamchoi.co.uk/overs/detailed> (referer: None)
b'<!doctype html>\n<html class="no-js" lang="en">\n\n <head>\n <meta charset="utf-8">\n <title>Football Statistics For Betting</title>\n <meta name="description" content="The best football statistics for popular betting markets | BTTS | Corners | Cards | Booking Points | Over 2.5 Goals | Both Teams To Score | BTTS and Win">\n <meta name="keywords" content="bets prediction betting site football statistics stats btts both teams to score overs corners cards tips booking points team goals">\n <meta name="twitter:card" content="summary_large_image" />\n <meta name="twitter:site" content="https://www.adamchoi.co.uk" />\n <meta name="twitter:title" content="Football Statistics For Betting" />\n <meta name="twitter:description" content="BTTS, Corners, Cards, Booking Points, Overs, Team Goals, BTTS & Win statistics for betting. Many more markets covered across over 50 leagues around the world." />\n <meta name="twitter:image" content="https://www.adamchoi.co.uk/images/og.png?v=1" />\n <meta property="og:title" content="Football Statistics For Betting"/>\n <meta property="og:url" content="https://www.adamchoi.co.uk"/>\n <meta property="og:description" content="BTTS, Corners, Cards, Booking Points, Overs, Team Goals, BTTS & Win statistics for betting. 
Many more markets covered across over 50 leagues around the world."/>\n <meta property="og:image" content="https://www.adamchoi.co.uk/images/og.png?v=1"/>\n <meta property="og:locale" content="en_GB"/>\n <meta property="og:type" content="website"/>\n <meta name="viewport" content="width=device-width">\n\n <base href=\'/\'>\n <link rel="stylesheet" href="dist/css/vendor-bundle-599428b2b3.css">\n <link rel="stylesheet" href="dist/css/app-bundle-17088dbaef.css?v=1">\n\n <script src="dist/js/vendor-bundle-bebd0fdb69.js"></script>\n <script src="dist/js/app-bundle-798a12ba74.js"></script>\n <!-- endbuild -->\n\n <script>\n (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n })(window,document,\'script\',\'//www.google-analytics.com/analytics.js\',\'ga\');\n </script>\n\n <!-- Google Analytics -->\n <script async src="https://www.googletagmanager.com/gtag/js?id=G-8MTGZ91RT2"></script>\n <script>\n window.dataLayer = window.dataLayer || [];\n function gtag(){dataLayer.push(arguments);}\n gtag(\'js\', new Date());\n\n gtag(\'config\', \'G-8MTGZ91RT2\');\n </script>\n\n <!-- Google Tag Manager -->\n <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\n new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\n j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n \'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n })(window,document,\'script\',\'dataLayer\',\'GTM-5GQQMBP\');</script>\n <!-- End Google Tag Manager -->\n\n <!-- Google Ad Manager -->\n <script async src="https://securepubads.g.doubleclick.net/tag/js/gpt.js"></script>\n <script>\n window.googletag = window.googletag || {cmd: []};\n\n googletag.cmd.push(function() {\n googletag.pubads().enableLazyLoad();\n 
googletag.pubads().setCentering(true);\n googletag.pubads().collapseEmptyDivs();\n setInterval(function(){ googletag.pubads().refresh(); }, 30000);\n });\n\n </script>\n\n </head>\n \n <body>\n <!-- Google Tag Manager (noscript) -->\n <noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-5GQQMBP"\n height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>\n <!-- End Google Tag Manager (noscript) -->\n\n <div data-ng-app="adamChoiStatsApp">\n\n <div data-ui-view="rootView">\n\n </div>\n </div>\n\n <script defer src="https://static.cloudflareinsights.com/beacon.min.js/vcd15cbe7772f49c399c6a5babf22c1241717689176015" integrity="sha512-ZpsOmlRQV6y907TI0dKBHq9Md29nnaEIPlkf84rnaERnq6zvWvPUqr2ft8M1aS28oN72PdrCzSjY4U6VaAw1EQ==" data-cf-beacon=\'{"rayId":"8d5759fa98982ab4","version":"2024.10.1","r":1,"serverTiming":{"name":{"cfExtPri":true,"cfL4":true,"cfSpeedBrain":true,"cfCacheStatus":true}},"token":"4a403f83ab324f8d9ddbdcd08ed7ae8d","b":1}\' crossorigin="anonymous"></script>\n</body>\n\n</html>\n'

My spider file:

`import scrapy
from scrapy_splash import SplashRequest


class AdamchoiSpider(scrapy.Spider):
    name = "adamchoi"
    allowed_domains = ["www.adamchoi.co.uk"]
    start_urls = ["https://www.adamchoi.co.uk/overs/detailed"]

    script = '''
        function main(splash, args)
          splash.private_mode_enabled = false
          assert(splash:go(args.url))
          assert(splash:wait(3))
          all_matches = assert(splash:select_all('label.btn.btn-sm.btn-primary'))
          all_matches[2]:mouse_click()
          assert(splash:wait(3))
          splash:set_viewport_full()
          return {
            splash:html(),
            splash:png()
          }
        end
    '''

    def start_requests(self):
        yield SplashRequest(
            url='https://www.adamchoi.co.uk/overs/detailed',
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': self.script}
        )

    def parse(self, response):
        print(response.body)
        rows = response.xpath('//tr')
        for row in rows:
            date = row.xpath('./td[1]/text()').get()
            home_team = row.xpath('./td[2]/text()').get()
            score = row.xpath('./td[3]/text()').get()
            away_team = row.xpath('./td[4]/text()').get()

            yield {
                'date': date,
                'home_team': home_team,
                'score': score,
                'away_team': away_team
            }

`

My settings file:

`SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36' `

Gallaecio commented 5 days ago

Try enabling the middleware but also adding `dont_filter=True` to your requests.