scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Chromium engine not working with lua #1186

Open kolumbyt opened 7 months ago

kolumbyt commented 7 months ago

Hello, I'm trying to build a scraping bot for a site that uses JavaScript. I have about 20 URLs from the site and would like to scale to hundreds. The URLs need to be scraped quite often, so I tried using a Lua script to make the waiting times "dynamic". With the default WebKit engine, the HTML output is just text saying that the site doesn't support this browser, which is why I'm using the Chromium engine. Without the Lua script, scraping produced output items only with the Chromium engine, but it did work. Once I added the Lua script, I got errors with the Chromium engine. With WebKit the script executed without errors but produced no output items because, as I said, the site doesn't support it. This is the start request I'm using with the Lua script:

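from scrapy_splash import SplashRequest

def start_requests(self):
    lua_script = """
    function main(splash, args)
        -- Reconstructed (assumption): the page-load call is missing from the post.
        assert(splash:go(args.url))
        local try_count = 0
        local max_tries = 10
        while try_count < max_tries do
            local match_rows = splash:select_all('.o-matchRow')
            if #match_rows > 0 then
                break  -- reconstructed (assumption)
            end
            splash:wait(0.5)  -- reconstructed (assumption): the original delay is unknown
            try_count = try_count + 1
        end
        return {html = splash:html()}
    end
    """

    # Chrome user agent
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'

    for url in self.start_urls:
        # url, callback, and endpoint are reconstructed (assumption); the
        # traceback below shows the request went to /execute.
        yield SplashRequest(
            url,
            self.parse,
            endpoint='execute',
            args={
                'lua_source': lua_script,
                'user_agent': user_agent,
                'engine': 'chromium',
            },
        )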
It's something simple I wanted to test out. Does anyone know what the deal is with Lua and the Chromium engine, or how I can use WebKit when the site doesn't support it? (Btw, sorry for my English, I'm not a native speaker.) These are the errors with the Chromium engine:

2023-12-10 09:49:45 [scrapy.core.scraper] ERROR: Error downloading <GET via http://localhost:8050/execute>
Traceback (most recent call last):
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\", line 1697, in _inlineCallbacks
    result =, result)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\", line 68, in process_response
    method(request=request, response=response, spider=spider)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\", line 412, in process_response
    response = self._change_response_class(request, response)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\", line 433, in _change_response_class
    response = response.replace(cls=respcls, request=request)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\http\response\", line 125, in replace
    return cls(*args, **kwargs)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\", line 120, in __init__
    self._load_from_json()
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\", line 174, in _load_from_json
    error =['info']['error']
TypeError: string indices must be integers, not 'str'
2023-12-10 09:49:45 [scrapy.core.engine] INFO: Closing spider (finished)
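For readers following the traceback: the final `TypeError` is what Python raises when a string is indexed with a string key. One plausible reading is that the decoded error body held a plain string where the `['info']['error']` lookup expected a nested object. The sketch below reproduces that failure with made-up payloads and shows a more defensive lookup. The function names and payload shapes are hypothetical illustrations, not scrapy-splash code or actual Splash output.

```python
def naive_error(payload):
    # Mirrors the lookup shape seen in the traceback: ...['info']['error'].
    return payload["info"]["error"]


def describe_splash_error(payload):
    """Return a readable message from a decoded error payload.

    Hypothetical helper: tolerates 'info' being a dict, a string, or absent.
    """
    if not isinstance(payload, dict):
        return str(payload)
    info = payload.get("info")
    if isinstance(info, dict):
        return str(info.get("error", info))
    if info is not None:
        return str(info)
    return str(payload.get("description", payload))


if __name__ == "__main__":
    # Assumed payload shape: 'info' is a string instead of a dict.
    string_info = {"info": "message as plain text"}
    try:
        naive_error(string_info)
    except TypeError as exc:
        print("naive lookup failed:", exc)
    print("defensive lookup:", describe_splash_error(string_info))
```

If this reading is right, the traceback hides an underlying Splash error message rather than being the root cause itself, so fetching the raw `/execute` response body directly would likely be more informative.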

I've been trying to set this up correctly for the past few days, but I'm not getting anywhere. It seemed that I should build a custom image for Splash, so I did, but it doesn't really work. The element I'm checking for is on the page; it worked before without the Lua script. Changing the user agent didn't help either, so it seems I need the Chromium engine. The data handling should be correct too, because it produced output items before. What should I try next? The issue seems to be just that Lua doesn't work with the Chromium engine. Are there other ways to make the "dynamic" waits? Or can I use WebKit on a site that doesn't support it?
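The "dynamic" wait described here is a bounded polling loop: check for the element, stop as soon as it appears, otherwise pause and retry up to a maximum number of tries. As a minimal sketch of that logic outside Splash, the Python helper below mirrors the `try_count` / `max_tries` loop from the Lua script. The name `poll_until` and its parameters are hypothetical, and the `check` callable stands in for whatever actually inspects the page.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    check: Callable[[], Optional[T]],
    max_tries: int = 10,
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call check() up to max_tries times, pausing between failed attempts.

    Returns the first truthy result, or None if every attempt fails.
    """
    for attempt in range(max_tries):
        result = check()
        if result:
            return result
        # Skip the pause after the final attempt; there is nothing left to wait for.
        if attempt < max_tries - 1:
            sleep(delay)
    return None


if __name__ == "__main__":
    attempts = []

    def fake_check():
        # Stand-in for "are there any .o-matchRow elements yet?"
        attempts.append(1)
        return ["row"] if len(attempts) >= 3 else None

    print(poll_until(fake_check, max_tries=10, delay=0.01))
```

Passing `sleep` as a parameter keeps the helper testable without real delays. The cap on tries is what keeps a page that never renders the element from blocking the crawl indefinitely.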