scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License
4.04k stars 507 forks

Bug? splash 3.0+ instances locking up on certain SSL requests. Does not happen on 2.3.3 #1164

Open minispeck opened 1 year ago

minispeck commented 1 year ago

My issue happens on splash 3.0 and 3.5 but NOT on 2.3.3. i am currently running prod on 2.3.3 as a workaround and would like a permanent solution to run 3.x

i have been running splash + HAProxy set up by aquarium for years without this issue. the sites in question rendered fine until the day before yesterday

here is a url that consistently produces the issue, even when simply using render.html from [host]:8050: https://www.schooljobs.com/careers/kirkwoodcc/jobs/3776251/adjunct-dental-hygiene

happens with aquarium default configuration

this happens in both dev (mac OS 15+) and prod (ubuntu) environments, and i did try wiping all my containers and starting over with aquarium. splash works fine for other urls, but the above and some others kill it. every time, it locks up the entire docker container (immediately) and the HAProxy stats page shows a Layer 7 timeout (splash 3.5) or a Layer 4 timeout (3.0).

[two images not preserved]

i cannot attach to a splash docker instance that hangs in this way - if i try, my terminal hangs.

thanks to docker-compose with aquarium i can watch splash output live. on 3.5 i often don't even get to see output of the request starting. sometimes i just see the request and then no more output as the instance hangs

[image not preserved]

on 3.0 only i get the following info

[image not preserved]

i have googled the network issue and found a bunch of issues right here in this repo with no clear answers about what is going on.

happy to be very responsive. please let me know if more info is needed. I want to get back to splash 3.x

rodrigosfelix commented 1 year ago

Same problem

Gallaecio commented 1 year ago

Since you say the issue started happening recently, without Splash itself changing, and assuming it is not something that has changed on the target websites, it means something other than Splash itself changed on your end. I assume some newer version of a dependency is at fault here.

My best guess would be Twisted, as Splash 2.3.3 caps it at 16.3.0, while 3.0+ does not cap it, and there have been recent releases. It would be great if someone could check whether pinning Twisted at 16.3.0 helps. If it does, we could then find the specific version where the issue starts happening, and that would help identify the cause. I would not rule out that the problem is not Twisted itself, but some indirect dependency that Splash gets through its dependency on Twisted.

minispeck commented 1 year ago

@Gallaecio i'll give it a try today and report back

edit: day got away from me, shooting for monday

minispeck commented 1 year ago

@Gallaecio forcing twisted to 16.3.0 in a splash 3.5 docker container did not resolve the issue. the symptoms are the same.

for clarity in case i did something wrong, i did

docker exec -ti container_name /bin/bash

and once connected, ran

pip install twisted==16.3.0

afterward i ran pip freeze and confirmed the twisted version was indeed 16.3.0

then i ran my scraper that is known to cause the issue and observed the same symptoms

Gallaecio commented 1 year ago

Did running pip install twisted==16.3.0 output any warning about existing dependencies being incompatible?

minispeck commented 1 year ago

@Gallaecio one more piece of context, for these tests on my dev environment i'm running one splash 3.5 instance on twisted 16.3.0 and two on default (twisted 19 something)

although i did get the compatibility warning, the instance using twisted 16.3.0 works fine with sites that don't cause this issue, and exhibits the exact same failure behavior with the site that does cause the issue.

edit: i noticed my (working) splash 2.3.3 on prod is actually running twisted 16.1.1 - so i tried that version with splash 3.5 and observed the same issue. so i do not think the twisted version is the problem

Gallaecio commented 1 year ago

> i did get the compatibility warning

Which packages was it about? It is possible the issue is not Twisted, but an indirect dependency.

If the issue is neither Twisted nor an indirect dependency, and it is actually an upstream change that is incompatible with newer Splash (i.e. with the WebKit version upgrade Splash 3.0 got), fixing the issue may be rather hard, and unlikely to be done any time soon, if ever.

minispeck commented 1 year ago

@Gallaecio the only warning was about splash incompatibility

[image not preserved]

Gallaecio commented 1 year ago

Then I don’t think Twisted is the issue :(

minispeck commented 1 year ago

@Gallaecio are there any more verbose logs i can produce for splash somehow, or from some directory? there is a splash verbosity setting that defaults to 1 during aquarium setup. I will try messing with that along with anything else you suggest

Gallaecio commented 1 year ago

I am not familiar enough with Splash to help much further.

> and assuming it is not something that has changed on the target websites

I might have been wrong here, given dependencies are not an issue. Maybe those websites somehow stopped working with the version of WebKit that Splash 3.x uses.

minispeck commented 1 year ago

> I might have been wrong here, given dependencies are not an issue. Maybe those websites somehow stopped working with the version of WebKit that Splash 3.x uses.

this might be true, but splash silently locking up and dying is not good behavior in this case

minispeck commented 1 year ago

bump. any ideas, anyone?

gtsupport-com commented 1 year ago

In October, Recaptcha introduced code that breaks Splash 3.X (confirmed with 3.2 and 3.5). If you only need to read a site, adding an on_request() hook at the beginning of your script that aborts any request whose URL contains "recaptcha/releases" will prevent it from locking up.

I'm not aware of any other workarounds or any root-cause information as to what that JavaScript is doing that is breaking Splash.

minispeck commented 1 year ago

@gtsupport-com thank you for the answer - and my apologies, i'm using the built in splash render.html - are you talking about the lua script? I never did learn lua, could you spell this out for me?

thanks

gtsupport-com commented 1 year ago

@minispeck The method I've used is the splash:on_request callback (splash-on-request in the Splash scripting reference).

All of my experience has been via /execute and lua scripts, so I'm not familiar with the options for the built in renderers. My first guess would be to place your own proxy in front of your splash instance and block those requests via that proxy. I don't see an option in the splash documentation to auto-blacklist certain urls; if you're dependent on render.html I don't have an easy answer for you.

minispeck commented 1 year ago

@gtsupport-com oh sorry i meant, i'm happy to move to execute endpoint, just 0 lua knowledge, so assuming i start with a copy of the default script, could you toss me some sample code for on_request to kill those requests?

gtsupport-com commented 1 year ago

This will grab that page - delete the "args.url= ..." line if you are passing the URL in externally. Last line returns both a PNG and HTML, replace with "return splash:html()" if you only need the HTML back for data extraction.

There are a large number of examples on the Splash documentation site, it would be worth your while to dig into the tutorial so you can troubleshoot/tweak if necessary.

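function main(splash, args)
  -- hard-coded target; delete this line if the URL is passed in externally
  args.url = [[https://www.schooljobs.com/careers/kirkwoodcc/jobs/3776251/adjunct-dental-hygiene]]
  -- abort any request whose URL contains "recaptcha/releases" (plain substring match, no Lua patterns)
  splash:on_request(function(request)
    if string.find(request.url, "recaptcha/releases", 1, true) ~= nil then
      request.abort()
    end
  end)
  splash:go{args.url}
  splash:wait(2)
  -- return both a screenshot and the rendered HTML
  return {png=splash:png(), html=splash:html()}
end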
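
For anyone moving from render.html to /execute, here is a minimal Python sketch of how the script above could be sent to Splash. It assumes Splash listens on localhost:8050 (the thread only says [host]:8050) and that the URL is passed in as an argument, so the hard-coded args.url line is dropped and only the HTML is returned. The helper names build_execute_request and fetch_html are made up for illustration.

```python
import json
import urllib.request

# Assumption: Splash is reachable here; adjust host/port to your setup.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"

# The Lua script from the comment above, minus the hard-coded URL,
# returning only the HTML.
LUA_SOURCE = """
function main(splash, args)
  splash:on_request(function(request)
    if string.find(request.url, "recaptcha/releases", 1, true) ~= nil then
      request.abort()
    end
  end)
  splash:go{args.url}
  splash:wait(2)
  return splash:html()
end
"""


def build_execute_request(page_url, endpoint=SPLASH_EXECUTE_URL):
    """Build (but do not send) a JSON POST for Splash's /execute endpoint."""
    payload = {"lua_source": LUA_SOURCE, "url": page_url}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_html(page_url, timeout=60):
    """Send the request; the client-side timeout guards against a hung instance."""
    request = build_execute_request(page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_html("https://www.schooljobs.com/careers/kirkwoodcc/jobs/3776251/adjunct-dental-hygiene") should then return the rendered page, provided the on_request hook does prevent the lockup as described.
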
gtsupport-com commented 1 year ago

Note that @benreece also found in #1167 that certain WP plugins, not only Recaptcha, cause this issue

alosultan commented 11 months ago

@minispeck You should set the engine parameter to chromium instead of webkit (the default engine). In this case, Recaptcha will not disrupt Splash 3.X. However, it's important to note that the Splash documentation warns that the chromium engine is currently in the pre-alpha stage and could potentially lead to crashes in Splash.
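
As a rough sketch of what this looks like for render.html, the engine can be passed as a query argument. The snippet below only builds the URL. The localhost:8050 address, the wait value, and the helper name render_html_url are assumptions for illustration, and whether the chromium engine is stable enough for your pages is something to verify yourself.

```python
from urllib.parse import urlencode

# Assumption: Splash is reachable here; adjust host/port to your setup.
SPLASH_BASE = "http://localhost:8050"


def render_html_url(page_url, engine="chromium", wait=2, base=SPLASH_BASE):
    """Build a render.html URL that asks Splash to use the given engine."""
    query = urlencode({"url": page_url, "engine": engine, "wait": wait})
    return f"{base}/render.html?{query}"
```

Passing engine="webkit" to the same helper gives the default-engine URL, which makes it easy to compare the two engines on a page that is known to hang.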

Another issue arises from the fact that the webkit engine does not pass the check for whether JavaScript is enabled or not, which poses a problem for us even with basic websites that perform this verification.

Please take into consideration: @kmike | @immerrr | @Gallaecio

alosultan commented 11 months ago

@minispeck If you insist on using the WebKit engine (it's lightweight and fast, but QtWebKit is awaiting updates - here I want to thank @annulen for his great efforts: thank you very much), you'll need to utilize the filters parameter, as recommended by @gtsupport-com, as a temporary solution.
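
A minimal sketch of the filters approach, under several assumptions: Splash is started with a filters directory (the --filters-path option), filter files use Adblock Plus rule syntax and are named <name>.txt, and a bare substring rule is enough to match the offending requests. The filter name norecaptcha and both helper names are hypothetical, and this has not been tested against the page from this issue.

```python
from pathlib import Path
from urllib.parse import urlencode

FILTER_NAME = "norecaptcha"            # hypothetical filter name
FILTER_RULES = ["recaptcha/releases"]  # substring taken from this thread


def write_filter_file(filters_dir, name=FILTER_NAME, rules=FILTER_RULES):
    """Write <name>.txt with one rule per line into the Splash filters directory."""
    path = Path(filters_dir) / f"{name}.txt"
    path.write_text("\n".join(rules) + "\n", encoding="utf-8")
    return path


def filtered_render_url(page_url, filters=FILTER_NAME, base="http://localhost:8050"):
    """Build a render.html URL that applies the named request filter."""
    return f"{base}/render.html?" + urlencode({"url": page_url, "filters": filters})
```

Splash reads the filter files at startup, so the instance would presumably need a restart after the file is written.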

annulen commented 11 months ago

FYI, you can get an updated version of QtWebKit maintained by @mnutt at https://github.com/movableink/webkit/. It's very close to WebKit's bleeding edge and should have much better compatibility with modern web content (though it's not polished at the moment and can have quite a few rough edges).

alosultan commented 11 months ago

This is great & worth a try. @annulen @mnutt Thank you for your great efforts.