scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Splash fails to render a specific page #1167

Open joaodjvitor opened 1 year ago

joaodjvitor commented 1 year ago

My issue concerns querying the URL below with Splash version 3.5:

Even after many hours, it doesn't show any results. Here are some screenshots of the execution.

image

image

felipeabou commented 1 year ago

I'm suffering from the same... I've tried everything, with no luck. Any thoughts?

carlosrjr commented 1 year ago

Same problem...

benjad commented 1 year ago

I'm experiencing similar issues. As a test, I rendered https://www.whatismybrowser.com/detect/is-javascript-enabled and the result says that JavaScript isn't enabled.

joaodjvitor commented 1 year ago

I've been trying to resolve this issue for two weeks now without any progress. I tried changing Splash versions and went through the site prompts to check whether I had made a mistake somewhere, but nothing worked. With Selenium the page renders correctly, but Selenium is very slow, so it isn't a viable option.

benreece commented 1 year ago

I have the same issue. I think I've narrowed it down to this script: https://www.gstatic.com/recaptcha/releases/Km9gKuG06He-isPsP6saG8cn/recaptcha__en.js It seems to hang on many (most? all?) pages when that script is included.

If I filter out that script, it seems to run fine. I don't know what it is about that script that Splash doesn't like, but it just hangs.

I've seen the same behavior with one other script, as well -- what appears to be a WordPress extension: /wp-content/plugins/ninja-forms/assets/js/min/front-end.js?ver=3.6.14

It exhibits the same behavior: complete hang, but it works fine when I filter it out.

Tasty213 commented 1 year ago

These are likely captcha check files, and if they can't be loaded the site presumably just continues anyway. We could add a Splash filter (similar to the easyprivacy one) that blocks common captcha-checking URLs.
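
A minimal sketch of that idea inside a Splash Lua script, assuming a hand-picked blocklist. The two fragments are the scripts reported earlier in this thread; `BLOCKED_FRAGMENTS`, `is_blocked`, and `block_listed_requests` are hypothetical names for illustration, not part of Splash. Only `splash:on_request`, `request.url`, and the abort call come from Splash itself.

```lua
-- Hypothetical blocklist: URL fragments of the scripts reported in this thread.
local BLOCKED_FRAGMENTS = {
  "gstatic.com/recaptcha/",
  "ninja-forms",
}

-- Returns true if the URL contains any listed fragment.
-- The fourth argument (true) makes string.find do a plain substring search,
-- so "." and "-" are not interpreted as Lua pattern characters.
local function is_blocked(url)
  for _, fragment in ipairs(BLOCKED_FRAGMENTS) do
    if string.find(url, fragment, 1, true) then
      return true
    end
  end
  return false
end

-- Registers the check on a Splash instance; call it before splash:go().
local function block_listed_requests(splash)
  splash:on_request(function(request)
    if is_blocked(request.url) then
      request.abort()
    end
  end)
end
```

These helpers would sit above `main` in the script, with `block_listed_requests(splash)` called before the page is loaded. Keeping the list in one table makes it easy to add further captcha URLs as they turn up.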

sejteN-bot commented 1 year ago

How do you filter it out properly? I've tried a few things in a script passed as lua_source, but none of them worked.

benreece commented 1 year ago

I just used the request filters built into Splash. These 2 lines are fairly broad, but worked for my purposes to block the 2 scripts I mentioned:

||gstatic.com/recaptcha/
ninja-forms

ercross commented 1 year ago

As @benreece mentioned, most likely some script is stopping Splash from loading the page. In my case it was the reCAPTCHA script, so I used the following Lua script:

splash:on_request(function(request)
    if string.find(request.url, "recaptcha") ~= nil then
        request.abort()
    end
end)
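
For context, here is a sketch of where that callback could sit in a complete script sent as lua_source. The one-second wait and returning the page HTML are assumptions for illustration, not something the thread specifies. The handler is registered before `splash:go` so that it applies to the requests the page makes while loading.

```lua
function main(splash, args)
  -- Register the handler before navigating so it sees the page's requests.
  splash:on_request(function(request)
    -- Plain substring search (fourth argument true), no Lua patterns.
    if string.find(request.url, "recaptcha", 1, true) then
      request.abort()
    end
  end)

  -- args.url is the `url` argument passed along with the script.
  assert(splash:go(args.url))
  -- Assumed settle time; adjust for the target page.
  assert(splash:wait(1.0))
  return splash:html()
end
```

Matching on the bare substring "recaptcha" is deliberately broad, like the filter rules above; a narrower fragment may be preferable if a page needs other resources whose URLs happen to contain that word.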

alosultan commented 1 year ago

reCAPTCHA breaks Splash when using the WebKit engine. Take a look at issue #1164.