scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

ReferenceError: Can't find variable: IntersectionObserver #1169

Open brett--anderson opened 1 year ago


Problem

Splash throws the error "ReferenceError: Can't find variable: IntersectionObserver" when loading certain websites. From what I can tell, this error occurs in older browsers, such as earlier versions of Safari, so I suspect it is related to the version of WebKit that Splash uses under the hood. Some Stack Overflow posts (from 2019) state that even the most recent version of Safari at the time could still throw this error, because the functionality was deemed experimental and older devices disable such features. Is there a way to tweak the WebKit configuration that Splash uses? I've seen this on multiple high-traffic sites, and IntersectionObserver seems like core functionality that other browsers have supported for a while now. I raised this issue with Zyte, and their suggestion was to use Playwright or Puppeteer instead. However, I'm heavily invested in a system built around Splash and don't have the time it would take to port everything over.

Steps to Reproduce

This is the only code I'm running, in a fresh notebook from the Splash Jupyter notebook Docker image that Zyte provides. The image is set up successfully on macOS with XQuartz for the Qt WebKit browser and inspection tool. To set up the notebook with Splash:

brew install --cask xquartz
IP=$(/usr/sbin/ipconfig getifaddr en0) 
echo $IP 
/opt/X11/bin/xhost + "$IP"
docker run   -e QT_DEBUG_PLUGINS=1 \
             -e DISPLAY="$IP":0 \
             -v /tmp/.X11-unix:/tmp/.X11-unix \
             -v $XAUTHORITY:$XAUTHORITY \
             -e XAUTHORITY=$XAUTHORITY \
             -p 8888:8888 \
             -it scrapinghub/splash-jupyter --disable-xvfb

Then from a new Splash notebook instance:

splash:on_request(function (request)
      request:set_header('X-Crawlera-Cookies', 'disable')
      request:set_header('X-Crawlera-Profile', 'desktop')
      request:set_header('X-Crawlera-Timeout', '5000')
      request:set_proxy{
          host = "<proxy endpoint>",
          port = "8010",
          username = "<password>",
          password = ""
      }
end)

splash.private_mode_enabled = false
assert(splash.private_mode_enabled == false)

splash:go("https://byjus.com/question-answer/why-is-air-called-breath-of-life-enumerate-functions-of-air-or-atmosphere/")

After running this, parts of the page don't render. Using the browser inspection tool provided for this Splash browser, I can see the IntersectionObserver error being thrown in the console, followed by cascading errors. I've now observed this on multiple sites.
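
A possible stopgap, untested against the site above, is to define a stand-in for the missing API before the page's own scripts run. The sketch below is an assumption-laden illustration, not a real polyfill: `installIntersectionObserverStub` is a hypothetical helper name, the stub does no geometry at all, and it simply reports every observed element as fully visible on the next tick (which may be enough for lazy-loading code, but would be wrong for anything that depends on real visibility). It is written in ES5 syntax in case the engine also lacks newer language features.

```javascript
// Hypothetical stopgap: a minimal IntersectionObserver stand-in for engines
// that lack the API. It performs no real intersection checks; every observed
// element is reported as fully visible on the next timer tick.
function installIntersectionObserverStub(root) {
  // Feature detection: leave a native implementation untouched.
  if (typeof root.IntersectionObserver === "function") {
    return false;
  }

  function StubIntersectionObserver(callback, options) {
    this._callback = callback;
    this._targets = [];
    this.root = (options && options.root) || null;
    this.rootMargin = (options && options.rootMargin) || "0px";
    this.thresholds = [0];
  }

  StubIntersectionObserver.prototype.observe = function (target) {
    var self = this;
    if (self._targets.indexOf(target) !== -1) {
      return; // already observed
    }
    self._targets.push(target);
    // Deliver the entry asynchronously, as the real API does.
    setTimeout(function () {
      if (self._targets.indexOf(target) === -1) {
        return; // unobserved before the callback fired
      }
      self._callback(
        [{ target: target, isIntersecting: true, intersectionRatio: 1, time: Date.now() }],
        self
      );
    }, 0);
  };

  StubIntersectionObserver.prototype.unobserve = function (target) {
    var index = this._targets.indexOf(target);
    if (index !== -1) {
      this._targets.splice(index, 1);
    }
  };

  StubIntersectionObserver.prototype.disconnect = function () {
    this._targets = [];
  };

  StubIntersectionObserver.prototype.takeRecords = function () {
    return []; // the stub never queues records
  };

  root.IntersectionObserver = StubIntersectionObserver;
  return true;
}
```

In a page this would be called as `installIntersectionObserverStub(window)`. With Splash, one plausible way to inject it is `splash:autoload` with the source above plus that call, registered before `splash:go`, so it runs ahead of the page's scripts on each load. Whether that is sufficient for the site above is unverified.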