scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Sometimes meta refresh redirection is not performed #517

Open starrify opened 8 years ago

starrify commented 8 years ago

Issue

Sometimes a meta refresh redirection is not performed, even after a 10-second wait.

Affected versions

The issue is confirmed to be reproducible in versions 2.1 and 2.2, and at commit 2082c4b, which is the latest commit at the time of writing.

Steps to reproduce, and other helpful info

Here is a sample page to help reproduce the issue:

[pengyu@GLaDOS-Precision-7510 temp]$ curl http://libstarrify.so/rodata/test_1.html
<html><head><meta http-equiv="REFRESH" content="0;http://httpbin.org/get"></head><body></body></html>

In the test below, the Splash service is provided by a Docker image built from 2082c4b. A fresh container was started and used only for this test.

This is a minimal example that can reproduce the issue. Start with the following Lua script:

function wait_restarting_on_redirects(splash)
  local redirects_remaining = 10
  while redirects_remaining > 0 do
    local ok, reason = splash:wait{
      time = 10,
      cancel_on_redirect = true,
    }
    if reason ~= 'redirect' then
      return ok, reason
    end
    redirects_remaining = redirects_remaining - 1
  end
  return nil, "too_many_redirects"
end

function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  local ok, reason = wait_restarting_on_redirects(splash)
  return {
    url = splash:url(),
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
    ok = ok,
    reason = reason,
  }
end

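Then try rendering http://libstarrify.so/rodata/test_1.html 100 times. Note that the script embedded in the command below differs slightly from the minimal one above: it waits with time=5, routes requests through a proxy, and asserts on the result of the wait: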

for i in {1..100}; do curl "http://localhost:8050/execute?wait=0.5&images=1&expand=1&timeout=600.0&url=http%3A%2F%2Flibstarrify.so%2Frodata%2Ftest_1.html&lua_source=function+wait_restarting_on_redirects%28splash%29%0D%0A++local+redirects_remaining+%3D+10%0D%0A++while+redirects_remaining+%3E+0+do%0D%0A++++local+ok%2C+reason+%3D+splash%3Await%7B%0D%0A++++++time%3D5%2C%0D%0A++++++cancel_on_redirect%3Dtrue%0D%0A++++%7D%0D%0A++++if+reason+~%3D+%27redirect%27+then%0D%0A++++++return+ok%2C+reason%0D%0A++++end%0D%0A++++redirects_remaining+%3D+redirects_remaining+-+1%0D%0A++end%0D%0A++return+nil%2C+%22too_many_redirects%22%0D%0Aend%0D%0A%0D%0Afunction+main%28splash%29%0D%0A++splash%3Aon_request%28function+%28request%29%0D%0A++++request%3Aset_proxy%7B%0D%0A++++++%27proxy.crawlera.com%27%2C%0D%0A++++++%278010%27%2C%0D%0A++++++username%3D%276f8829027633409abb37e065eb6dc92b%27%2C%0D%0A++++++password%3D%27%27%0D%0A++++%7D%0D%0A++end%29%0D%0A++%0D%0A++local+url+%3D+splash.args.url%0D%0A++assert%28splash%3Ago%28url%29%29%0D%0A++assert%28wait_restarting_on_redirects%28splash%29%29%0D%0A++return+%7B%0D%0A++++url+%3D+splash%3Aurl%28%29%2C%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++png+%3D+splash%3Apng%28%29%2C%0D%0A++++har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" >> test_1.jl; done
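The hand-encoded query string above is hard to read and edit. As a minimal sketch, the request URL could instead be built in Python from a readable copy of the script. The helper name `build_execute_url` and its defaults are assumptions for illustration; `url`, `timeout` and `lua_source` are the query parameters already used in the command above (the `wait`, `images` and `expand` parameters are omitted here), and the script is the minimal one from this report.

```python
from urllib.parse import urlencode

# The minimal Lua script from this report, kept readable instead of hand-encoded.
LUA_SOURCE = """
function wait_restarting_on_redirects(splash)
  local redirects_remaining = 10
  while redirects_remaining > 0 do
    local ok, reason = splash:wait{
      time = 10,
      cancel_on_redirect = true,
    }
    if reason ~= 'redirect' then
      return ok, reason
    end
    redirects_remaining = redirects_remaining - 1
  end
  return nil, "too_many_redirects"
end

function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  local ok, reason = wait_restarting_on_redirects(splash)
  return {
    url = splash:url(),
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
    ok = ok,
    reason = reason,
  }
end
""".strip()


def build_execute_url(page_url, lua_source,
                      base="http://localhost:8050", timeout=600.0):
    """Build a Splash /execute URL (hypothetical helper, not part of Splash).

    urlencode percent-encodes every value, so the script can contain
    newlines, quotes and braces without manual escaping.
    """
    query = urlencode({
        "url": page_url,
        "timeout": timeout,
        "lua_source": lua_source,
    })
    return f"{base}/execute?{query}"


if __name__ == "__main__":
    print(build_execute_url("http://libstarrify.so/rodata/test_1.html", LUA_SOURCE))
```

The printed URL can be passed to curl in place of the literal one, which keeps the script in the request identical to the script shown in the report.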

In 2 of those 100 runs the redirection did not happen:

[pengyu@GLaDOS-Precision-7510 temp]$ cat test_1.jl | jq '.url' | sort | uniq -c
     98 "http://httpbin.org/get"
      2 "http://libstarrify.so/rodata/test_1.html"
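For anyone without jq at hand, the same tally can be sketched in Python. This is an illustrative equivalent of the pipeline above, not part of the original test; the function names are hypothetical. It assumes only that `test_1.jl` contains the JSON responses back to back, with or without newlines between them.

```python
import json
from collections import Counter


def iter_json_values(text):
    """Yield successive JSON values from text, whether or not they are newline-separated."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


def count_final_urls(text):
    """Count how often each final URL appears across all responses."""
    return Counter(result.get("url") for result in iter_json_values(text))


if __name__ == "__main__":
    with open("test_1.jl", encoding="utf-8") as f:
        for url, count in count_final_urls(f.read()).most_common():
            print(count, url)
```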

Here is some further info that might help with the investigation:

[pengyu@GLaDOS-Precision-7510 temp]$ cat test_1.jl | jq 'select(.url=="http://libstarrify.so/rodata/test_1.html")' -c | head -n 1 | jq '{url:.url, html:.html, har:.har}'
{
  "url": "http://libstarrify.so/rodata/test_1.html",
  "html": "<html><head><meta http-equiv=\"REFRESH\" content=\"0;http://httpbin.org/get\"></head><body>\n</body></html>",
  "har": {
    "log": {
      "browser": {
        "comment": "PyQt 5.5.1, Qt 5.5.1",
        "name": "QWebKit",
        "version": "538.1"
      },
      "pages": [
        {
          "startedDateTime": "2016-10-25T13:50:47.193256Z",
          "id": "1",
          "title": "",
          "pageTimings": {
            "onLoad": 3,
            "_onHtmlRendered": 9953,
            "onContentLoad": 3,
            "_onStarted": 2,
            "_onScreenshotPrepared": 9957,
            "_onPngRendered": 9988
          }
        }
      ],
      "version": "1.2",
      "entries": [
        {
          "time": 0,
          "startedDateTime": "2016-10-25T13:50:47.197128Z",
          "cache": {},
          "_splash_processing_state": "created",
          "response": {
            "ok": true,
            "statusText": "OK",
            "status": 0,
            "bodySize": -1,
            "redirectURL": "",
            "cookies": [],
            "headersSize": 0,
            "url": "http://httpbin.org/get",
            "headers": [],
            "content": {
              "size": 0,
              "mimeType": ""
            },
            "httpVersion": "HTTP/1.1"
          },
          "timings": {
            "connect": -1,
            "send": 0,
            "dns": -1,
            "blocked": -1,
            "ssl": -1,
            "receive": 0,
            "wait": 0
          },
          "request": {
            "method": "GET",
            "url": "http://httpbin.org/get",
            "bodySize": -1,
            "queryString": [],
            "cookies": [],
            "headersSize": 227,
            "headers": [
              {
                "value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "name": "Accept"
              },
              {
                "value": "http://libstarrify.so/rodata/test_1.html",
                "name": "Referer"
              },
              {
                "value": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) splash Safari/538.1",
                "name": "User-Agent"
              }
            ],
            "httpVersion": "HTTP/1.1"
          },
          "pageref": "1"
        }
      ],
      "creator": {
        "name": "Splash",
        "version": "2.2.1"
      }
    }
  }
}

Questions

Does this HAR entry mean that the request to http://httpbin.org/get was served from cache ("ok": true and "statusText": "OK")? (If so, something else must be wrong that prevents the redirection.)
Otherwise, if the request failed, can splash:wait return some info indicating that? (For example, a call to splash:go may return nil, "network2".)
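Until that is answered, a client could flag suspicious results itself. The sketch below is a heuristic under a stated assumption: it treats a HAR entry whose response has status 0 and no headers (as in the entry above) as a request for which no response was recorded. That reading of the HAR is a guess, not confirmed Splash behavior, and `unfinished_entries` is a hypothetical helper.

```python
def unfinished_entries(har):
    """Return request URLs of HAR entries that show no sign of a received response."""
    flagged = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        # Assumption: status 0 with an empty header list means that
        # no HTTP response was recorded for this request.
        if response.get("status", 0) == 0 and not response.get("headers"):
            flagged.append(entry.get("request", {}).get("url"))
    return flagged


if __name__ == "__main__":
    # Trimmed-down version of the HAR shown in this report.
    har = {
        "log": {
            "entries": [
                {
                    "request": {"url": "http://httpbin.org/get"},
                    "response": {"status": 0, "headers": []},
                }
            ]
        }
    }
    print(unfinished_entries(har))
```

A run whose final URL is still the original page and whose HAR has a flagged entry for the redirect target could then be retried by the caller.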

kmike commented 7 years ago

Hey @starrify,

I've seen problems with redirects; they sometimes affect the Splash test suite. For some reason it is also a much larger issue on OS X than on Ubuntu (or in Docker).

I'm not sure how to fix that; it looks like a bug somewhere deep in qt/qtwebkit/webkit. We could add qtwebengine support (https://github.com/scrapinghub/splash/issues/349) or use a qtwebkit fork with an updated engine (https://github.com/annulen/webkit), which might help, but either option is a huge amount of work.