scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Header transfer-encoding makes Splash API return 504 Gateway Timeout #932

Open Urahara opened 5 years ago

Urahara commented 5 years ago

I was developing a crawler using Splash when I suddenly started receiving a lot of gateway timeouts. While troubleshooting the problem, I discovered that the cause is the header transfer-encoding: chunked. I made a PoC (the URL httpbin.org/headers returns the same headers I sent in the request):

import requests
import json

ENDPOINT_SPLASH = 'http://localhost:8050/execute'

def test_with_custom_headers():
    lua_script = """
    function main(splash, args)
     splash:set_custom_headers({
       ["x-custom-header"] = "splash"
     })
     assert(splash:go(args.url))
     assert(splash:wait(0.5))
     return {
       html = splash:html()
     }
    end
    """

    payload = {
        'lua_source': lua_script,
        'url': 'https://httpbin.org/headers',
        'timeout': 15,
    }

    r = requests.post(url=ENDPOINT_SPLASH,
                      json=payload)

    result = json.loads(r.text)

    return result.get('html', result)

# Note: despite its name, this test sets the transfer-encoding header.
def test_with_content_encoding():
    lua_script = """
    function main(splash, args)
     splash:set_custom_headers({
       ["transfer-encoding"] = "chunked"
     })
     assert(splash:go(args.url))
     assert(splash:wait(0.5))
     return {
       html = splash:html()
     }
    end
    """

    payload = {
        'lua_source': lua_script,
        'url': 'https://httpbin.org/headers',
        'timeout': 15,
    }

    r = requests.post(url=ENDPOINT_SPLASH,
                      json=payload)

    result = json.loads(r.text)

    return result.get('html', result)

print("test_with_custom_headers: \n{}\n".format(test_with_custom_headers()))
print("test_with_content_encoding: \n{}".format(test_with_content_encoding()))

Results:

test_with_custom_headers: 
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en,*", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1", 
    "X-Custom-Header": "splash"
  }
}
</pre></body></html>

test_with_content_encoding: 
{'info': {'timeout': 15.0}, 'type': 'GlobalTimeoutError', 'error': 504, 'description': 'Timeout exceeded rendering page'}

Granitosaurus commented 4 years ago

I'm having the same issue, but weirdly enough only when using proxies via splash:on_request. My Splash is patched with the decompression patch described in this issue: https://github.com/scrapinghub/splash/issues/324. If you aren't using proxies, this might solve the issue for you.
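
For anyone trying to reproduce the proxy case, here is a minimal sketch of a payload that sets a proxy through splash:on_request. It is not taken from the setup described above: the function name `build_proxy_payload`, the argument names `proxy_host`, `proxy_port` and `proxy_type`, and the proxy address in the usage note are all hypothetical. The Lua side relies only on `request:set_proxy`, which the Splash scripting API provides.

```python
import json

# Lua script that assigns a proxy to every outgoing request.
# The proxy settings come from the arguments sent to /execute;
# the argument names (proxy_host, proxy_port, proxy_type) are
# hypothetical and chosen only for this sketch.
LUA_PROXY_SCRIPT = """
function main(splash, args)
  splash:on_request(function(request)
    request:set_proxy{
      host = args.proxy_host,
      port = args.proxy_port,
      type = args.proxy_type,
    }
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {html = splash:html()}
end
"""


def build_proxy_payload(url, proxy_host, proxy_port, proxy_type="HTTP", timeout=15):
    """Build the JSON payload for the Splash /execute endpoint."""
    # Splash accepts these two proxy types for request:set_proxy.
    if proxy_type not in ("HTTP", "SOCKS5"):
        raise ValueError("proxy_type must be 'HTTP' or 'SOCKS5'")
    return {
        "lua_source": LUA_PROXY_SCRIPT,
        "url": url,
        "timeout": timeout,
        "proxy_host": proxy_host,
        "proxy_port": proxy_port,
        "proxy_type": proxy_type,
    }
```

The payload can be posted the same way as in the PoC, for example `requests.post('http://localhost:8050/execute', json=build_proxy_payload('https://httpbin.org/headers', 'proxy.example.com', 8080))`, where `proxy.example.com:8080` is a placeholder. Calling it once with `proxy_type="HTTP"` and once with `proxy_type="SOCKS5"` gives a direct comparison of the two proxy types.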

nirvana-msu commented 4 years ago

I'm having a similar issue. In my case it's not only transfer-encoding that causes problems; some other headers do as well. The issue is only observed when using a proxy, and only for HTTPS URLs. Removing the proxy or crawling HTTP URLs works fine.

What's even more interesting is that I only have a problem when I use an HTTP proxy. If I use a SOCKS5 proxy instead, then it works.
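
Since transfer-encoding and possibly other headers seem to trigger the timeout, one possible workaround is to filter the headers before handing them to splash:set_custom_headers. The sketch below assumes the crawler copies headers from somewhere else (a captured browser request, for instance) and drops the ones HTTP defines as hop-by-hop (see RFC 2616, section 13.5.1). Transfer-Encoding is one of them: it describes how the message body is framed, so declaring it on a request without a matching chunked body may leave the server or proxy waiting for data. The function name `strip_hop_by_hop` is hypothetical, and nobody in this thread has confirmed that this resolves the issue.

```python
# Hop-by-hop headers per RFC 2616, section 13.5.1. The RFC spells one of
# them "Trailers" while the actual header is "Trailer", so both are listed.
HOP_BY_HOP_HEADERS = frozenset({
    "connection",
    "keep-alive",
    "proxy-authenticate",
    "proxy-authorization",
    "te",
    "trailer",
    "trailers",
    "transfer-encoding",
    "upgrade",
})


def strip_hop_by_hop(headers):
    """Return a copy of headers without hop-by-hop entries.

    Header names are compared case-insensitively; the original
    spelling of the remaining names is preserved.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in HOP_BY_HOP_HEADERS
    }
```

The filtered dict could then be sent as an extra argument in the /execute payload (for example under a `headers` key) and applied in the Lua script with `splash:set_custom_headers(args.headers)`.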

0xfede7c8 commented 4 years ago

Is this fixed?

agg23 commented 4 years ago

This, or something similar, appears to be happening to me as well when using an HTTP proxy.

bpgallagher commented 3 years ago

I'm having the same issue. Is there a solution for this problem yet? Thank you.