scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

network301 instead of http404 or successful render when filters are enabled #1023

Open lopuhin opened 4 years ago

lopuhin commented 4 years ago

The docs at https://splash.readthedocs.io/en/stable/api.html#request-filters say

Only related resources are filtered out by request filters; ‘main’ page loading request can’t be blocked this way.

But the filters appear to be applied to the main request as well.

Consider this lua script:

function main(splash, args)
  assert(splash:go(args.url))
  return 1
end

and start splash with filters:

mkdir test_filters
echo "books" > test_filters/test-filters.txt
docker run -v `pwd`/test_filters:/etc/splash/filters -p 8052:8050 scrapinghub/splash:3.4.1
...
2020-04-28 14:34:30.058728 [-] Loading filter test-filters
...

then make a request to http://books.toscrape.com/foo without filters:

curl 'http://localhost:8052/execute?url=http%3A%2F%2Fbooks.toscrape.com%2Ffoo&lua_source=function+main(splash%2C+args)%0D%0A++assert(splash%3Ago(args.url))%0D%0A++return+1%0D%0Aend'
{"error": 400, "type": "ScriptError", "description": "Error happened while executing Lua script", "info": {"source": "[string \"function main(splash, args)\r...\"]", "line_number": 2, "error": "http404", "type": "LUA_ERROR", "message": "Lua error: [string \"function main(splash, args)\r...\"]:2: http404"}}

this returned 404 as expected. Now with filters:

curl 'http://localhost:8052/execute?url=http%3A%2F%2Fbooks.toscrape.com%2Ffoo&lua_source=function+main(splash%2C+args)%0D%0A++assert(splash%3Ago(args.url))%0D%0A++return+1%0D%0Aend&filters=test-filters'
{"error": 400, "type": "ScriptError", "description": "Error happened while executing Lua script", "info": {"source": "[string \"function main(splash, args)\r...\"]", "line_number": 2, "error": "network301", "type": "LUA_ERROR", "message": "Lua error: [string \"function main(splash, args)\r...\"]:2: network301"}}

this returned network301 instead of http404.
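To compare the two responses programmatically, the error code can be read from the `info.error` field of the JSON body. The sketch below assumes the response shape shown above; `splash_error_code` and `classify` are hypothetical helper names, not part of Splash. The prefix convention (`http…` vs `network…`) is taken from the outputs in this report, and per the `splash:go` documentation the number after `network` should be a Qt network error code rather than an HTTP status, so `network301` is not an HTTP 301 redirect.

```python
import json


def splash_error_code(body: str):
    """Return info.error from a Splash ScriptError body, or None if absent."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    # A successful script result such as `1` is not a JSON object.
    if not isinstance(payload, dict):
        return None
    info = payload.get("info")
    if isinstance(info, dict):
        return info.get("error")
    return None


def classify(code):
    """Rough classification based on the prefixes seen in this report."""
    if code is None:
        return "no error"
    if code.startswith("http"):
        return "HTTP status error"
    if code.startswith("network"):
        return "network-level error"
    return "other"
```

With the two responses above, this would report `http404` as an HTTP status error and `network301` as a network-level error, which makes the difference between the filtered and unfiltered runs easy to assert in a test.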

For a page which gives a 200 we have a successful response even with filters enabled:

curl 'http://localhost:8052/execute?url=http%3A%2F%2Fbooks.toscrape.com&lua_source=function+main(splash%2C+args)%0D%0A++assert(splash%3Ago(args.url))%0D%0A++return+1%0D%0Aend&filters=test-filters'
1
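The curl commands above embed the URL-encoded Lua script, which makes them hard to read and easy to mistype. A minimal sketch of building the same `/execute` URLs in Python follows; the helper names `build_execute_url` and `query_params` are hypothetical, and the base URL assumes the `-p 8052:8050` port mapping used above. `urlencode` escapes a few more characters than the hand-written commands (for example parentheses), but the decoded query is the same.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The Lua script from this report, with the \r\n line endings used in the curl commands.
LUA_SOURCE = (
    "function main(splash, args)\r\n"
    "  assert(splash:go(args.url))\r\n"
    "  return 1\r\n"
    "end"
)


def build_execute_url(target_url, filters=None, base="http://localhost:8052"):
    """Build a Splash /execute URL; `filters` is omitted when not given."""
    params = {"url": target_url, "lua_source": LUA_SOURCE}
    if filters:
        params["filters"] = filters
    return f"{base}/execute?{urlencode(params)}"


def query_params(url):
    """Decode the query string back into a dict of lists, for inspection."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    print(build_execute_url("http://books.toscrape.com/foo"))
    print(build_execute_url("http://books.toscrape.com/foo", filters="test-filters"))
```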

If the URL does not match the filters we get 404 as we should:

curl 'http://localhost:8052/execute?url=http%3A%2F%2Fquotes.toscrape.com%2Ffoo&lua_source=function+main(splash%2C+args)%0D%0A++assert(splash%3Ago(args.url))%0D%0A++return+1%0D%0Aend&filters=test-filters'
{"error": 400, "type": "ScriptError", "description": "Error happened while executing Lua script", "info": {"source": "[string \"function main(splash, args)\r...\"]", "line_number": 2, "error": "http404", "type": "LUA_ERROR", "message": "Lua error: [string \"function main(splash, args)\r...\"]:2: http404"}}
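The pattern in these results is consistent with the bare rule `books` being treated as a substring match against the request URL. The sketch below only illustrates that assumption. It is not how Splash or its Adblock Plus rule parser is implemented, and `matches_plain_rule` is a hypothetical helper.

```python
def matches_plain_rule(rule: str, url: str) -> bool:
    # Assumption: a plain rule with no special syntax matches any URL
    # that contains it as a substring.
    return rule in url


RULE = "books"  # the single line written to test-filters.txt above

URLS = [
    "http://books.toscrape.com/foo",   # network301 with filters
    "http://books.toscrape.com",       # renders successfully with filters
    "http://quotes.toscrape.com/foo",  # http404 with filters, as expected
]

if __name__ == "__main__":
    for url in URLS:
        print(url, "matches" if matches_plain_rule(RULE, url) else "does not match")
```

Under this assumption both `books.toscrape.com` URLs match the rule, yet only the 404 page turns into `network301`, so whether the rule matches does not by itself explain the outcome.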

sibiryakov commented 4 years ago

403 with description in the body seems clearer.

lopuhin commented 4 years ago

Note that on some pages, we get network301 instead of a successful render, e.g. for https://www.accc.gov.au/media-release/advertising-agents-warned-of-risks-of-breaching-trade-practices-act

echo "advertising" >> test_filters/test-filters.txt
# with filters
curl 'http://localhost:8052/execute?url=https%3A%2F%2Fwww.accc.gov.au%2Fmedia-release%2Fadvertising-agents-warned-of-risks-of-breaching-trade-practices-act&lua_source=function+main(splash%2C+args)%0D%0A++assert(splash%3Ago(args.url))%0D%0A++return+1%0D%0Aend&filters=test-filters'
{"error": 400, "type": "ScriptError", "description": "Error happened while executing Lua script", "info": {"source": "[string \"function main(splash, args)\r...\"]", "line_number": 2, "error": "network301", "type": "LUA_ERROR", "message": "Lua error: [string \"function main(splash, args)\r...\"]:2: network301"}}
# without filters
curl 'http://localhost:8052/execute?url=https%3A%2F%2Fwww.accc.gov.au%2Fmedia-release%2Fadvertising-agents-warned-of-risks-of-breaching-trade-practices-act&lua_source=function+main(splash%2C+args)%0D%0A++assert(splash%3Ago(args.url))%0D%0A++return+1%0D%0Aend'
1

This page gives a 304 (cached) on the second render, but the error happens almost immediately regardless of the order of requests, even when it is the first request after a restart.

Here are the logs:

2020-04-28 15:15:15.099973 [render] [139617798218360] viewport size is set to 1024x768
2020-04-28 15:15:15.100052 [pool] [139617798218360] SLOT 0 is starting         
2020-04-28 15:15:15.100089 [render] [139617798218360] function main(splash, args)\r\n  assert(splash:go(args.url))\r\n  return 1\r\nend
2020-04-28 15:15:15.102196 [render] [139617798218360] [lua_runner] dispatch cmd_id=__START__
2020-04-28 15:15:15.102238 [render] [139617798218360] [lua_runner] arguments are for command __START__, waiting for result of __START__
2020-04-28 15:15:15.102270 [render] [139617798218360] [lua_runner] entering dispatch/loop body, args=()
2020-04-28 15:15:15.102297 [render] [139617798218360] [lua_runner] send None   
2020-04-28 15:15:15.102325 [render] [139617798218360] [lua_runner] send (lua) None
2020-04-28 15:15:15.102414 [render] [139617798218360] [lua_runner] got AsyncBrowserCommand(id=None, name='go', kwargs={'url': 'https://www.accc.gov.au/media-release/advertising-agents-warned-of-risks-of-breaching-trade-practices-act', 'baseurl': None, 'callback': '<a callback>', 'errback': '<an errback>', 'http_method': 'GET', 'body': None, 'headers': None})
2020-04-28 15:15:15.102460 [render] [139617798218360] [lua_runner] instructions used: 70
2020-04-28 15:15:15.102494 [render] [139617798218360] [lua_runner] executing AsyncBrowserCommand(id=0, name='go', kwargs={'url': 'https://www.accc.gov.au/media-release/advertising-agents-warned-of-risks-of-breaching-trade-practices-act', 'baseurl': None, 'callback': '<a callback>', 'errback': '<an errback>', 'http_method': 'GET', 'body': None, 'headers': None})
2020-04-28 15:15:15.102526 [render] [139617798218360] HAR event: _onStarted    
2020-04-28 15:15:15.102588 [render] [139617798218360] callback 0 is connected to loadFinished
2020-04-28 15:15:15.103311 [network] [139617798218360] GET https://www.accc.gov.au/media-release/advertising-agents-warned-of-risks-of-breaching-trade-practices-act
2020-04-28 15:15:15.103461 [request_middleware] Filter test-filters: dropped 139617798218360 GET https://www.accc.gov.au/media-release/advertising-agents-warned-of-risks-of-breaching-trade-practices-act
2020-04-28 15:15:15.104235 [pool] [139617798218360] SLOT 0 is working          
2020-04-28 15:15:15.104282 [pool] [139617798218360] queued                     
2020-04-28 15:15:15.104373 [QAbstractEventDispatcher] awake; block time: 0.0209
2020-04-28 15:15:15.104403 [QAbstractEventDispatcher] aboutToBlock             
2020-04-28 15:15:15.104553 [-] ErrorPageExtension in WebkitWebPage.extension   
2020-04-28 15:15:15.108445 [render] [139617798218360] loadFinished: unknown error
2020-04-28 15:15:15.108495 [render] [139617798218360] loadFinished: disconnecting callback 0
2020-04-28 15:15:15.108541 [render] [139617798218360] [lua_runner] dispatch cmd_id=0
2020-04-28 15:15:15.108569 [render] [139617798218360] [lua_runner] arguments are for command 0, waiting for result of 0
2020-04-28 15:15:15.108602 [render] [139617798218360] [lua_runner] entering dispatch/loop body, args=(PyResult('return', None, 'network301'),)
2020-04-28 15:15:15.108632 [render] [139617798218360] [lua_runner] send PyResult('return', None, 'network301')
2020-04-28 15:15:15.108667 [render] [139617798218360] [lua_runner] send (lua) (b'return', None, b'network301')
2020-04-28 15:15:15.108724 [render] [139617798218360] [lua_runner] instructions used: 79
2020-04-28 15:15:15.108756 [render] [139617798218360] [lua_runner] caught LuaError LuaError('[string "function main(splash, args)\\r..."]:2: network301',)
2020-04-28 15:15:15.108852 [pool] [139617798218360] SLOT 0 finished with an error <splash.qtrender_lua.LuaRender object at 0x7efb4c407470>: [Failure instance: Traceback: <class 'splash.errors.ScriptError'>: {'source': '[string "function main(splash, args)\r..."]', 'line_number': 2, 'error': 'network301', 'type': 'LUA_ERROR', 'message': 'Lua error: [string "function main(splash, args)\r..."]:2: network301'}
        /app/splash/engines/webkit/browser_tab.py:501:_on_content_ready        
        /app/splash/qtrender_lua.py:714:error                                  
        /app/splash/lua_runner.py:27:return_result                             
        /app/splash/render_scripts.py:21:stop_on_error_wrapper                 
        --- <exception caught here> ---                                        
        /app/splash/render_scripts.py:19:stop_on_error_wrapper                   
        /app/splash/qtrender_lua.py:2343:dispatch                              
        /app/splash/lua_runner.py:195:dispatch                                 
        ]
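
The key line in these logs is the `[request_middleware] Filter test-filters: dropped ...` entry, which shows the main page request itself being dropped by the filter. A small sketch for pulling such entries out of a Splash log follows. The regular expression is based only on the single line format seen above, and `dropped_requests` and the group name `tag` (the number printed after `dropped`) are hypothetical names.

```python
import re

# Matches lines like:
# ... [request_middleware] Filter test-filters: dropped 139617798218360 GET https://...
DROPPED_RE = re.compile(
    r"\[request_middleware\] Filter (?P<filter>\S+): dropped "
    r"(?P<tag>\d+) (?P<method>[A-Z]+) (?P<url>\S+)"
)


def dropped_requests(log_text: str):
    """Return one dict per request that a filter reported as dropped."""
    return [m.groupdict() for m in DROPPED_RE.finditer(log_text)]


if __name__ == "__main__":
    import sys

    # Usage: python dropped.py < splash.log
    for entry in dropped_requests(sys.stdin.read()):
        print(entry["filter"], entry["method"], entry["url"])
```

Comparing the dropped URL with the URL passed to `splash:go` would confirm whether the main request, and not only a related resource, was filtered.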