scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

render fails when trying to render youtube with easylist filter #668

Open ghost opened 6 years ago

ghost commented 6 years ago

http://localhost:8050/render.html?url=http://youtube.com&filters=easylist

`{"info": {"text": "Protocol \"\" is unknown", "url": "", "type": "Network", "code": 301}, "type": "RenderError", "error": 502, "description": "Error rendering page"}`

rendering reddit.com works and easylist does apply to it. http://localhost:8050/render.html?url=http://reddit.com&filters=easylist

easylist.txt

kmike commented 6 years ago

There are many easylist filters, which one are you using? Are you sure it doesn't filter out youtube? Because it looks like this is the problem in your case - filter contains youtube URL, so request to youtube.com fails.

ghost commented 6 years ago

(I use Aquarium right now.) I got my easylist from https://github.com/TeamHG-Memex/aquarium/tree/master/%7B%7Bcookiecutter.folder_name%7D%7D/filters, then I changed it to https://easylist-downloads.adblockplus.org/easylist.txt

Neither seems to render youtube for me. Edit: it seems to render video pages just fine, yet fails on youtube.com.

lucywang000 commented 6 years ago

I tried to render this page with the easylist filter, and it fails with exactly the same error.

2017-09-28 09:06:08.419028 [render] [140513949654992] loadFinished: RenderErrorInfo(type='Network', code=301, text='Protocol "" is unknown', url='')

The easylist I use is https://easylist.to/easylist/easylist.txt

If I don't use a request filter, the above error does not happen (though there are still some JS execution problems, but that should be another story).

Vineeth-Mohan commented 6 years ago

@kmike - I am also facing this issue. We are using Splash version 2.3.3. Kindly look into it.

kmike commented 6 years ago

I think the way to go is to find an offending rule in easylist.txt list (or in whatever list you're using), and either remove it, or fix https://github.com/scrapinghub/adblockparser to handle it properly (if it is an issue with parsing).

Vineeth-Mohan commented 6 years ago

@kmike - As you can see from the debug log, the issue occurs well after the initial URLs are parsed and downloaded. So I don't think this is a case where the filters are blocking the main URL.

Vineeth-Mohan commented 6 years ago

@kmike - On further investigation, I found that there is an iframe inside the page:

```html
<!-- Google Tag Manager Start -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-QSGL"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-QSGL');</script>
<!-- End Google Tag Manager -->
```

Content of www.googletagmanager.com/ns.html?id=GTM-QSGL:

```html
<!DOCTYPE html>
ns
```

We had blocked doubleclick.net in the filters.
Two questions here:

1. Even with iframes disabled by default (and with iframes manually disabled via the render.json API), it is still rendering the iframe.
2. If the iframe is evaluated and its src is filtered out, should we be seeing a complete page blackout?

Vineeth-Mohan commented 6 years ago

@kmike - So the issue has been narrowed down: if there is an iframe whose src is filtered out by the filter list, then the entire site render fails.

Is there a way to disable rendering of iframes ?

mlwelles commented 6 years ago

+1 on this issue; it's causing my team a bunch of pain currently. Wading through 4 MB of easylist filters to find which specific rule is the cause, for every case where this occurs (and it occurs a lot when dealing with heterogeneous page sources), isn't a fair response, especially for what seems clearly to be a failure of Splash to handle errors gracefully.

It's a bug in the filter handling, and it should be addressed.

mlwelles commented 6 years ago

I found a workaround for the issue. The trick is to change the URL of the filtered request from '' (an empty string) to a valid URL instead, one that returns no data.

Specifically, I created an empty file and put it up at a public S3 URL:

```shell
touch /tmp/empty.txt
aws s3 cp /tmp/empty.txt s3://static.[REDACTED].com/dev/null --acl public-read
# just to make sure it's actually accessible
curl https://s3.amazonaws.com/static.[REDACTED].com/dev/null
```

And then in my /execute lua script, I did the equivalent of the following:

```lua
splash:on_request(function(request)
    if request.url == '' then
        request.url = 'https://s3.amazonaws.com/static.[REDACTED].com/dev/null'
    end
end)
```

After which the Network 301 error on the specific page I was fighting with, which failed consistently whenever easylist filters were enabled, cleared up completely. I suspect the same will be true for the other pages that have been problematic.

lucywang000 commented 6 years ago

@mlwelles Good trick! Maybe we can simply do a request.abort if url is empty?
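
A minimal sketch of that suggestion, assuming the standard Splash `/execute` entry point (`main(splash, args)` and `args.url` are the usual Splash conventions; the logic inside `on_request` is the only part specific to this workaround):

```lua
-- Sketch: abort sub-requests whose URL was blanked out by the filter,
-- instead of letting the "" URL reach Qt and fail the whole render.
function main(splash, args)
    splash:on_request(function(request)
        if request.url == '' then
            -- request:abort() cancels only this sub-request;
            -- the rest of the page keeps loading.
            request:abort()
        end
    end)
    assert(splash:go(args.url))
    return splash:html()
end
```

Compared to the S3 redirect trick, this avoids a round-trip to an external host for every filtered request, though it is untested against the original youtube.com case.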

brett--anderson commented 5 years ago

@mlwelles Thanks for the solution! For anyone else trying out the solution, I found that request.url can only be used for reading the URL attribute, not for setting it. To set the URL I think you need the line:

request:set_url(url)
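
Putting the two comments together, the workaround sketch becomes (the S3 URL is the redacted placeholder from the earlier comment, not a real endpoint):

```lua
splash:on_request(function(request)
    if request.url == '' then
        -- request.url is read-only; use the setter method instead
        request:set_url('https://s3.amazonaws.com/static.[REDACTED].com/dev/null')
    end
end)
```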

0xIslamTaha commented 4 years ago

Are there any updates on this issue? Rendering doesn't block ads even though Splash has loaded the filter files.