Open ghost opened 6 years ago
There are many easylist filters, which one are you using? Are you sure it doesn't filter out youtube? Because it looks like this is the problem in your case - filter contains youtube URL, so request to youtube.com fails.
(I use aquarium right now.) I got my easylist from https://github.com/TeamHG-Memex/aquarium/tree/master/%7B%7Bcookiecutter.folder_name%7D%7D/filters, then I changed it to https://easylist-downloads.adblockplus.org/easylist.txt
Neither seems to render youtube for me. Edit: it seems to render video pages just fine, yet fails on youtube.com.
I tried to render this page with the easylist filter; it fails with exactly the same error.
2017-09-28 09:06:08.419028 [render] [140513949654992] loadFinished: RenderErrorInfo(type='Network', code=301, text='Protocol "" is unknown', url='')
The easylist I use is https://easylist.to/easylist/easylist.txt
If I don't use the request filter, the above error does not happen (though there are still some JS execution problems, but that should be another story).
@kmike - I am also facing this issue. We are using Splash version 2.3.3. Kindly look into it.
I think the way to go is to find an offending rule in easylist.txt list (or in whatever list you're using), and either remove it, or fix https://github.com/scrapinghub/adblockparser to handle it properly (if it is an issue with parsing).
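One way to search for an offending rule without reading the whole list by hand is to bisect it. The sketch below is only an illustration under two assumptions: a single rule triggers the failure, and `render_fails` is a hypothetical callback you supply (for example, one that writes the given rules to a filter file, renders the page through Splash with it, and reports whether the render errored). Neither name comes from Splash or adblockparser.

```python
from typing import Callable, List, Optional


def find_offending_rule(
    rules: List[str],
    render_fails: Callable[[List[str]], bool],
) -> Optional[str]:
    """Binary-search `rules` for the one rule that makes rendering fail.

    `render_fails` is a hypothetical, user-supplied callback: it receives a
    subset of the rules and returns True when rendering with that subset
    reproduces the error. Assumes exactly one rule is responsible.
    """
    if not render_fails(rules):
        return None  # the full list does not reproduce the failure
    candidates = rules
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # Keep whichever half still reproduces the failure.
        if render_fails(first_half):
            candidates = first_half
        else:
            candidates = candidates[mid:]
    return candidates[0]
```

With a list of N rules this needs roughly log2(N) renders instead of N. If several rules interact (for example an exception rule cancelling a blocking rule), the single-rule assumption breaks and the result should be double-checked by rendering with only the reported rule.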
@kmike - As you can see from the debug log, the issue occurs well after the initial URLs are parsed and downloaded, so I don't think this is a case where the filters are blocking the main URL.
@kmike - On further investigation, we found that there is an iframe inside the page:
```html
<!-- Google Tag Manager Start -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-QSGL"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-QSGL');</script>
<!-- End Google Tag Manager -->
```
Content of www.googletagmanager.com/ns.html?id=GTM-QSGL
<!DOCTYPE html>
We had blocked doubleclick.net in the filters.
Two questions here -
1. Even with iframes disabled by default (and manually disabled through the render.json API), the iframe is still being rendered.
2. If the iframe is evaluated and its src is filtered out, should we be seeing a complete page blackout?
@kmike - So the issue has narrowed down to this: if there is an iframe whose src is filtered out by the filter list, then the entire site render fails.
Is there a way to disable rendering of iframes?
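To check whether a failing page has iframes whose `src` might be hit by the filter list, it can help to list them first. This is a minimal sketch using only Python's standard `html.parser`; the class and function names are made up for illustration, and it only sees iframes present in the static HTML, not ones injected later by JavaScript.

```python
from html.parser import HTMLParser
from typing import List


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every <iframe> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html: str) -> List[str]:
    """Return the src values of all iframes found in the static HTML."""
    collector = IframeSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

Each returned URL can then be compared against the filter list to see whether one of them is a likely trigger.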
+1 on this issue. It's causing my team a bunch of pain currently. Wading through 4 MB of easylist filters to find which specific one is the cause in every case where this occurs (and it occurs a lot when dealing with heterogeneous page sources) isn't a fair response, especially for what clearly seems to be a failure in Splash to handle errors gracefully.
It's a bug in the filter handling, and should be addressed.
I found a workaround for the issue. The trick is to change the url of the filtered request from being '' (an empty string) to be a valid url instead -- one that returns no data.
Specifically, I created an empty file and put it up at a public s3 url.
touch /tmp/empty.txt
aws s3 cp /tmp/empty.txt s3://static.[REDACTED].com/dev/null --acl public-read
# just to make sure it's actually accessible
curl https://s3.amazonaws.com/static.[REDACTED].com/dev/null
And then in my /execute lua script, I did the equivalent of the following:
splash:on_request(function (request)
  if request.url == '' then
    request.url = 'https://s3.amazonaws.com/static.[REDACTED].com/dev/null'
  end
end)
After which the Network 301 error on the specific page I was fighting with, which failed consistently when easylist filters were enabled, cleared up completely. I suspect the same will hold for the other pages that have been problematic.
@mlwelles Good trick! Maybe we can simply call request:abort() if the URL is empty?
@mlwelles Thanks for the solution! For anyone else trying it out, I found that request.url can only be used to read the URL, not to set it. To set the URL, I think you need the line:
request:set_url()
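Putting the workaround and the `request:set_url` correction together, a client-side sketch might look like the following. It builds (but does not send) a POST request for Splash's `/execute` endpoint with the Lua script embedded. The helper name, the `fallback_url` argument, and the `https://example.com/empty` placeholder are assumptions for illustration; whether filtered requests always reach `on_request` with an empty URL is taken from the reports in this thread, not verified.

```python
import json
from urllib.request import Request

# Lua script for /execute: rewrite requests whose URL is empty
# (as reported above for filtered requests) to a harmless fallback URL.
LUA_SOURCE = """
function main(splash)
  splash:on_request(function(request)
    if request.url == '' then
      request:set_url(splash.args.fallback_url)
    end
  end)
  assert(splash:go(splash.args.url))
  return splash:html()
end
"""


def build_execute_request(page_url, fallback_url,
                          splash_url="http://localhost:8050"):
    """Build a POST request for Splash's /execute endpoint (not sent here)."""
    payload = {
        "lua_source": LUA_SOURCE,
        "url": page_url,
        "fallback_url": fallback_url,  # hypothetical: any URL returning an empty body
        "filters": "easylist",
    }
    return Request(
        splash_url + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Something like `urllib.request.urlopen(build_execute_request("http://youtube.com", "https://example.com/empty"))` would then send it. Replacing the `set_url` call with `request:abort()` is the alternative suggested earlier in the thread; it is untested here.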
Are there any updates on this issue? Rendering doesn't block ads even though Splash loaded the filter files.
http://localhost:8050/render.html?url=http://youtube.com&filters=easylist
{"info": {"text": "Protocol \"\" is unknown", "url": "", "type": "Network", "code": 301}, "type": "RenderError", "error": 502, "description": "Error rendering page"}
Rendering reddit.com works, and easylist does apply to it: http://localhost:8050/render.html?url=http://reddit.com&filters=easylist
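For anyone scripting around this, the specific failure can be told apart from other render errors by inspecting the JSON body Splash returns. The sketch below is a hypothetical helper, not part of Splash; it only checks the fields visible in the error shown in this thread (a `RenderError` whose `info` is a Network error with code 301 and an empty URL).

```python
import json


def is_empty_url_render_error(body: str) -> bool:
    """Return True if `body` looks like the 'Protocol "" is unknown' error."""
    try:
        data = json.loads(body)
    except ValueError:
        return False  # not JSON, so not a Splash error document
    if not isinstance(data, dict) or data.get("type") != "RenderError":
        return False
    info = data.get("info") or {}
    return (
        isinstance(info, dict)
        and info.get("type") == "Network"
        and info.get("code") == 301
        and info.get("url") == ""
    )
```

A client could use this to decide when to retry without filters or with a request-rewriting script, rather than treating every 502 the same way.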
easylist.txt