Placeholder issue for discussion of issues in ABP/AdGuard issue tracker -- and possible solutions

gorhill commented 8 years ago

[Intentionally empty]

gorhill commented 8 years ago

Regarding issue https://issues.adblockplus.org/ticket/2278:

@kzar, @ameshkov

Being able to have a token for regex-based filters would definitely help performance. However, trying to programmatically extract a token from a regex-based filter sounds scary to me: there is too much risk of extracting erroneous tokens.

Suggestion: create a new filter option, token=[...], which filter creators can use to assign a predefined token to the filter. The creator of a filter is best placed to figure out whether a token can be used to store the filter internally, and which one.

For example, this filter in EasyList:

/\.filenuke\.com/.*[a-zA-Z0-9]{4}/$script

Could simply have been written by a filter creator:

/\.filenuke\.com/.*[a-zA-Z0-9]{4}/$script,token=filenuke
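
A minimal sketch of how a blocker could consume such an option, assuming the options follow the last "/$" of the filter and that tokens are plain alphanumeric strings. The token= option is only the proposal above, and parseRegexFilter and candidateFilters are hypothetical names, not code from any existing blocker:

```javascript
// Hypothetical helper: split a regex-based filter into its pattern, its
// remaining options and the value of the proposed token= option.
function parseRegexFilter(filterText) {
    // Assumption: options start after the last "/$" in the filter text.
    var optionsStart = filterText.lastIndexOf('/$');
    var pattern = filterText;
    var options = [];
    if (optionsStart !== -1) {
        pattern = filterText.slice(0, optionsStart + 1);
        options = filterText.slice(optionsStart + 2).split(',');
    }
    var token = null;
    var remaining = options.filter(function(option) {
        if (option.indexOf('token=') !== 0) {
            return true;
        }
        token = option.slice('token='.length).toLowerCase();
        return false;
    });
    return { pattern: pattern, options: remaining, token: token };
}

// Hypothetical lookup: only filters whose token appears in the URL, plus the
// filters without any token, still need their regex to be tested.
function candidateFilters(filters, url) {
    var urlTokens = url.toLowerCase().match(/[a-z0-9]+/g) || [];
    return filters.filter(function(filter) {
        return filter.token === null || urlTokens.indexOf(filter.token) !== -1;
    });
}
```

A real implementation would more likely keep the filters in a map keyed by token instead of scanning an array; the array only keeps the sketch short.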

ameshkov commented 8 years ago

Hey guys! I was thinking about solving this issue a while ago. Even tried to implement a simple token-extracting algorithm. I will post my ideas a bit later though.

Meanwhile, here is a list of known regexp rules:

/^(?![a-z]+\:\/+([^\/\:]+\.(il\|com\|net)\|[\.0-9]+\|([^\/\:\.]+\.)*(spot\.im\|vine\.co\|periscope\.tv\|vid\.me\|mako\.tools\|minidom\.org\|jquerymin\.org\|logidea\.info\|zoomanalytics\.co\|firstimpression\.io))\.?([\/\:]\|$))^[^\/\:\.]+\:\/+[^\/\:\.]/$third-party,domain=mako.co.il	EasyList Hebrew	https://github.com/AdBlockPlusIsrael/EasyListHebrew
/^(?![a-z]+\:\/+([^\/\:\.]+\.)*(google\|icdn\|auto\|sport5\|smartair\|mysupermarket\|blms\|linicom)\.co\.il\.?([\/\:]\|$))^[a-z]+\:\/+[^\/\:]+\.il\.?([\/\:]\|$)/$third-party,domain=mako.co.il	EasyList Hebrew	https://github.com/AdBlockPlusIsrael/EasyListHebrew
/^[a-z]+\:\/+[\.0-9]+([\/\:]\|$)/$image,media,object,script,stylesheet,subdocument,third-party,domain=mako.co.il	EasyList Hebrew	https://github.com/AdBlockPlusIsrael/EasyListHebrew
/^(?![a-z]+\:\/+([^\/\:\.]+\.)*(fbcdn\|cloudfront\|facebook\|akamaihd\|ctedgecdn\|2mdn\|uploaditnow\|edgesuite\|doubleclick\|dmcdn\|slideshare\|advsnx)\.net\.?([\/\:]\|$))^[a-z]+\:\/+[^\/\:]+\.net\.?([\/\:]\|$)/$third-party,domain=mako.co.il	EasyList Hebrew	https://github.com/AdBlockPlusIsrael/EasyListHebrew
/^(?![a-z]+\:\/+([^\/\:\.]+\.)*(google\|facebook\|twitter\|instagram\|youtube\|jquery\|googleapis\|vicomi\|twimg\|cdninstagram\|pinterest\|pinimg\|giphy\|playbuzz\|outbrain\|ytimg\|amazonaws\|cloudflare\|gstatic\|sniperm\|dinovich\|shortaudition\|linkedin\|opinionstage\|vimeo\|vimeocdn\|dailymotion\|flickr\|staticflickr\|tumblr\|soundcloud\|scribd\|syteapi\|addthis\|addthisedge\|reddit\|disqus\|disquscdn\|apester\|qmerce\|taboola\|taboolasyndication\|google-analytics\|googletagservices\|googletagmanager\|googleadservices\|googlesyndication\|h-cdn\|scorecardresearch\|serving-sys\|bootstrapcdn\|tiviclick\|ruchlis\|hotjar\|flx1\|mxpnl\|themarker\|adnxs\|conduit\|fourtips\|makojs)\.com\.?([\/\:]\|$))^[a-z]+\:\/+[^\/\:]+\.com\.?([\/\:]\|$)/$third-party,domain=mako.co.il	EasyList Hebrew	https://github.com/AdBlockPlusIsrael/EasyListHebrew
/quang%20cao/	ABPVN List	http://abpvn.com/
/YanAds/	ABPVN List	http://abpvn.com/
/www/images/	ABPVN List	http://abpvn.com/
/ads-pic/	Adblock-Persian list	http://ideone.com/K452p
/eshop-eca/	Adblock-Persian list	http://ideone.com/K452p
/eshop98/	Adblock-Persian list	http://ideone.com/K452p
/402x192/	Adblock-Persian list	http://ideone.com/K452p
/^http://m\.autohome\.com\.cn\/[a-z0-9]{32}\//$domain=m.autohome.com.cn	ChinaList+EasyList	http://www.adtchrome.com/extension/adt-chinalist-easylist.html
/^http://www\.tt1069\.com\/(?!bbs)/$script,domain=tt1069.com	ChinaList+EasyList	http://www.adtchrome.com/extension/adt-chinalist-easylist.html
/^http://www\.iqiyi\.com\/common\/flashplayer\/[0-9]{8}/[0-9a-z]{32}.swf/$domain=iqiyi.com	ChinaList+EasyList	http://www.adtchrome.com/extension/adt-chinalist-easylist.html
/^http://www\.dnvod\.eu.*?\/[a-z0-9]{9,}\.swf/$domain=dnvod.eu	ChinaList+EasyList	http://www.adtchrome.com/extension/adt-chinalist-easylist.html
/NetInsight/text/$domain=~ads.pandora.tv\|~opt.mgoon.com	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/omniture/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/NetInsight/html/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/cgi-bin/conad.fcgi/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/acecounter/$domain=~acecounter.com	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/adNdsoft/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/wisenut/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/ad-pay/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/wp-content/plugins/google-analyticator/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/realclick/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/max-banner-ads-pro/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/RealMedia/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/bannerManager/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/autoPage/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/overture/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/wiseAd/euckr/inc/$subdocument	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/NetInsight/js/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/scrap_logs/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/banner_event/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/images/adpresso/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/AdBanner/	Korean Adblock List	https://github.com/gfmaster/adblock-korea-contrib
/cdsbData_gal/bannerFile/$image,domain=mybogo.net\|zipbogo.net	List-KR	https://list-kr.github.io/
/nad/media/	List-KR	https://list-kr.github.io/
/ajrotator/	Filtros Nauscopicos	http://nauscopio.nireblog.com/cat/filtrado
/:\/\/(?!biuropodrozy)(?!liveblog)(?!relacje)(?!opinie)(?!zalacznik)(?!magazyn)(?!newsletter)(?!rodzinnawycieczka)(?!doladowania)(?!fantasyliga)(?!funduszeue)(?!imperiumstylu)(?!kodyrabatowe)(?!ogloszenia)(?!orangekinoletnie)(?!rekrutacja)(?!rycerzeiksiezniczki)(?!speedwaymanager)(?!sportowefakty)(?!sportowybar)(?!talesofmagic)(?!ubezpieczenia)(?!warofdragons)(?!wiadomosci)[a-zA-Z0-9]{10,}\.wp.pl\//	Adblock polskie reguły	http://certyficate.it/polski-filtr-adblock/
/:\/\/(?!biuropodrozy)(?!liveblog)(?!relacje)(?!opinie)(?!zalacznik)(?!magazyn)(?!newsletter)(?!facet)(?!wyleczto)(?!kuchnia)(?!film)(?!moto)(?!gwiazdy)(?!teleshow)(?!finanse)(?!kobieta)(?!dom)(?!pogoda)(?!tech)(?!historia)(?!czat)(?!ksiazki)(?!gryonline)(?!hotele)(?!narty)(?!samoloty)(?!wycieczki)(?!hosting)(?!irlandia)(?!multikurs)(?!casino)(?!foto)(?!tech)(?!www)(?!stg)(?!doladowania)(?!fantasyliga)(?!funduszeue)(?!imperiumstylu)(?!kodyrabatowe)(?!alefolwark)(?!angielski)(?!arenamody)(?!beniamin)(?!bon)(?!bsg)(?!casino)(?!diety)(?!dlaprasy)(?!dlugi)(?!doladowania)(?!dom)(?!dysk)(?!ebiznes)(?!ebooki)(?!empire)(?!fantasyliga)(?!film)(?!fundusze)(?!ogloszenia)(?!orangekinoletnie)(?!rekrutacja)(?!rycerzeiksiezniczki)(?!speedwaymanager)(?!sportowefakty)(?!sportowybar)(?!talesofmagic)(?!ubezpieczenia)(?!warofdragons)(?!wiadomosci)(?!gazetki)(?!gry)(?!horoskop)(?!kalendarz)(?!katalog)(?!khanwars)(?!komiks)(?!konflikty)(?!kontakty)(?!korsarze)(?!kultura)(?!mini)(?!mmho)(?!mobilna)(?!morizon)(?!moto)(?!muzyka)(?!narty)(?!naryby)(?!onas)(?!orangekinoletnie)(?!piraci)(?!poczta)(?!pomoc)(?!praca)(?!profil)(?!programtv)(?!pytamy)(?!rekrutacja)(?!rss)(?!rtvagd)(?!rycerzeiksiezniczki)(?!smeet)(?!speedwaymanager)(?!szkola)(?!szukaj)(?!tech)(?!teleshow)(?!triviador)(?!turystyka)(?!twojeip)(?!ulubiency)(?!warodfragons)(?!wycieczki)(?!zdrowie)(?!zoomumba)(?!topnews)(?!erotyka)(?!dzieci)(?!fitness)(?!gielda)(?!finansomat)(?!biznes)(?!sport)[a-zA-Z0-9]{4,9}\.wp.pl\//	Adblock polskie reguły	http://certyficate.it/polski-filtr-adblock/
/commoncfm/images/microsoftxboxone/$domain=buffed.de\|gamesaktuell.de\|gamezone.de\|pcgames.de\|videogameszone.de	German filter	http://adguard.com/filters.html#german
/[a-z0-9]{32,}/$third-party,domain=picshare.ru	Russian filter	http://adguard.com/filters.html#russian
/[a-zA-Z0-9]{35,}/$script,third-party,domain=bigtorrent.org\|bigtorrents.ru\|cashtube.ru\|cmexota.ru\|dreamprogs.net\|dsvload.net\|ecsebo.ru\|enotbox.com\|faspiic.ru\|imagefile.org\|imgpay.ru\|kordonivkakino.net\|mcdownloads.ru\|mega-pic.org\|odnopolchane.net\|payforpic.ru\|pic4cash.ru\|pic4you.ru\|picclick.ru\|picforall.ru\|pics-money.ru\|pirat-pic.ru\|planeta51.com\|pronpic.org\|prons.org\|q32.ru\|rustorrents.net\|santikov.net\|sharezones.biz\|torrent-pirat.com\|unionpeer.org\|uraltrack.net\|viewy.ru\|xhamster-pic.com	Russian filter	http://adguard.com/filters.html#russian
/http:\/\/rustorka.com\/[a-z]+\.js/$domain=rustorka.com	Russian filter	http://adguard.com/filters.html#russian
/http:\/\/rustorka.com\/[a-z0-9]+\.(jpg\|gif)/$image,domain=rustorka.com	Russian filter	http://adguard.com/filters.html#russian
/[a-zA-Z0-9]{35,}/$domain=anime-free.net\|cyberpirate.me\|imgbum.net\|online-porno-hd.ru\|tecnomectrani.com	Russian filter	http://adguard.com/filters.html#russian
/[a-z0-9]{30,}/$script,third-party,domain=free-torrent.org\|free-torrents.org	Russian filter	http://adguard.com/filters.html#russian
/^http://[a-z0-9_]{15,}\.[a-z0-9-]+\.[a-z]{2,}\/.*[a-zA-Z0-9]{100,}/$object-subrequest,domain=wat.tv	Liste FR	http://adblock-listefr.com/
/^http://[a-z0-9_-]{10,}\.[a-z0-9-]+\.[a-z]{2,}\/.*?\w{30,}/$~xmlhttprequest,domain=gentside.com\|maxisciences.com\|ohmymag.com	Liste FR	http://adblock-listefr.com/
/content/stargate/$domain=hlamer.ru\|kadu.ru\|krasview.ru	RU AdList	https://code.google.com/p/ruadlist/
/output/index/$third-party,script	RU AdList	https://code.google.com/p/ruadlist/
/https?://(?!(mc\.yandex\.ru\|www\.google-analytics\.com)/)/$third-party,script,subdocument,domain=massivmebel.by	RU AdList	https://code.google.com/p/ruadlist/
/^https?://goodgame\.ru/[a-z0-9]+$/$subdocument,domain=goodgame.ru	RU AdList	https://code.google.com/p/ruadlist/
/wp-content/plugins/popup-maker/$domain=info-life.in.ua\|intermarium.com.ua\|paragraf.net.ua\|unn24.com.ua\|varota.com.ua	RU AdList	https://code.google.com/p/ruadlist/
/^https?://(?!static\.)([^.]+\.)+?fastpic\.ru[:/]/$script,domain=fastpic.ru	RU AdList	https://code.google.com/p/ruadlist/
/images/brandings/$image,domain=sc2tv.ru	RU AdList	https://code.google.com/p/ruadlist/
/default/vbanners/$domain=noi.md	RU AdList	https://code.google.com/p/ruadlist/
/branding/$subdocument,domain=fanserials.tv\|kino-filmi.net	RU AdList	https://code.google.com/p/ruadlist/
/serial_adv_files/$image,domain=xn--80aacbuczbw9a6a.xn--p1ai\|куражбамбей.рф	RU AdList	https://code.google.com/p/ruadlist/
/^https?://(?!www\.)([^.]+\.)+?(kordonivkakino\.net\|m(ac-torrent-download\.net\|oviki\.ru))[:/]/$script	RU AdList	https://code.google.com/p/ruadlist/
/popupclick/$popup	RU AdList	https://code.google.com/p/ruadlist/
/http://[a-zA-Z0-9]+\.[a-z]+\/.(?:[!"#$%&'()+,:;<=>?@/\^_`{\|}~-]).*[a-zA-Z0-9]+/$script,third-party,domain=keezmovies.com\|redtube.com\|tube8.com\|tube8.es\|tube8.fr\|www.pornhub.com\|youporn.com	EasyList	https://easylist.github.io/
/\/[0-9].*\-.*\-[a-z0-9]{4}/$script,xmlhttprequest,domain=gaytube.com\|keezmovies.com\|spankwire.com\|tube8.com\|tube8.es\|tube8.fr	EasyList	https://easylist.github.io/
/\.sharesix\.com/.*[a-zA-Z0-9]{4}/$script	EasyList	https://easylist.github.io/
/\.filenuke\.com/.*[a-zA-Z0-9]{4}/$script	EasyList	https://easylist.github.io/
/^http://m\.autohome\.com\.cn\/[a-z0-9]{32}\//$domain=m.autohome.com.cn	EasyList China	http://abpchina.org/forum/
/^http://www\.iqiyi\.com\/common\/flashplayer\/[0-9]{8}/[0-9a-z]{32}.swf/$domain=iqiyi.com	EasyList China	http://abpchina.org/forum/
/^http://www\.dnvod\.eu.*?\/[a-z0-9]{9,}\.swf/$domain=dnvod.eu	EasyList China	http://abpchina.org/forum/
/^http://www\.tt1069\.com\/(?!bbs)/$script,domain=tt1069.com	EasyList China	http://abpchina.org/forum/
/ulightbox/$domain=hdkinomax.com\|tvfru.net	RU AdList: BitBlock	https://code.google.com/p/ruadlist/
/http://cdn[0-9]\.spiegel\.de/images/image-([^-]+)-[^-]+-[^-]+-(?!\1)[^-]+\.jpg/$image,domain=spiegel.de	EasyList Germany	https://easylist.github.io/

ameshkov commented 8 years ago

Please note the number of rules which are mistakenly made regexp-type.

kzar commented 8 years ago

@gorhill I've not been involved in that issue so far, so I've just done a quick bit of reading. I might get some things wrong.

While I agree that grabbing a keyword from the regexp seems scary, I'm not sure how the suggested token option would help. Take your filenuke example, there the automatic keyword would have been "filenuke" anyway.

Now if you think of a more advanced example which matches one of two possible domains, what would you put for the token option? If you chose part of one of the domains as the keyword, you'd end up not matching the other domain. Instead you'd have to omit the token option, which would give the same result as the automatic approach. (Since they mention that those kinds of strings should be ignored.)

kzar commented 8 years ago

(I wonder if we could copy the content blocking approach of compiling all these regular expressions into a finite state machine? That could be a way to make matching regular expression filters faster without worrying about keywords.)

ameshkov commented 8 years ago

(I wonder if we could copy the content blocking approach of compiling all these regular expressions into a finite state machine? That could be a way to make matching regular expression filters faster without worrying about keywords.)

This would be overkill.

In order to do it they have restricted regular expression support to a very limited subset.

gorhill commented 8 years ago

Take your filenuke example

Yes, bad example. Here is another one found in EasyList:

/\/[0-9].*\-.*\-[a-z0-9]{4}/$script,xmlhttprequest,domain=gaytube.com|keezmovies.com|spankwire.com|tube8.com|tube8.es|tube8.fr

Not sure if a token was available for this one -- whoever created the filter knows. Mainly, my point is that a token= option would be an easy, low-tech way to deal with this, available immediately (easy implementation), with no need for a regex parser (which would fail anyway with the filter here). If no token is present for an untokenizable filter, then we just end up with the current behavior.

ameshkov commented 8 years ago

Let's first think about what issue we are trying to solve.

First of all, domain-restricted filters are not a problem as there is no influence on the overall performance.

I suppose that what we really need is to reduce the negative impact of the mistakes made by filter authors, for instance filters like /ajrotator/ and such. There are no problems with extracting a token from a rule like this.

Here is just a dirty example of a token extracting function:

var extractToken = function(ruleText) {

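    // Get the regexp text (the part between the enclosing slashes)
    var match = ruleText.match(/\/(.*)\/(\$.*)?/);
    if (match === null) {
        // Not a regexp rule, nothing to extract
        return null;
    }
    var reText = match[1];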

    var specialCharacter = "...";

    if (reText.indexOf('(?') >= 0 || reText.indexOf('(!?') >= 0) {
        // Do not mess with complex expressions which use lookahead
        return null;
    }

    // (Dirty) prepend specialCharacter for the following replace calls to work properly
    reText = specialCharacter + reText;

    // Strip all types of brackets
    reText = reText.replace(/[^\\]\(.*[^\\]\)/, specialCharacter);
    reText = reText.replace(/[^\\]\[.*[^\\]\]/, specialCharacter);
    reText = reText.replace(/[^\\]\{.*[^\\]\}/, specialCharacter); 

    // Strip some special characters
    reText = reText.replace(/[^\\]\\[a-zA-Z]/, specialCharacter); 

    // Split by special characters
    var parts = reText.split(/[\\^$*+?.()|[\]{}]/);
    var token = "";
    var iParts = parts.length;
    while (iParts--) {
        var part = parts[iParts];
        if (part.length > token.length) {
            token = part;
        }
    }

    return token;
};

I've tried this function with the rules above and here is the result: https://ameshkov.github.io/web/regex-tokens.html?1

As for the token proposal, here are the downsides I see:

It does not solve the issue with regex filters created by mistake.
Complex rules which cannot be tokenized are rare. There are only 2 such filters in EasyList and both are domain-restricted.
No backward compatibility, filters with unknown options will be ignored by old versions. Also, for instance, getadblock guys aren't invited to our party so it could be a surprise for them.

gorhill commented 8 years ago

getadblock guys aren't invited to our party

They are using ABP's filtering engine since AdBlock v3.0. See https://github.com/kzar/watchadblock/releases/tag/3.0.

ameshkov commented 8 years ago

The other points still stand though:)

gorhill commented 8 years ago

I wasn't aware of the many erroneous regex filters; it looks like these cases can easily be addressed with trivial code.
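
A minimal sketch of such a check, assuming a regex-type filter counts as a mistake when its body contains nothing but characters without special meaning in a regular expression. literalFromRegexFilter is a hypothetical name, and how the literal is then stored as a plain filter is left out:

```javascript
// Hypothetical helper: return the literal string a regex-type filter
// matches, or null when the filter is a genuine regular expression.
function literalFromRegexFilter(filterText) {
    // Body between the enclosing slashes, optionally followed by $options.
    var match = /^\/(.+)\/(?:\$.*)?$/.exec(filterText);
    if (match === null) {
        return null;
    }
    var body = match[1];
    // Assumption: these characters carry no special meaning in a regex body.
    return /^[A-Za-z0-9_%\/-]+$/.test(body) ? body : null;
}
```

With the rules listed above, /ajrotator/ and /wiseAd/euckr/inc/$subdocument come out as literals, while /[a-z0-9]{32,}/$third-party,domain=picshare.ru is left alone.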

Mainly it was just to throw an idea out there, since these untokenizable filters have always bothered me[1], and I knew there was an issue like this opened on ABP issue tracker -- so I just threw the idea out there to have an easy fix, worth only if actually used by filter list maintainers.

Anyway, I will just use this issue here to throw ideas once in a while which I think might be good for all blockers[2], especially when it comes to make the life of filter list maintainers easier.

[1] I was looking to even skip testing for domain hit -- but this is an implementation-dependent detail I suppose

[2] I understand that when a filter syntax is not supported by ABP, EasyList et al. maintainers won't use it.

ameshkov commented 8 years ago

[2] I understand that when a filter syntax is not supported by ABP, EasyList et al. maintainers won't use it.

By the way, I'd like to raise a question about the non-standard syntax.

You have recently added a couple of pseudo-classes extending element hiding rules syntax. I am talking about :has(), :xpath(), :matches-css [1] and such.

The idea is really great and we will support some of these extended selectors as well (:has() and :contains() are currently in the beta testing stage, :matches-css() is coming).

However, there is one issue that bothers me. The syntax you use (pseudo-classes syntax) is not backward-compatible and it will break good old stylesheet-based ad blockers like Adguard and ABP.

/* browser will ignore the whole style due to the second selector */
#banner, #banner:has(.test) { display: none; }

I suggest introducing a backward-compatible syntax along with the modern pseudo-classes-based one.

Backward compatible synonym for :has(...) will be [-ext-has="..."]
Backward compatible synonym for :matches-css(...) will be [-ext-matches-css="..."]
Backward compatible synonym for :xpath(...) will be [-ext-xpath="..."]
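
A rough sketch of the mapping between the two syntaxes, assuming the selector ends with a single extended operator and that double quotes inside the argument are escaped with a backslash. toBackwardCompatible and hasBalancedParens are hypothetical names, not part of any extension:

```javascript
// Extended pseudo-classes named in this thread.
var EXTENDED_OPERATOR = /^(.*?):(has|matches-css|xpath)\((.*)\)$/;

// Returns true when the parentheses in `text` are balanced.
function hasBalancedParens(text) {
    var depth = 0;
    for (var i = 0; i < text.length; i++) {
        if (text[i] === '(') {
            depth++;
        } else if (text[i] === ')' && --depth < 0) {
            return false;
        }
    }
    return depth === 0;
}

// Hypothetical translation; returns null for selectors it cannot handle.
function toBackwardCompatible(selector) {
    var match = EXTENDED_OPERATOR.exec(selector);
    if (match === null) {
        return selector; // nothing to translate
    }
    if (!hasBalancedParens(match[3])) {
        return null; // e.g. several chained operators: not handled here
    }
    var argument = match[3].replace(/\\/g, '\\\\').replace(/"/g, '\\"');
    return match[1] + '[-ext-' + match[2] + '="' + argument + '"]';
}
```

For instance, #banner:has(.test) becomes #banner[-ext-has=".test"], which a browser parses as an ordinary attribute selector, so the rest of the rule survives.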

[1] As I understand, there is a backward compatible :matches-css() option already: https://issues.adblockplus.org/ticket/2390

kzar commented 8 years ago

You have recently added a couple of pseudo-classes extending element hiding rules syntax. I am talking about :has() ...

FWIW We are working towards adding the :has selectors too https://issues.adblockplus.org/ticket/3143

Anyway, I will just use this issue here to throw ideas once in a while which I think might be good for all blockers[2], especially when it comes to make the life of filter list maintainers easier.

:+1: Please do, I think collaboration benefits us all.

ameshkov commented 8 years ago

@kzar so, what do you think about the backward compatible syntax proposition?

ameshkov commented 8 years ago

@kzar regarding Lain's comment:

I think it's worth mentioning that :has() selector must work in combination with -abp-properties. So, filter like site.name##.block:has([-abp-properties="background: yellow"])

Using proposed syntax it could look like this: ##.block[-ext-has="*:matches-css(background: yellow)"]

kzar commented 8 years ago

@ameshkov Well I think the idea is that when browsers eventually support :has selectors those filters will be again using standard CSS selectors anyway. We only need to implement special logic for those filters in the mean time as a stop-gap. I guess it's true (and unfortunate) that the syntax will break filters for ad blockers which haven't added support for now, but I guess that's not too bad since uBlock, AdGuard and Adblock Plus all plan to support them. (Also because they are only planned to be something used as a last resort.)

As for the general point of using backward compatible syntax like you've suggested, I think it's a good idea. (We already do something like that for CSS property filters using the -abp-properties attribute.)

ameshkov commented 8 years ago

Well I think the idea is that when browsers eventually support :has selectors those filters will be again using standard CSS selectors anyway.

True. However, here is one more argument for that type of syntax. We all support a lot of different browsers (including mobile and such) and trying to use pseudo-classes syntax requires us to do it simultaneously for all the platforms. While backward-compatible syntax allows us to roll this feature out gradually.

As for the general point of using backward compatible syntax like you've suggested, I think it's a good idea. (We already do something like that for CSS property filters using the -abp-properties attribute.)

Yeah, I know, that's why I was surprised by the implementation proposed in the issue 3143.

gorhill commented 8 years ago

I suggest introducing a backward-compatible syntax along with the modern pseudo-classes-based one.

I will support the backward-compatible syntax where possible, but personally, internally I prefer using the :() syntax. I see these new operators as nodes in a processing graph, so I consider being able to combine them easily and freely a requirement for the future. Example[1]:

div.red:has(div.blue:matches-css(position: fixed;):contains(allo)):contains(publicité)

It does feel to me like a backward-compatible syntax would complicate writing such filters (especially the use of quotes):

div.red[-ext-has="div.blue[-ext-matches-css=\"position: fixed;\"][-ext-contains=\"allo\"]"][-ext-contains="publicité"]

Aren't you validating element hiding filters at load time (or else using invalid CSS selector would break element hiding) so isn't true that old versions will discard filters with this new syntax? (Element:matches('div:has(span)') would throw).

[1] Ok, the example is contrived, but it's just to illustrate easily combining such filters.

ameshkov commented 8 years ago

It does feel to me like a backward-compatible syntax would complicate writing such filters (especially the use of quotes):

Yeah, frankly, when I check something, I prefer to use the newer syntax as well.

However, it's not that bad, there's no need to support it inside of a composite filter.

Here, look at this example:

div.red[-ext-has="div.blue:matches-css(position: fixed):contains(allo):contains(publicité)"]

ameshkov commented 8 years ago

Aren't you validating element hiding filters at load time (or else using invalid CSS selector would break element hiding) so isn't true that old versions will discard filters with this new syntax? (Element:matches('div:has(span)') would throw).

Nope, in fact it came as a surprise to us:) Also there's no way we could do it in desktop and mobile versions.

ameshkov commented 8 years ago

@gorhill one more thing regarding the :matches-css(). I propose using a bit different syntax for it.

Could you please read this issue description and tell me what you think about it? https://github.com/AdguardTeam/ExtendedCss/issues/7

gorhill commented 8 years ago

Q: Why additional pseudo-classes for matching before and after

I already support selector:after:style-properties(pattern), I just extract the :after before using the selector at setup time. But I would not mind selector:style-properties-before(pattern) -- it would just make the setup code a bit simpler.

Q: Why pattern-matching?

I agree with (optional) pattern matching. Pattern-matching is not something I implemented, but I don't see a problem supporting this. For the implementation side of such a filter however, I would just want to be sure its semantics do not force a very specific implementation.[1]

I suppose that using this approach we could also cover existing abp-properties rules

Note that ABP's -abp-properties has been implemented with very different semantics in mind than something like :matches-css: it is meant to reverse-lookup CSS rules. Such filters shouldn't be used directly on a set of nodes for filtering purposes. The purpose of all the filters I have been adding lately is to reduce a set of nodes (starting with one as small as possible), so the suffix part is key: starting with the smallest possible set of nodes is key for performance.

For example, a filter such as wetter.com##[-abp-properties='margin-left: 24px'], given that it has no suffix selector, would have to be tested for all elements on a page, which would just kill performance.

[1] I see using cssText as a potentially high overhead approach, so I went with the dictionary approach, to test only for the enumerated properties. a) I suspect the cssText string is generated on the fly by the browser when "getted"; b) using cssText forces the use of a regex which will apply to a potentially large string.
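
A minimal sketch of the dictionary approach, assuming exact string comparison of each enumerated property. parseDeclarations and matchesCss are hypothetical names, and the style argument stands in for what window.getComputedStyle(element) returns in a browser:

```javascript
// Parse "position: fixed; top: 0px" into { position: 'fixed', top: '0px' }.
// Assumption: values contain no ";" (not true for every CSS value).
function parseDeclarations(text) {
    var declarations = {};
    text.split(';').forEach(function(part) {
        var colon = part.indexOf(':');
        if (colon === -1) {
            return;
        }
        var name = part.slice(0, colon).trim().toLowerCase();
        var value = part.slice(colon + 1).trim();
        if (name !== '') {
            declarations[name] = value;
        }
    });
    return declarations;
}

// `style` is any object exposing getPropertyValue(name). Only the properties
// enumerated in the filter are read; no cssText string is ever built.
function matchesCss(style, declarations) {
    return Object.keys(declarations).every(function(name) {
        return style.getPropertyValue(name) === declarations[name];
    });
}
```

The cost then grows with the number of properties named in the filter rather than with the size of the element's full computed style.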

ameshkov commented 8 years ago

I already support selector:after:style-properties(pattern)

It may look pretty good, but it bothers me that :after in fact can't be part of a valid selector as pseudo-element cannot be selected. I suppose it could mislead a filter author.

[1] I see using cssText as a potentially high overhead approach, so I went with the dictionary approach, to test only for the enumerated properties. a) I suspect the cssText string is generated on the fly by the browser when "getted"; b) using cssText forces the use of a regex which will apply to a potentially large string.

Yep, I've run into a number of issues while implementing it. For now I've used a cross-browser function for extracting the cssText string: https://github.com/AdguardTeam/ExtendedCss/blob/feature/issues/7/lib/style-property-matcher.js#L96

Also I agree with you on the enumerated properties approach. There's no need to build the cssText field, so I will change the current implementation.

For example, a filter such as wetter.com##[-abp-properties='margin-left: 24px'],

Yeah, you're right. Also now when I know how this type of rules work, I find it a bit misleading. At least I think Lain_13 does not understand how it works.

@kzar what do you think about implementing something more "straightforward"?

ameshkov commented 8 years ago

I guess if we use the properties approach and agree on *-before/after postfix, there is no need for me to use another name for that pseudo class. matches-css, matches-css-before and matches-css-after sounds good and describes the filter behaviour very well.

gorhill commented 8 years ago

matches-css, matches-css-before and matches-css-after sounds good and describes the filter behaviour very well.

I agree with this. This new selector, combinable with :has(), is going to make filter list maintainers' lives easier.

ameshkov commented 8 years ago

I've updated the syntax description: https://github.com/AdguardTeam/ExtendedCss/issues/7

gorhill commented 8 years ago

Looking into this specific case this morning: https://github.com/uBlockOrigin/uAssets/issues/110.

This would be solvable without exception filters if it was possible to outright remove the targeted nodes from the DOM:

finanzen.net###bodyCenter > div[id]:has(:scope > #Ads_BA_Sky):remove()

The current implicit action to take on targeted nodes is to hide them. However, being able to re-style has made the job of working against anti-blocker mechanisms much easier (AdGuard supports this).

Additionally, being able to remove nodes from the DOM is something I have found would take care of many other cases as well (I do believe AdGuard supports this in some ways, not sure). From my point of view, being forced to whitelist network requests from 3rd-party advertisers/trackers is always the worst option, and we should extend the capabilities of cosmetic filtering (element hiding) to avoid such whitelisting.

ameshkov commented 8 years ago

Oh, you have finally faced these German wunderwaffe-anti-adblock-solutions:) I was impressed when I saw this particular script for the first time.

Currently the easiest way to circumvent it is to inject a script like this:

Object.defineProperty(window, `UABPtracked`, { get: function() { return true; }, set: function() {} })

ameshkov commented 8 years ago

Regarding the DOM nodes removal thing, I need some time to think about it.

gorhill commented 8 years ago

Currently the easiest way to circumvent it is to inject a script like this

I didn't realize they were using the uabp thing, I already had a scriptlet to take care of these -- it was not injected on that site.

Though in the long term, scriptlets require more work and maintenance, and I would rather use generic cosmetic filter syntax where possible. In the current case, a node removal would work. It would also work for that case (edit: never mind, would not work for this case). Anyway, something to think about.

ameshkov commented 8 years ago

In the current case, a node removal would work

However, in this particular case node removal is not the best solution. This anti-adblock script is pretty ugly, it sets up a timer and redraws ads every 5 or so seconds. And with nodes removed it continues to do something with DOM.

Talking about anti-adblock scripts, I really do not see a good declarative solution which does not involve scripting.

ameshkov commented 8 years ago

Let's start with analysis. Most of the things we discuss are directly caused by the websites trying to circumvent ad blocking.

Basically, there are two approaches:

1. Make the ad layout random, or make it look exactly like the content layout.
2. Detect an ad blocker and show some warning or even redirect the user to a blocking page.

Point 1 can be solved by the new pseudo-classes (at least for now). Point 2 can be solved by scripts (like reek's AAK for instance).

Btw, reek is the best anti-adblock scripts expert I know, let's ask his opinion.

kzar commented 7 years ago

@gorhill @ameshkov We are discussing WebSocket circumvention on the Adblock Plus issue tracker, but unfortunately we've had to make the issue confidential. (Guess why...) Anyway I'd like to copy you both in on the issue, as mapx pointed out it would be good to get your feedback there too.

Are you guys signed up on our issue tracker? If so what are your usernames?

kzar commented 7 years ago

@gorhill Also a possibly dumb question, doesn't a Content Security Policy like connect-src http:; frame-src http: also prevent https connections?

gorhill commented 7 years ago

doesn't a Content Security Policy like connect-src http:; frame-src http: also prevent https connections?

Not according to spec:

The URL matching algorithm now treats insecure schemes and ports as matching their secure variants. That is, the source expression http://example.com:80 will match both http://example.com:80 and https://example.com:443.
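
A rough sketch of the scheme part of that rule, as read from the matching algorithm in CSP Level 3 (an assumption worth checking against the specification itself). schemePartMatches is a hypothetical name, and port matching is left out:

```javascript
// Does the scheme of a source expression match the scheme of a URL?
// Insecure schemes also match their secure variants, never the reverse.
function schemePartMatches(expressionScheme, urlScheme) {
    var a = expressionScheme.toLowerCase();
    var b = urlScheme.toLowerCase();
    if (a === b) {
        return true;
    }
    if (a === 'http') {
        return b === 'https';
    }
    if (a === 'ws') {
        return b === 'wss' || b === 'http' || b === 'https';
    }
    if (a === 'wss') {
        return b === 'https';
    }
    return false;
}
```

Under that reading, a policy such as connect-src http: still lets https: connections through, while connect-src https: does not allow plain http:.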

ameshkov commented 7 years ago

Guess why...

They will see it anyway:)

Are you guys signed up on our issue tracker? If so what are your usernames?

Just signed up, username is ameshkov

kzar commented 7 years ago

@gorhill Ahh, makes sense. @ameshkov Cool, added you to the issue.

kzar commented 7 years ago

@gorhill, @ameshkov Heads up, we're going to consider WebSocket requests as the type "websocket" instead of "other" in the future. More details in this blog post: https://adblockplus.org/development-builds/new-filter-type-option-for-websockets

ameshkov commented 7 years ago

@kzar hey Dave, thanks for the heads up.

removed. much confidential, very secret.

ameshkov commented 7 years ago

@gorhill @kzar Btw, have you already seen the bleeding edge technology: loading ads code through RTCPeerConnection?

gorhill commented 7 years ago

have you already seen the bleeding edge technology

Yes, first time I saw it on Merriam-Webster's site.

ameshkov commented 7 years ago

Any idea besides wrapping RTCPeerConnection?

gorhill commented 7 years ago

So far, no -- aside giving users the option of disabling entirely WebRTC.
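
For reference, a minimal sketch of what "wrapping" means here, assuming the wrapper runs in the page before any site script does. wrapPeerConnection and isAllowed are hypothetical names; scope would be window in a page:

```javascript
// Replace scope.RTCPeerConnection with a constructor that consults a
// caller-supplied policy function before creating the real object.
function wrapPeerConnection(scope, isAllowed) {
    var Original = scope.RTCPeerConnection;
    if (typeof Original !== 'function') {
        return false; // WebRTC not available, nothing to wrap
    }
    var Wrapped = function(configuration) {
        if (!isAllowed(configuration)) {
            throw new Error('RTCPeerConnection blocked');
        }
        return new Original(configuration);
    };
    Wrapped.prototype = Original.prototype;
    scope.RTCPeerConnection = Wrapped;
    return true;
}
```

This is only as strong as the wrapper's coverage: a page may still reach an unwrapped constructor through a prefixed variant such as webkitRTCPeerConnection or through a freshly created iframe, which is part of why it is not a satisfying answer.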

kzar commented 7 years ago

@ameshkov No, I did not realise people already started abusing WebRTC. Man. :-1:

kzar commented 7 years ago

Do you guys have an URL for an example of a website using WebRTC for circumvention that I can take a look at?

ameshkov commented 7 years ago

Actually would you mind removing that comment here?

Done;)

So far, no -- aside giving users the option of disabling entirely WebRTC.

Does it really work in Chrome? I thought it was a bit limited.

ameshkov commented 7 years ago

Do you guys have an URL for an example of a website using WebRTC for circumvention that I can take a look at?

Code example: https://forum.adguard.com/index.php?threads/block-rtcpeerconnection.13808/#post-102128

gorhill commented 7 years ago

I'd rather discuss our WebSocket plans in the issue on our tracker, since it's marked confidential

I understand not discussing ideas of workarounds for our own blocking solutions, but here I don't see the point, the websocket issue came about because it's already used out there.

kzar commented 7 years ago

@ameshkov Thanks!

kzar commented 7 years ago

@gorhill There's a new issue I'd like to involve you with but can't unless you have a user on our issue tracker. Mind creating one?

gorhill / uBlock

Placeholder issue for discussion of issues in ABP/AdGuard issue tracker -- and possible solutions #1930