uBlockOrigin / uBlock-issues

This is the community-maintained issue tracker for uBlock Origin
https://github.com/gorhill/uBlock

Address 1st-party tracker blocking #780

Closed aeris closed 4 years ago

aeris commented 4 years ago

Hello here!

Since Friday, we have hit a case of 1st-party tracking that seems to be unblockable.

This occurs on https://www.liberation.fr/, which embeds a 1st-party tracker, f7ds.liberation.fr, that points to an ugly tracking provider, Eulerian, via the CNAME liberation.eulerian.net.

This provider clearly states that it provides an unblockable tracker (images: EJAeTXvWwAAqTPz, EJAwd5wWkAAjmsN).

It seems Criteo is starting to ask the same of their customers, with 1st-party tracking pointing to a *.dnsdelegation.io subdomain.

In this case, it seems really difficult to block such trackers with tools like uBlock:

Do you have any way to detect and then block such content from the browser? The only (not so) efficient way I have at the moment is to use DNS tools like Pi-hole to blacklist IP ranges and CNAME resolution patterns. And even this way, it doesn't cover all possible cases… Even tools like µMatrix seem totally ineffective against such trackers…

uBlock-user commented 4 years ago

Do not post any filter list issues or issues where website's functionality is broken. We have uAssets issue tracker for that, post there instead.

https://github.com/uBlockOrigin/uBlock-issues#ublock-issues

gorhill commented 4 years ago

It's a technique used to bypass filters/rules, it's something which needs to be investigated.

liamengland1 commented 4 years ago

Dupe/related discussion: https://github.com/uBlockOrigin/uAssets/issues/6538

uBlock-user commented 4 years ago

Aren't they lying to PSL with these first-party domain entries ?

Edit: It's an inline-script, should be able to defuse via a scriptlet.

liberation.fr##+js(aopw, EA_data) works.
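For readers unfamiliar with the scriptlet: aopw is short for abort-on-property-write, which aborts a page script the moment it assigns to the named property. Below is a simplified sketch of that idea only, assuming the inline script assigns to EA_data. It is not uBO's actual scriptlet code; the helper name abortOnPropertyWrite and the fakeWindow object are made up for illustration.

```javascript
// Simplified illustration of "abort on property write".
// Hypothetical helper, NOT uBO's real implementation.
function abortOnPropertyWrite(owner, prop) {
    const token = 'aborted-' + Math.random().toString(36).slice(2);
    Object.defineProperty(owner, prop, {
        configurable: true,
        get() { return undefined; },
        // Any assignment throws, which stops the script doing the write.
        set() { throw new ReferenceError(token); },
    });
    return token;
}

// Stand-in for the page's global object (in a browser this would be `window`).
const fakeWindow = {};
abortOnPropertyWrite(fakeWindow, 'EA_data');

let trackerRan = false;
try {
    fakeWindow.EA_data = ['some', 'payload']; // what the inline script would do
    trackerRan = true;                        // never reached
} catch (e) {
    console.log('inline script aborted:', e.name);
}
```

Because the exception is thrown inside the tracker's own inline script, the rest of that script (including the code that injects the remote .js file) never runs.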

uBlock-user commented 4 years ago

Here's a crude dump of sites using Eulerian Analytics inline-script -- https://publicwww.com/websites/EA_data/

liamengland1 commented 4 years ago

@uBlock-user that scriptlet will only work for sites inserting the script using that variable. For other sites like oui.sncf, use this: https://github.com/uBlockOrigin/uAssets/issues/6538#issuecomment-552202850

uBlock-user commented 4 years ago

Websites I tested so far are using that variable, except for the one you mentioned. oui.sncf redirects me to https://en.oui.sncf/fr/?redirect=yes where parseInt.+?3600000 is not found in the inline-script.

As per view-source:https://en.oui.sncf/fr/?redirect=yes, this is the js --

<script>
(function(d, s, id) {
  if (d.getElementById(id)) return;
  var js = d.createElement(s),
      fjs = d.getElementsByTagName(s)[0],
      vscaUrl = "//wblt.oui.sncf";

  js.id = id;
  js.async = true;

  js.src = vscaUrl + "/prod/" +
      (vsca_pageTag.config.vsca_version ? vsca_pageTag.config.vsca_version + "/" : "") +
      vsca_pageTag.config.siteId +
      "/vsca.js?M2lU3mD1O47ZAzgnp0wX";

  fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'vscascript'));
</script> 

Filter -- oui.sncf##+js(acis, document.getElementById, vscaUrl)

liamengland1 commented 4 years ago

I'm in the US; it did not redirect me. The inline script on oui.sncf is

  <!--begineulerian-->
  <script type="text/javascript">
    (function(){var d=document,l=d.location;if(!l.protocol.indexOf('http')){var o=d.createElement('script'),a=d.getElementsByTagName('script')[0],cn=parseInt((new Date()).getTime()/3600000);o.type='text/javascript';o.async='async';o.defer='defer';
    o.src='//v.oui.sncf/content/vsc-fr/8lL.QlYVeQ7BL6AqQORYg_FeHeIQMaObMRxsXxGG0g--/'+cn+'.js';
    a.parentNode.insertBefore(o,a);}})();
  </script>
  <!--endeulerian-->

And the inline script in https://github.com/uBlockOrigin/uBlock-issues/issues/780#issuecomment-552206887 is not Eulerian, it is another tracker, not the one @aeris is talking about. Another site: officedepot.fr. Add officedepot.fr##+js(acis, document.createElement, parseInt)

uBlock-user commented 4 years ago

Probably because of the difference in our geolocations, we're not being served the same script. It may not be Eulerian, but it's in the same vein.

Another site: officedepot.fr

That one is definitely EA -- https://myip.ms/info/whois/109.232.195.156/k/3227454398/website/ea.officedepot.fr

aeris commented 4 years ago

New detections: keyade.com on rueducommerce.fr; omtrdc.net on sfr.fr

liamengland1 commented 4 years ago

Offtopic:

Weird thing: the scripts seem to follow a pattern, with names ending in 7825. So here's a regex you can add to your filters ... (note: I'm obviously not a regex expert) /(\.\w+)[.]?\/[A-z]{7}(7825)\.js$/

Example scripts:

https://f7ds.liberation.fr/aaAAaaA7825.js
https://v.oui.sncf/SNCFVOU7825.js
https://ea.officedepot.fr/potfrWW7825.js
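As a quick check, the pattern can be exercised against the three example scripts. The sketch below uses a slightly tightened variant, which is my own adjustment and not part of the original comment: [A-Za-z] replaces [A-z], because [A-z] also matches the punctuation characters that sit between "Z" and "a" in ASCII. The last URL is a hypothetical non-tracker script added for contrast.

```javascript
// Tightened variant of the pattern above (an assumption, not the original):
// [A-Za-z] instead of [A-z], which would also match [ \ ] ^ _ and `.
const eulerianScript = /\.\w+\/[A-Za-z]{7}7825\.js$/;

const samples = [
    'https://f7ds.liberation.fr/aaAAaaA7825.js',
    'https://v.oui.sncf/SNCFVOU7825.js',
    'https://ea.officedepot.fr/potfrWW7825.js',
    'https://www.example.com/js/app.js', // hypothetical non-matching URL
];

// One boolean per sample, in the same order.
const results = samples.map(url => eulerianScript.test(url));
console.log(results);
```

A pattern this specific only holds as long as the provider keeps the 7825 suffix and the seven-letter prefix, so it is best treated as a stopgap.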

Test sites: https://www.maeva.com and https://www.brandalley.fr/

Also another PublicWWW search: https://publicwww.com/websites/%22parseInt%28%28new+Date%28%29%29.getTime%28%29%2F3600000%29%22/

gwarser commented 4 years ago

Wondering if https://github.com/uBlockOrigin/uBlock-issues/issues/44 would apply here if implemented.

gorhill commented 4 years ago

It can't apply: the case given as an example makes use of legitimate subdomains, statics.liberation.fr and medias.liberation.fr.

I am looking at https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/dns/resolve, it can be used to expose the CNAME:

browser.dns.resolve('f7ds.liberation.fr', [ "canonical_name" ]).then(r => { console.log(r); });
Promise { <state>: "pending" }
Object { addresses: (1) […], canonicalName: "atc.eulerian.net", isTRR: false }

I will prototype and evaluate how to optimally use this in uBO with the utmost care.
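To make the idea concrete, here is a minimal sketch of how the canonical name returned by browser.dns.resolve could be checked against a list of known tracker domains. This is not how uBO implements it; the function names, the blockedSuffixes parameter and the stub resolver are assumptions made for illustration. The resolver is injectable so the sketch also runs outside Firefox.

```javascript
// Sketch only: decide whether a hostname is a CNAME alias of a known tracker.
// Function names and `blockedSuffixes` are illustrative assumptions.
function matchesSuffix(hostname, suffixes) {
    return suffixes.some(s => hostname === s || hostname.endsWith('.' + s));
}

// `resolve` defaults to Firefox's browser.dns.resolve (requires the "dns"
// permission); a stub can be injected instead, as done below.
async function uncloak(hostname, blockedSuffixes,
        resolve = h => browser.dns.resolve(h, ['canonical_name'])) {
    const record = await resolve(hostname);
    const canonical = record.canonicalName;
    // No alias, or alias identical to the original name: nothing to uncloak.
    if (!canonical || canonical === hostname) {
        return { hostname, canonical: hostname, blocked: false };
    }
    return {
        hostname,
        canonical,
        blocked: matchesSuffix(canonical, blockedSuffixes),
    };
}

// Stub resolver mimicking the record shown in the console output above.
const stubResolve = async () => ({
    addresses: ['192.0.2.1'], // placeholder address, not the real one
    canonicalName: 'atc.eulerian.net',
    isTRR: false,
});

const pending = uncloak('f7ds.liberation.fr', ['eulerian.net'], stubResolve);
pending.then(result => console.log(result));
```

The suffix match is done on label boundaries, so a hostname such as noteulerian.net would not be treated as belonging to eulerian.net.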

uBlock-user commented 4 years ago

Will this be applied in uMatrix too ?

gorhill commented 4 years ago

Yes.

uBlock-user commented 4 years ago

You will need to add a new permission named 'dns' in the manifest to use this API - https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/dns - and since this is a Firefox-only API, how will you address this in Chromium?

aeris commented 4 years ago

I am looking at https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/dns/resolve, it can be used to expose the CNAME:

Time to think about the future too. This detection can easily be bypassed by removing the CNAME and using a direct A/AAAA record. Perhaps it's time to include an IP range blacklist or AS number detection? :thinking: For Eulerian, the IP range (109.232.197.0/24) and ASN (AS50234) are dedicated, so there are no false positives or negatives, but it may be more complicated with shared ones…
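For illustration, an IP range check of this kind takes only a few lines. The sketch below is plain IPv4 arithmetic with hypothetical helper names (ipv4ToInt, inCidr); it says nothing about whether or how a browser extension could obtain the resolved address in the first place.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
    const parts = ip.split('.');
    const valid = parts.length === 4 &&
        parts.every(p => /^\d{1,3}$/.test(p) && Number(p) <= 255);
    if (!valid) throw new Error('not an IPv4 address: ' + ip);
    const [a, b, c, d] = parts.map(Number);
    return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True if `ip` falls inside the CIDR block, e.g. "109.232.197.0/24".
function inCidr(ip, cidr) {
    const [base, bitsText] = cidr.split('/');
    const bits = Number(bitsText);
    // /0 is handled separately: a shift by 32 is a no-op in JavaScript.
    const mask = bits === 0 ? 0 : (0xFFFFFFFF << (32 - bits)) >>> 0;
    return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

const eulerianRange = '109.232.197.0/24'; // range quoted in the comment above

// An arbitrary address inside the quoted range.
console.log(inCidr('109.232.197.42', eulerianRange));
// The address reported earlier in this thread for ea.officedepot.fr.
console.log(inCidr('109.232.195.156', eulerianRange));
```

The second check shows that the address reported earlier in the thread for ea.officedepot.fr lies outside the quoted /24, so a single range may not be enough in practice.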

gorhill commented 4 years ago

how will you address this in Chromium ?

uBO already makes use of Firefox-specific APIs, for example filterResponseData().

uBlock-user commented 4 years ago

I meant how will you fix this in Chromium..

gorhill commented 4 years ago

Best to assume it can't be fixed on Chromium if it does not support the proper API.

gwarser commented 4 years ago

the case given as an example makes use of legitimate subdomains

On a case-by-case basis, a regex with a whitelist-style assertion can be used:

/^https:\/\/(?!www|images|medias|statics)/$script,1p,domain=liberation.fr

rigelk commented 4 years ago

Time to think about the future too. This detection can easily be bypassed by removing the CNAME and using a direct A/AAAA record. Perhaps it's time to include an IP range blacklist or AS number detection?

@aeris I assume that would mean bundling a list of ranges to block, some of which would be generated from a list of known ASes. There is no API in Firefox to resolve the IP ranges of an AS, is there?

I reckon we could generate a list using RIPE's API (based on this data), with for instance: https://stat.ripe.net/data/routing-history/data.json?resource=AS50234 or a JS client for it (doc).

gorhill commented 4 years ago

On a case-by-case basis, a regex with a whitelist-style assertion can be used:

A csp= directive is preferable to a regex:

||liberation.fr^$csp=script-src www.liberation.fr images.liberation.fr medias.liberation.fr statics.liberation.fr

(plus whatever else is needed of course).
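The two allowlist approaches discussed here, the negative-lookahead regex and the csp= filter, can be derived from the same list of legitimate subdomains. The sketch below only builds and exercises the strings. It does not emulate how uBO or the browser enforces them, and the statics.liberation.fr/app.js URL is a hypothetical example.

```javascript
// Allowlisted subdomains, taken from the two filters quoted in this thread.
const allowed = ['www', 'images', 'medias', 'statics'];
const site = 'liberation.fr';

// Regex approach: the negative lookahead rejects allowlisted prefixes, so the
// pattern matches (i.e. the filter would block) every other https:// URL.
const blockPattern = new RegExp('^https:\\/\\/(?!' + allowed.join('|') + ')');

// csp= approach: build the equivalent script-src allowlist filter.
const cspFilter = '||' + site + '^$csp=script-src ' +
    allowed.map(sub => sub + '.' + site).join(' ');

// Tracker subdomain: matched by the pattern, hence blocked.
console.log(blockPattern.test('https://f7ds.liberation.fr/aaAAaaA7825.js'));
// Legitimate subdomain (hypothetical path): not matched, hence allowed.
console.log(blockPattern.test('https://statics.liberation.fr/app.js'));
console.log(cspFilter);
```

Note that the lookahead only checks the start of the hostname, so any subdomain beginning with one of the allowlisted words (for example a hypothetical www2.liberation.fr) would also be let through, whereas the CSP source list names exact hosts.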

aeris commented 4 years ago

@rigelk It will be difficult, yep :joy: Even obtaining the AS from an IP or domain is tricky, and is a full-time research topic for Tor (see this)

gwarser commented 4 years ago

A csp= directive is preferable to a regex:

Yes, I thought about it, but a page may include an unlimited number of external resources, and it will be hard not to block them accidentally.

uBlock-user commented 4 years ago

CSP will be the preferable solution for Chromium users.

aeris commented 4 years ago

New detection: Xiti now does 1st-party tracking: lemonde.fr → buf.lemonde.fr → buf-lemonde-fr-cddc.at-o.net

echo | openssl s_client -connect buf-lemonde-fr-cddc.at-o.net:443 |& rg depth=0
depth=0 C = FR, L = MERIGNAC, O = AT Internet, OU = Service Technique, CN = *.ati-host.net

AT Internet = Xiti

Same on client.boursorama.com → c0011.boursorama.com → c0011-boursorama-com-cddc.at-o.net

Image injection this time, no javascript involved…

aeris commented 4 years ago

La FNAC, 3 1st-parties :

aeris commented 4 years ago

A new and tricky case, more difficult to detect or block. 20minutes.fr includes content from 20mn.fr, which seems to be their CDN domain. Content (surely JS) from this CDN domain loads content back from the primary 20minutes.fr domain, via a.20minutes.fr, which is a-20minutes-fr-cddc.at-o.net and so Xiti.

More interesting, they also have a.20min.fr, pointing to ads.20min.maxcdn-edge.com, which is not currently in production, but this case is trickier to handle because the final domain is not a dedicated one. We need a regexp ^ads exclusion in this case.

roipoussiere commented 4 years ago

FYI, Eulerian kindly provides a test suite for you on their Privacy page:

Appendix: list of sites on which our clients use our software solutions

AT/Xiti also, but a bit of scripting is required.

Maybe these lists can be used to generate a blacklist of subdomains?

aeris commented 4 years ago

I'm trying to develop a tool that checks whether a domain has a Eulerian subdomain. You can't generate a 1st-level domain blacklist from just the top-level domain :sob: You really have to crawl the page, execute the JS, and listen on a dummy DNS resolver to catch the tracker. A POC is on the way.

roipoussiere commented 4 years ago

You can't generate a 1st-level domain blacklist from just the top-level domain :sob: You really have to crawl the page, execute the JS, and listen on a dummy DNS resolver to catch the tracker. A POC is on the way.

With a headless browser like PhantomJS, it's possible to execute a website's JS from a script.

roipoussiere commented 4 years ago

Also, Confess is a PhantomJS script that can be used to headlessly analyze web pages.

gorhill commented 4 years ago

I would prefer to keep the issue here as focused as possible: to deal with CNAME-ed hostnames. For investigation work about list of hostnames being CNAME'd or other "evasion" mechanisms, this is best done elsewhere -- though you can link to that elsewhere here if useful for the current issue. At this point whoever subscribed to this issue is being notified non-stop about every single new comment being made.

If you want to bring forth a new evasion mechanism, please open a new issue about it.

Sispheor commented 4 years ago

Maybe a reverse lookup could be done: once we have the final IP, check which DNS entry is linked to it. Or maybe add a community-based feature, where people can manually add an entry that is shared with other members. For each entry we could, like on Waze, add an "I validate it" button or something similar, to prevent false URLs or to clean up URLs that no longer exist. But for this last idea you need a server to broadcast all the info...

gorhill commented 4 years ago

If using 1.24.1b0 and above, to "uncloak" actual (canonical, CNAME) hostname, set advanced setting cnameAliasList to *.

Network requests for which the actual hostname differs from the original hostname will be replayed through uBO's filtering engine using the actual hostname. When I started developing the feature I could spot eulerian.net in the logger when visiting https://www.liberation.fr/, but I can no longer reproduce this. Regardless, uBO is now equipped to deal with 3rd-party disguised as 1st-party as far as Firefox's browser.dns allows it.

The next step is for me to pick a cogent way for filter list maintainers to be able to tell uBO to uncloak specific hostnames. Doing this by default for all hostnames is not a good idea, as it could cause a huge number of network requests to be evaluated twice, incurring pointless overhead with no benefit for basic users (default settings/lists), for example when it concerns CDNs, which are often aliased to the site using them.

uBlock-user commented 4 years ago

Access IP address and hostname information

For anyone wondering what this is: that's the new permission prompt shown when first updating to this build, or to any future stable build that uses the DNS WebExtensions API.

uBlock-user commented 4 years ago

but I can no longer reproduce this.

Disabling liberation.fr##+js(acis, document.createElement, '.js') found in uBO-Privacy makes reproduction possible again.

x0wllaar commented 4 years ago

Best to assume it can't be fixed on Chromium if it does not support the proper API.

Can't this be "emulated" in Chromium by resolving the hostnames using DNS over HTTPS in JSON format (https://developers.cloudflare.com/1.1.1.1/dns-over-https/json-format/)?

For example, I can use Cloudflare's DNS with curl -H 'accept: application/dns-json' 'https://cloudflare-dns.com/dns-query?name=f7ds.liberation.fr&type=CNAME' and get

{"Status": 0,"TC": false,"RD": true, "RA": true, "AD": false,"CD": false,"Question":[{"name": "f7ds.liberation.fr.", "type": 5}],"Answer":[{"name": "f7ds.liberation.fr.", "type": 5, "TTL": 2633, "data": "liberation.eulerian.net."}]}

Which obviously contains the tracking hostname.

There's an obvious issue with using Cloudflare for this (although Firefox does by default after you enable DoH, so probably it's not such a privacy disaster). There's at least one DoH resolver that supports the same JSON API and claims to respect user privacy, https://blahdns.com (I am not in any way affiliated with them).

To speed things up, maybe it's possible for uBlock to maintain its own cache of hostnames and re-resolve only once in a while.
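As a rough sketch of this suggestion, the JSON answer can be parsed for a CNAME record (type 5) and the result kept in a small cache that expires after the record's TTL. The names cnameFromDohAnswer and CnameCache are hypothetical, the network request itself is left out, and the response object is the one quoted above, trimmed to the fields used.

```javascript
// Extract the CNAME target (if any) from a DNS-over-HTTPS JSON answer.
// DNS record type 5 is CNAME.
function cnameFromDohAnswer(json) {
    const answers = json.Answer || [];
    const record = answers.find(a => a.type === 5);
    if (!record) return null;
    // Strip the trailing dot of the fully qualified name.
    return { target: record.data.replace(/\.$/, ''), ttl: record.TTL };
}

// Minimal TTL cache so that a hostname is re-resolved only after expiry.
// `now` returns milliseconds and is injectable to keep the sketch testable.
class CnameCache {
    constructor(now = () => Date.now()) {
        this.now = now;
        this.entries = new Map();
    }
    set(hostname, target, ttlSeconds) {
        this.entries.set(hostname, {
            target,
            expires: this.now() + ttlSeconds * 1000,
        });
    }
    get(hostname) {
        const entry = this.entries.get(hostname);
        if (!entry) return undefined;
        if (entry.expires <= this.now()) {
            this.entries.delete(hostname); // stale: force a new lookup
            return undefined;
        }
        return entry.target;
    }
}

// The response quoted above, trimmed to the fields used here.
const response = {
    Status: 0,
    Question: [{ name: 'f7ds.liberation.fr.', type: 5 }],
    Answer: [{ name: 'f7ds.liberation.fr.', type: 5, TTL: 2633,
               data: 'liberation.eulerian.net.' }],
};

let clock = 0; // fake clock, in milliseconds
const cache = new CnameCache(() => clock);
const parsed = cnameFromDohAnswer(response);
cache.set('f7ds.liberation.fr', parsed.target, parsed.ttl);
console.log(cache.get('f7ds.liberation.fr'));
```

Honouring the TTL keeps the number of DoH requests low, at the cost of sending one lookup per uncached hostname to the chosen resolver, which is the privacy trade-off mentioned above.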

gorhill commented 4 years ago

Can't this be "emulated" in Chromium by resolving the hostnames using DNS over HTTPS in JSON format?

lknik commented 4 years ago

Interesting case of first-party NS alias scheme. I discovered and studied a similar approach by OpenX. Perhaps you'll find it of use?

Back then I suggested this rule:

The default filter list provide rules enabling the blocking of those requests. For example, the rule ox-d.*^auid= matches against requests to http://ox-d.example.com/auid=.... This would effectively block all requests to these domains.

But indeed if the domain name part is random this gets complicated. Good luck on solving it!

orefalo commented 4 years ago

Isn't that the technique? https://lucb1e.com/rp/cookielesscookies/

mkeenan-anomali commented 4 years ago

Also, another avenue to check is not just a canonical name lookup, but also the AS number. It won't catch cloud-hosted solutions, but for service providers that use their own networks to host tracking servers, this might add another, different data point. Any reasonable whois JSON service will return the owner / AS number.

beniz commented 4 years ago

Jumping in here to say that machine learning might do it. A closer look at in-depth data is required, of course. I've personally built character-based CNNs to block URLs in the past; it's not difficult. Avoiding false positives is also possible, at the expense of letting more unwanted traffic through. Collecting reports of false positives would allow improving the models. Anyone interested in this can ping me; I have GPUs available, and other resources.

roeme commented 4 years ago

The next step is for me to pick a cogent way for filter list maintainers to be able to tell uBO to uncloak specific hostnames. Doing this by default for all hostnames is not a good idea, as it could cause a huge number of network requests to be evaluated twice, incurring pointless overhead with no benefit for basic users (default settings/lists), for example when it concerns CDNs, which are often aliased to the site using them.

FF's dns.resolve() at least seems to cache; it remains to be checked whether passing canonical_name will incur a second request, or whether the cached information from the first request is enough. And then there might be a difference between Mozilla's TRR and the system's resolver.

pgl commented 4 years ago

Does anyone know of a service that could be used to look up CNAMEs that point to specified hostnames?

At least for my list, I could create a script that does a reverse-CNAME-lookup for entries. I've applied to use Farsight's DNSDB, we'll see if they let me in.

No service could be 100% accurate, but it might help.

janis-veinbergs commented 4 years ago

Does anyone know of a service that could be used to look up CNAMEs that point to specified hostnames?

DNS over HTTPS? For example, Cloudflare:

Invoke-RestMethod -Headers @{"Accept" = "application/dns-json"} "https://cloudflare-dns.com/dns-query?name=f7ds.liberation.fr&type=CNAME" | ConvertTo-Json
{
    "Status":  0,
    "TC":  false,
    "RD":  true,
    "RA":  true,
    "AD":  false,
    "CD":  false,
    "Question":  [
                     {
                         "name":  "f7ds.liberation.fr.",
                         "type":  5
                     }
                 ],
    "Answer":  [
                   {
                       "name":  "f7ds.liberation.fr.",
                       "type":  5,
                       "TTL":  3538,
                       "data":  "liberation.eulerian.net."
                   }
               ]
}

pgl commented 4 years ago

@janis-veinbergs I'm afraid that's looking up which hostname a CNAME record points to. You can do this with standard DNS lookups.

I'm looking for a service that, given a hostname, will show which CNAMEs point to it.
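In the absence of such a service, a reverse view can at least be built locally from forward lookups that have already been collected, for example by crawling. The sketch below assumes such a collection exists; the helper name aliasesPointingTo is hypothetical and the sample pairs are the ones reported earlier in this thread.

```javascript
// Given forward lookups (alias -> CNAME target), list every alias whose
// target is `domain` itself or one of its subdomains.
function aliasesPointingTo(forward, domain) {
    return Object.entries(forward)
        .filter(([, target]) => target === domain || target.endsWith('.' + domain))
        .map(([alias]) => alias)
        .sort();
}

// Alias/target pairs reported earlier in this thread.
const observed = {
    'buf.lemonde.fr': 'buf-lemonde-fr-cddc.at-o.net',
    'c0011.boursorama.com': 'c0011-boursorama-com-cddc.at-o.net',
    'f7ds.liberation.fr': 'liberation.eulerian.net',
};

console.log(aliasesPointingTo(observed, 'at-o.net'));
console.log(aliasesPointingTo(observed, 'eulerian.net'));
```

This only ever covers hostnames that have already been observed, so it complements rather than replaces a passive DNS database.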

cmoro-deusto commented 4 years ago

@pgl maybe this service? https://mxtoolbox.com/CNAMELookup.aspx

pgl commented 4 years ago

@cmoro-deusto This doesn't allow me to find which CNAMEs point to a particular hostname.