lightswitch05 / hosts

Hostfile blocklist for ads and tracking, updated regularly
https://www.github.developerdan.com/hosts/
Apache License 2.0

[Question] Client IP and location APIs #281

Closed MichaIng closed 2 years ago

MichaIng commented 3 years ago

I have a general question, as we noticed that various client IP and location API domains are on blocklists, yours included, e.g. reallyfreegeoip.org and ipapi.co on ads-and-tracking-extended.txt. These entries break scripts and programs which use these API services to show the client's public IP and location (interesting when using a VPN).

What I don't understand is how blocking those IPs via DNS override should help against ads or tracking:

Since many blocklists contain such API URLs, I guess there is a reason I'm overlooking? Perhaps someone can shed some light on this 🙂.

lightswitch05 commented 3 years ago

Hey MichaIng. Great question. I don't have all the answers, but I'll share my observations and opinion on a few points:

Every website or ad provider has your IP anyway as part of the request.

Yes, for the most part. The ad provider certainly does; the ad buyer might need to take extra steps.

Every website or ad provider has your IP anyway as part of the request.

I'm not an expert on how ads work, but generally there is a prebid/bid phase that shares data, and then advertisers can bid to win the ad spot. Depending on the ad system, there are different rules about what can run, but I think running JavaScript is standard - and so the advertiser can then report the IP to their trackers. You are right that reporting an IP is as simple as making a server request - just like a single pixel. But then there is the GDPR - and advertisers have to play by the rules based on the geolocation. That means they may have to decide on the client side whether the user falls under GDPR before they report that data back to the server.

When GDPR first rolled out, I saw a crazy increase in the number of these services being used. If I were developing a system that was GDPR sensitive and I couldn't determine for a fact that the IP was NOT subject to GDPR, I personally would fall back on the assumption that it was, rather than risk legal issues. That is why I like to block them, in hopes that the system will 'fall back' onto respecting my privacy.
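To illustrate the 'fall back' idea, here is a minimal sketch of how a GDPR-sensitive script might fail closed when the geo lookup returns nothing. The function name, the country set and the None convention are all assumptions made up for this illustration; it is not taken from any real tracker or consent library.

```python
from typing import Optional

# Hypothetical, deliberately incomplete set of in-scope country codes.
GDPR_COUNTRIES = {"DE", "FR", "IE", "NL"}


def must_ask_consent(country_code: Optional[str]) -> bool:
    """Return True when the visitor has to be treated as covered by GDPR.

    country_code is None when the geo-IP lookup failed, e.g. because the
    API domain was blocked through a hosts file.
    """
    if country_code is None:
        # Fail closed: an unknown location is treated as in scope.
        return True
    return country_code.upper() in GDPR_COUNTRIES
```

Whether a given script really behaves like this is exactly the open question: a script could just as well treat a failed lookup as "not in scope".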

When a website wants to find out about your location based on your IP, e.g. to show localised ads and such, it needs to do the request itself, so the blocklist entry on your client cannot prevent that.

Yes, but they first have to create that logic, which takes development time and effort. Using a 3rd party API which has already done that work is a lot easier and gets faster results. As you say, a server can easily do this, but I see these things happen a lot, so that is the only explanation I can think of.

Those APIs send IP and location to the requesting client, hence your browser, not to the website, ads provider or tracking tool.

I don't understand what you mean by 'hence your browser'. An IP alone doesn't provide a location, and to get a location from a browser, you first have to ask the user for permission to access the location. Using a 3rd party that already maps an IP to a location is a lot easier and doesn't require asking for permission.

What is instead prevented are e.g. widgets on a website which show the connecting client info, but I wouldn't call that an ad; it's a common part of websites which one explicitly visits to get that info.

Is this really that common? Is it causing that much trouble? What I normally do if I need to know if my traffic is being routed through a VPN or not is go to DuckDuckGo and type "whats my ip". You don't even have to click on any links, DuckDuckGo shows you your IP and the geolocation linked to the IP.


I went back through some of my notes for these entries just now, and it seems a lot of the pages I found using them have since migrated away and now just ask for browser permission - which is much better. Others, I guess, have implemented their geo-lookups in-house and no longer need the 3rd party.

Here is a particular one that is still using a free geo IP service to determine if the user should be tracked or not:

ao-freegeoip.herokuapp.com: Found on the URL https://www.atlasobscura.com/articles/grapefruit-history-and-drug-interactions, the URL used is https://ao-freegeoip.herokuapp.com/json/.

The script making the API call is https://assets.atlasobscura.com/assets/application-f5556fcb961e82d490c3f9ca5391a3696485a2d856bbb2fa2fb2b6473e37a0c1.js.

Prettified: https://gist.github.com/lightswitch05/f4d004083241f8a8a5720e09651d6f56

If the script determines you are in CA (californiaVisitor), then you are asked for consent for a tracking cookie. If it determines that you are a Seattle visitor (isSeattleVisitor), then it sets a special "Seattle Visitor Detected" ad campaign flag on the Google Analytics cookie. What is interesting is that it has a geolocateUser which can ask a user's browser to allow tracking, but it still does this geo IP lookup anyway. Why don't they move this API in-house vs. using this free API? Why don't they rely solely on asking the user for their location via browser consent? I don't know, but this is what they are doing.

MichaIng commented 3 years ago

Many thanks for the answer.

Yes, but they first have to create that logic, which takes development time and effort.

My point was not that those 3rd party IP-to-location providers are not used, but that a hosts entry on your browser's system cannot prevent a 3rd party from making its own request to those APIs to get the location of the IP it naturally has. But indeed I hadn't considered that scripts executed in your browser might do those requests and send the result to the 3rd party, rather than doing the lookup server-side 🤔. That makes sense, to reduce server load and let clients do the work instead 😄. But the goal would be to prevent the request to that script in the first place (hence have atlasobscura.com on the blocklist), rather than blocking requests done by the script, which are used in very different contexts for good purposes as well? When I visit the article URL with an ads+tracker blocker enabled, I don't see that request being made, while the API still works, so in this case it is possible to block the underlying script without breaking the website. But 3 ads and 10 trackers blocked, huii 👍.

Is this really that common? Is it causing that much trouble? What I normally do if I need to know if my traffic is being routed through a VPN or not is go to DuckDuckGo and type "whats my ip"

DuckDuckGo uses www.whatsmyip.org and www.iplocation.net to show this info, btw. So if some ads/trackers happened to use those as well and they landed on blocklists for that reason, that feature would be broken. And then look at the search results, which one might open instead: they are all domains that provide exactly that functionality, just like the ones on blocklists, and would hence be completely inaccessible. I'm not sure whether users of e.g. Pi-hole (or of a manually installed public hosts file) would automatically get the idea that a blocklist is responsible, rather than believing the website/API itself is broken.

The reason I came across this is that we have two scripts, a shell login banner and a VPN client setup script, which both use such an API. We were already forced in the past to switch the API provider, as the old one was found on widely used blocklists, and since we explicitly provide Pi-hole as an install option, we have a relatively large number of users for whom that functionality was broken. The small provider we switched to now has regular server problems, so we need to switch again. This time the idea was to use a large provider, with a high or no rate limit and hence presumably stable infrastructure. Since we need IPv6 support to quickly identify VPN IPv6 leaks, and short request times to not delay the shell login too much, the number of free public APIs that do not require an account/key is not that large anymore. And the best candidates are now found on several blocklists, probably because those tracking script developers choose the same APIs, for the same reasons: IPv6 support, reliability and performance 😞.


Btw, don't take this as a request to remove the API domains from your list, at least not yet 😉. For now I'm indeed evaluating whether it is generally reasonable to block such domains for privacy reasons or not, and I guess we have little chance to convince the maintainers of all popular public blocklists in reasonable time to make a difference for our API decision. I guess we'll need to maintain our own API for our users and protect it via a key or similar, so that it cannot be misused by tracker scripts 😄.

lightswitch05 commented 3 years ago

But the goal would be to prevent the request to that script in the first place (hence have atlasobscura.com on the blocklist)

That's certainly possible for Adblock-based blocking, but a hosts format is all or nothing - it can only block based on the domain and not the URL. But then, I cannot install an adblocker into apps on my phone, so they can make whatever requests they like, and the only option I'm left with is a VPN-based blocker or a Pi-hole DNS-level blocker (or both). I've considered doing Adblock-format lists, but I don't have the time for it right now. Hosts format is what I started with - so that's what I'm sticking to. That means I either block an entire domain, or nothing at all. That leaves me with the choice to block ao-freegeoip.herokuapp.com and allow the script, which is loaded from assets.atlasobscura.com. I have no doubt that if I blocked assets.atlasobscura.com, I would quickly receive a bunch of false positive reports about broken websites. ao-freegeoip.herokuapp.com is the smarter, less intrusive block.
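As a rough illustration of the "all or nothing" point, the sketch below checks URLs against hosts-format entries by hostname only. The helper names and the one-line hosts excerpt are made up for this example, and the script path in the second URL is shortened to a hypothetical application.js; a real resolver does this matching at DNS level, not on URLs.

```python
from typing import Set
from urllib.parse import urlsplit


def parse_hosts(text: str) -> Set[str]:
    """Collect hostnames from hosts-format lines such as '0.0.0.0 example.com'."""
    blocked = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split()
        # parts[0] is the sink address (0.0.0.0), the rest are hostnames.
        blocked.update(name.lower() for name in parts[1:])
    return blocked


def is_blocked(url: str, blocked: Set[str]) -> bool:
    """Only the exact hostname is compared; path and query never matter."""
    host = urlsplit(url).hostname
    return host is not None and host.lower() in blocked


HOSTS = """
# illustrative excerpt, not the real list
0.0.0.0 ao-freegeoip.herokuapp.com
"""

blocked = parse_hosts(HOSTS)
print(is_blocked("https://ao-freegeoip.herokuapp.com/json/", blocked))  # True
print(is_blocked("https://assets.atlasobscura.com/assets/application.js", blocked))  # False
```

There is no way to express "block only this one script on assets.atlasobscura.com" in this format, which is why the choice ends up being between two whole domains.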

I'm sorry these blocks are giving you trouble. It's very aggravating that advertisers and data hoarders abuse these tools to track people. I have the same issue with tools that are used for crash reporting and error detection. Those are commonly used to gather click events that are then sucked up into a larger system and used for tracking and segmenting people. Unfortunately, I do not have a solution for legitimate uses vs. illegitimate uses. Everyone has their own opinion on what is acceptable and what isn't. I've decided to take an opt-in approach and encourage users of this list to make up their own mind. I'll block things that are ads or tracking tools - and encourage end users to opt in to whatever they decide is a legitimate tool. I'm not anyone's moral police and don't want to become one. It's not always immediately clear, there are lots of gray areas and I can just do my best and encourage people to make their own decisions.

dnmTX commented 3 years ago

@lightswitch05 i got one for ya 😄

0.0.0.0 api.ipify.org

Link: https://api.ipify.org/?format=jsonp&callback=getIP

Advert for NordVPN using the scare tactic that their IP is exposed: [screenshot: Online]

And some additional info: [screenshot: Capture]

MichaIng commented 3 years ago

assets.atlasobscura.com

Ah, that is true: that domain serves a lot of assets for that website, and it will be similar in other cases where the script is served not from a 3rd party domain but from the same 2nd level domain as the website 🤔.

It's very aggravating that advertisers and data hoarders abuse these tools to track people.

In the example you mentioned, the API is not used to track people but to decide whether GDPR has to be respected. If the location is missing, that can be good for non-EU users, but it could be bad for EU users when the script treats "no response" as non-EU. I just mean that, in general, a lot of guessing is involved in concluding that blocking these APIs helps privacy.

I'm not anyone's moral police and don't want to become one. It's not always immediately clear, there are lots of gray areas and I can just do my best and encourage people to make their own decisions.

No need to defend yourself 🙂, I understand the situation, and that domains often land on lists based on single cases, while it is sometimes impossible to know in the first place whether blocking breaks legit things elsewhere. Whether the misuse cases outweigh the legit cases is naturally subjective or based on personal usage, and that is perfectly fine.

I'm personally not convinced yet that generally blocking location API domains does more good than harm, especially when done consistently, but I see the theoretical case that tracker/software scripts can make use of those APIs and can of course even send the location elsewhere, saving the recipient the need to do IP-to-location lookups itself.

@dnmTX Thanks for the example. But the API call does nothing other than show your IP in the intended field, while the warning, if you don't like it, is still present, only incomplete. ~Same here, to get rid of NordVPN banners, block go.nordvpn.net itself and you'll never have any negative impact as long as you don't want to use NordVPN.~ EDIT: Ah sorry, nope, it's the link target, not a script; the banner is HTML, which was likely not added by a script? But blocking the API in this case does not increase your privacy by a single bit 🙂. But yeah, that specific banner is ugly and more than nonsense 😄.

dnmTX commented 3 years ago

But the API call does nothing other than show your IP in the intended field, while the warning

@MichaIng you're right, but on the other hand (there is a small chat box there) it got visitors scared (based solely on the IP field). Why give them the opportunity? It's an advertisement nevertheless, which employs scare tactics.

But yeah, that specific banner is ugly and more than nonsense 😄

Right on 👍

MichaIng commented 3 years ago

Why should anyone be scared when seeing their own IP on a website they visit? Even if all APIs were blocked, a damn simple PHP script could do the same, but you wouldn't be able to get your own public IP easily anymore.
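To show how little is needed server-side, here is a minimal sketch of the same idea, written in Python with the standard-library WSGI reference server instead of PHP. The names show_ip and serve and the port are assumptions for illustration only.

```python
from wsgiref.simple_server import make_server


def show_ip(environ, start_response):
    """WSGI app that echoes the address the request came from."""
    # REMOTE_ADDR is filled in by the server from the TCP connection itself,
    # so no third-party API call is involved.
    ip = environ.get("REMOTE_ADDR", "unknown")
    body = f"Your IP address is: {ip}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(port: int = 8000) -> None:
    """Run the app locally; the port is an arbitrary choice for testing."""
    with make_server("", port, show_ip) as httpd:
        httpd.serve_forever()
```

Calling serve() and opening http://localhost:8000/ shows the connecting address. Behind a reverse proxy, REMOTE_ADDR would be the proxy's address rather than the visitor's.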

If you visit an illegal website, where a VPN becomes more relevant, then the info and an ugly but prominent banner would actually be helpful 😄. I just guess that banner is shown even when you're already using a VPN.

lightswitch05 commented 3 years ago

I don't particularly care about NordVPN's usage of it. However, there are certainly a lot of people using api.ipify.org. "Publishers Clearing House" use of the API stands out as being most likely tracking and ad related. https://publicwww.com/websites/%22api.ipify.org%22/

Anyways, I think we've fully explored the original question of why I add these to my list. For what it's worth, I don't go out of my way to discover these services, and only add them when I see them being used in a questionable manner.

MichaIng commented 3 years ago

https://publicwww.com/websites/%22api.ipify.org%22/

The first match of asana.com is actually strange. A script adds <link rel="preconnect" href="https://api.ipify.org">, but that is never used as far as I can see. Instead I see a resource from geolocation.onetrust.com with IP and location info, but not where it is actually used. The website is available in multiple languages, which may play a role, although the native browser language API could be used for that.

"Publishers Clearing House" use of the API stands out as being most likely tracking and ad related.

pch.com uses the API for nothing else than a "Your public IP address is: xxx.xxx.xxx.xxx" text, no tracking, not even the location is derived from that particular call. But the website seems to be not fully up at the moment, with "Come back soon to PCH.com!" title, so probably it's different when all services are up.

I had a look through a few other cases and could not find any tracking case, i.e. one where the location or derived info is sent to a third party. When an API is used often, it first of all means that a lot of websites are affected when it's blocked. That could mean that they cannot track as easily anymore, but it could also break wanted functionality, or be basically a no-op that merely hides your IP in an unimportant place (which matches all cases I found so far). I see the theoretical possibility that such location info can be misused, and it's impossible to make sure that this never happens on any website, so it stays a subjective case and choice, which is fine.

I think we've fully explored the original question

True, thanks for taking your time and having a look into this 👍.

lightswitch05 commented 3 years ago

Here is a really fascinating blog post confirming my belief that most clients of these services are malicious: https://major.io/2021/06/06/a-new-future-for-icanhazip/

MichaIng commented 3 years ago

But you use such a service yourself via DuckDuckGo, which obviously is no malicious client (your browser).

The article shows quite well that the majority of such API calls are not done by the "infected" clients or visitors of malicious websites respectively, but those are done from hosts in China, hence server-side, where a client-side blocklist of course has no effect:

Almost all of that traffic came from several network blocks in China.

TL;DR


I understand the idea, when a certain domain is also called by bad scripts, to block that domain and hence limit the ability of the bad script. But in this case, where the domain provides a neutral service which many people, including you, rely on at times, you block the service first of all for yourself (the users who apply your blocklist), while the bad guys who shipped the bad script have your IP anyway, and nothing prevents them from getting the same information by doing the same API calls from their machines or running their own GeoIP service, which is what the large advertisers and tracking companies do (Google, Amazon, the advertising providers/services behind them).

Over 50% of popular websites use e.g. Google Analytics, where all data, MUCH more than your IP, is sent to Google, and Google obtains the related locations and shows them to website owners, combined with specific client and OS info, system language, identifying account IDs, up to heat/click-maps and completely tracked user sessions with mouse movement and such. We avoid using such giant public data accumulating services and host things ourselves with limited and anonymized information, or use a neutral API service which does not get more than an IP and sends location info back, without having this connected to any other user information and stored forever in big tech company databases. What is effective, without any negative impact for users, is to block the loading of malicious scripts in the first place, which, I understand, is not always possible with DNS-based blocking (I guess Google Analytics uses the regular google.com domain). Blocking domains like icanhazip.com must be doubly frustrating for the maintainers of these services: not only are malware servers using them, but regular good users/clients are even actively blocked by popular blocklists, driving the bad:good ratio further up.

Finally, it is all a question of the individual case: While I'm arguing from a general perspective, looking for a reliable method to show our users their own IP and location with minimal data transfer (involving only a single party), for the individual user such a blocklist entry may not hurt or may prevent specific bad scripts from gathering location info within their own browser and sending it back somewhere else. Whether (or how many) bad scripts are actually doing it in that way is a different question, which may not even be relevant when you simply want to rule it out.

lightswitch05 commented 3 years ago

I wasn't really planning on re-opening this discussion, I just wanted to share an interesting article. But since you put a lot of thought into your response, I'll do likewise and respond.

The article shows quite well that the majority of such API calls are not done by the "infected" clients or visitors of malicious websites respectively, but those are done from hosts in China, hence server-side, where a client-side blocklist of course has no effect:

I think this is up for debate. I agree that the majority of traffic was being driven from China - but that is looking at it solely from a traffic point of view. It could easily have been a few really poorly coded clients that basically turned into a DoS attack. You can still have a wide variety of properly coded clients that would contribute very little to the overall traffic.

There was a phase for a few years where malware authors kept writing malware that would call out to icanhazip.com to find out what they had infected. If they could find out the external IP address of the systems they had compromised, they could quickly assess the value of the target. Upatre was the first, but many followed after that.

This was more the paragraph that I felt applied here. With malware, they don't want to rely on C&C servers that might get shut down; better to use a legitimate service like this. The exact same thing happened when GDPR went into effect. A lot of in-browser JavaScript started using these services to determine the location of the client without contacting their servers. Since the initial rollout of GDPR, I've seen the use of these trend back down, but they are still out there.

I understand, is not always possible with DNS-based blocking (I guess Google Analytics uses the regular google.com domain)

It is easy to block Google Analytics, and I do in this project. They do not use their root google.com domain thankfully.

Blocking domains like icanhazip.com must be doubly frustrating for the maintainers of these services: not only are malware servers using them, but regular good users/clients are even actively blocked by popular blocklists, driving the bad:good ratio further up.

I'm not the moral police, nor am I a definitive source on what is legitimate vs. illegitimate. I encourage users of this list to opt in to whatever services they decide to allow on their network. The primary use of my list is through the Pi-hole, which is a network-wide ad block tool via local DNS. It is very easy for people to see when something is being blocked by the Pi-hole, and then add an exception for the domain that should be allowed. People should make their own decision to allow it or not.

Story time. I have Chinese domains blocked via my Pi-hole. One day I was reviewing what my Pi-hole was blocking and I saw that an IoT weather station was requesting Chinese domains. I was baffled by this and sent an email to the company about it. The owner of the company replied, telling me that I was wrong and that there was no way it was happening. I responded that it appeared to only do it while I was having an internet outage. He talked to his developers, and turns out they were using some Chinese domains to determine if it had an active internet connection or not. The owner had no idea, and they made an update to remove that behavior. It was nothing malicious, but without me having it blocked to begin with, I would never have known. Perhaps the owner of the company would never have known either. I want people to be aware of what is happening on their network. Having to opt in to things like this can help them better understand what devices are talking to what servers. I think that is a good thing overall. Having people opt in is the path to understanding. If the user cannot be bothered to opt in, then the feature likely isn't that important to them.

MichaIng commented 3 years ago

This was more the paragraph that I felt applied here.

Ah true. I wonder why such malware uses such APIs to get the host's IP, since when it contacts home, the IP is included anyway. For location data, it may save resources when the evil server does not need to do those requests because the infected clients already did.

A lot of in-browser JavaScript started using these services to determine the location of the client without contacting their servers.

But if they don't contact their servers, then what is the whole purpose of it? How does calling an IP/location API affect your privacy if the data is only stored/processed on your system but never sent to a bad guy's server 🤔?

It is easy to block Google Analytics, and I do in this project. They do not use their root google.com domain thankfully.

Ah (www.)google-analytics.com, just found it 👍.

The primary use of my list is through the Pi-hole, which is a network-wide ad block tool via local DNS.

Luckily, our users also mostly run into this issue in combination with Pi-hole, which we offer as a dedicated install option, so they can run pihole -w <hostname> to whitelist the specific API we use. But another case was NextDNS, which also has the specific API hostname on one of its server-side blocklists. In the end, we'll handle failed DNS resolution as a special case and print a short hint to check blocklists and, if needed, whitelist the domain, in the place where the public IP and location would otherwise show up. That addresses the issue as well as possible, without sneaky libc-ares overrides.
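A rough sketch of that special-casing could look like the following. It is written in Python for illustration only and is not the actual script; all function names are hypothetical, and it assumes that a DNS blocker either fails the lookup or answers with a sink address such as 0.0.0.0 or ::.

```python
import ipaddress
import socket
from typing import List, Optional


def resolve(hostname: str) -> Optional[List[str]]:
    """Return the addresses for hostname, or None when resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return None
    return [info[4][0] for info in infos]


def looks_blocked(addresses: Optional[List[str]]) -> bool:
    """Heuristic: no answer at all, or only unspecified/loopback addresses."""
    if not addresses:
        return True
    for addr in addresses:
        # Strip an IPv6 scope id such as '%eth0' before parsing.
        ip = ipaddress.ip_address(addr.split("%", 1)[0])
        if not (ip.is_unspecified or ip.is_loopback):
            return False
    return True


def hint_for(hostname: str, addresses: Optional[List[str]]) -> Optional[str]:
    """Return a hint to print instead of the IP/location, or None if all is fine."""
    if looks_blocked(addresses):
        return (f"Could not resolve {hostname}. If you use a DNS blocklist "
                f"(e.g. Pi-hole), check whether it blocks this domain and "
                f"whitelist it.")
    return None
```

Usage would be hint_for(host, resolve(host)) before calling the API. The heuristic cannot tell a blocklist apart from a plain network outage, so the hint can only suggest checking.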

turns out they were using some Chinese domains to determine if it had an active internet connection or not

🤣 why the heck would one use a Chinese domain for this. But we had a similar issue: The IP/domain used for connectivity/DNS resolution checks can be chosen, but finding a good default took a while. It was clear to us that it should be an anycast IP/domain, so that it is nearly assured to be reachable with good response times from anywhere in the world. We used Google DNS first, but I didn't want to involve Google explicitly for anything. I compared a few other public DNS services, and Cloudflare 1.1.1.1/one.one.one.one was by far the fastest from the locations I was able to test. But then it turned out that this IP is used by some older routers as an internal IP (in some cases the whole 1.0.0.0/8 range), hence not resolved upstream, even though it was never officially reserved for LAN/local purposes, and for some reason many Chinese IPs block it as well. In the end, Quad9 turned out to be a reliable default.

I want people to be aware of what is happening on their network.

To be honest, what is then required instead of a blocklist is a firewall which requires EVERY domain to be whitelisted once. I tried it once, only on application level on Windows, and it is nearly impossible to do there without constantly breaking updates and such. The number of different executables and services that contact a huge number of different hosts for various purposes, completely non-transparently, is overwhelming. But on a Linux server it is very doable; for browsing, again, at least at first very annoying I guess 😄.