privacy-tech-lab / privacy-pioneer

Privacy browser extension for analyzing web traffic of visited websites
https://www.privacytechlab.org/

Look into detecting browser fingerprinting #79

Closed rgoldstein01 closed 3 years ago

rgoldstein01 commented 3 years ago

The next category of permission we should look into is fingerprinting. @danielgoldelman is going to look at several sites where we know fingerprinting is taking place, and look for any patterns we can find with the requests. I will be taking a look, as well.

cc @SebastianZimmeck

SebastianZimmeck commented 3 years ago

@danielgoldelman, please move this forward, and let's discuss on Wednesday.

You can start by generally learning what browser fingerprinting is (also here).

Then, how is it possible to identify fingerprinting in HTTP requests? For example, if a site is using fingerprint.js, is there something in the HTTP traffic that is giving that away? These sites are claimed to use that library. Try checking out if you can see anything using your browser dev tools or Fiddler or something similar.

Also, what are the main fingerprinting libraries/services?

danielgoldelman commented 3 years ago

Overview of Fingerprinting:

Browser fingerprinting is a method of identifying individuals on the internet via scripts loaded onto websites that read characteristics of your browser and computer, without any visible notice or consent form of the kind we now see with websites asking for cookies to be enabled. Importantly, browser fingerprints, unlike cookies, are not deleted when you clear your history. They even work in incognito mode or when you have disabled cookies. This method of identification is performed using a selection of different technologies, the main ones detailed here:

  1. The HTML Canvas Element (Canvas API)

HTML5 introduced the Canvas element, which is used to draw graphics into an HTML document and has many practical uses for UI design. Browser fingerprints are generated from the Canvas element by calling the .toDataURL method in JavaScript, which serializes the image drawn by your computer into a data URL (a Base64-encoded string), and then passing that string through a hashing function, which condenses it into a short identifier. Different combinations of browser, operating system, and graphics hardware draw the Canvas slightly differently, and so an identifier can be generated based on the way that your computer drew the element. (Also see WebGL and render fingerprinting)
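
To make the draw, serialize, and hash steps concrete, here is a minimal sketch. It is not taken from any particular fingerprinting library: the function names `fnv1a` and `canvasFingerprint` are hypothetical, and FNV-1a is only a stand-in for whatever hash a real script uses. The drawing calls mirror values that show up in the code samples later in this thread.

```javascript
// FNV-1a (32-bit): an illustrative stand-in for the hash a real script would use.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return hash.toString(16).padStart(8, "0");
}

// `canvas` is expected to be an HTMLCanvasElement,
// e.g. document.createElement("canvas") in a browser.
function canvasFingerprint(canvas) {
  const ctx = canvas.getContext("2d");
  ctx.font = "14px 'Arial'";
  ctx.fillStyle = "#f60";
  ctx.fillRect(125, 1, 62, 20);
  ctx.fillStyle = "#069";
  ctx.fillText("Cwm fjordbank glyphs vext quiz", 2, 15);
  // The rendered pixels differ subtly between systems, so the data URL
  // (and therefore the hash) differs too.
  return fnv1a(canvas.toDataURL());
}
```

In a browser, `canvasFingerprint(document.createElement("canvas"))` returns a short hex string. Note that the canvas never has to be attached to the page, which is part of why it is hard to spot.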

  2. Audio Identification (AudioContext API)

Browser fingerprinting software can instruct your computer to generate an audio signal, and the way that your computer produces that signal can be read and interpreted by fingerprinting scripts. The signal does not have to be audible: scripts can render it offline (for example, through an OfflineAudioContext) and read back the result. Essentially, when your computer creates an audio signal, there are many pieces of hardware and software that are utilized before the sound is heard. The audio setup in your computer, or audio stack, is potentially unique to computer models, and so unique computer models can be identified. The scripts could also potentially see what audio devices your computer is connected to via the same methods.

  3. Tracking Cookies

Tracking cookies are now fairly well known to most web users. They are small pieces of data that a site stores in your browser and that are sent back with later requests, which makes it possible to track your browsing history. The scripts that set them can also read facts about your computer, like the screen resolution, what browser you are using, and what plugins and extensions you have installed in your browser. Used together with other fingerprinting methods, unique identifiers are very simple to create, store, and track across websites.

  4. Browser Elements / Battery Status

Other, less-used methods include checking for plugins using JS scripts, checking which JS features the browser allows, and using APIs that check different elements of your browser and computer. A few examples are the Battery Status API and TextMetrics API.

https://pixelprivacy.com/resources/browser-fingerprinting/
https://www.avast.com/c-what-is-browser-fingerprinting
https://en.wikipedia.org/wiki/Device_fingerprint
https://blokt.com/guides/browser-fingerprinting
https://arxiv.org/pdf/1905.01051.pdf

Recognising Browser Fingerprinting

In general, finding evidence of browser fingerprinting is difficult. Some sites have single files that use the names of popular fingerprinting libraries like FingerprintJS (some alternatives to which are modifications to jQuery or Modernizr, enabling fingerprinting capabilities), but the majority of them bury the fingerprinting functions within other JavaScript code. The main issue with identifying fingerprinting is that websites generally encode the file names and associated URLs, so it is very difficult to identify which files may be using fingerprinting technologies. A few technologies have been used to test identifying fingerprinting:

  1. HTTP Attributes

A few HTTP attributes have the potential to fingerprint individuals on the web. Research has been done on the possibility of identifying URLs that are sending this data by analysing the URLs directly. This method is not fruitful for two main reasons: first, URLs are now often encoded, so the data being sent would not be flagged, and second, false positives are abundant since there are non-fingerprinting applications to those attributes.

  2. Machine Learning Algorithms

Studies on the issue have attempted to use machine learning software to identify such URLs, but have had major trouble with this approach. Browser fingerprinting is stateless, meaning that it does not store information locally on an individual’s computer. Much of the machine learning software written to identify tracking on the web is designed to detect stateful trackers (those that store information locally). Efforts to use machine learning thus have not been very successful.

I will put more research into FP-Inspector, which is a browser extension that intends to identify fingerprinting websites. See their paper and GitHub here: https://web.cs.ucdavis.edu/~zubair/files/fpinspector-sp2021.pdf https://github.com/uiowa-irl/FP-Inspector

I looked through potential and confirmed HTTP requests that are fingerprinting using browser tools, Fiddler, and Postman. Here are a few findings:

  1. Some files are directly named fingerprint3.js or fp.js, which obviously contain the code related to the Fingerprint.js framework. Various sites used files named this way, but they were a very small subset.
  2. Some requests send JSON-like strings containing relevant information in their Query String Parameters. These were easily noticeable, but were few and far between.
  3. For some hashed URLs, the actual file includes references to fingerprint.js-related functions. I found this by inspecting the file and using Command+F to look up “fingerprint” and other strings that would likely show up in a fingerprint.js script. There was no indication (to me) that the document or request was related to fingerprinting without looking specifically for this.
  4. The canvas element obviously does not have to be in the original HTML file, but can instead be loaded by the JS file that is fingerprinting. Searching directly for a canvas is not a guarantee that the site is trying to fingerprint.
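
Finding 2 (JSON-like strings in the query string) is the easiest of these to check mechanically. The following is a rough sketch only: `findFingerprintParams` and the `SUSPICIOUS_KEYS` list are hypothetical names and values chosen for illustration, not a vetted keyword list.

```javascript
// Hypothetical key fragments that would suggest fingerprint data in a payload.
const SUSPICIOUS_KEYS = ["canvas", "webgl", "audio", "fonts", "plugins"];

// Returns the query parameters whose (URL-decoded) value is a JSON object
// containing at least one suspicious key.
function findFingerprintParams(requestUrl) {
  const hits = [];
  const params = new URL(requestUrl).searchParams; // values are already percent-decoded
  for (const [name, value] of params) {
    let parsed;
    try {
      parsed = JSON.parse(value);
    } catch {
      continue; // not JSON, ignore this parameter
    }
    if (parsed === null || typeof parsed !== "object") continue;
    const keys = Object.keys(parsed).map((k) => k.toLowerCase());
    const matched = SUSPICIOUS_KEYS.filter((s) => keys.some((k) => k.includes(s)));
    if (matched.length > 0) hits.push({ param: name, matched });
  }
  return hits;
}
```

This only catches payloads sent in the clear. Hashed or otherwise encoded parameters, which were the common case in the requests inspected above, pass through unflagged.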

SebastianZimmeck commented 3 years ago

Great overview, @danielgoldelman! I am thinking we should keep it simple. If we can, identify if a site makes use of

I do not think that we necessarily need to be sure that a site is doing fingerprinting (though, in case of fingerprint.js it is pretty clear); rather we can say there is risk of fingerprinting.

So, @danielgoldelman, it would be helpful if you could come up with a list of existing fingerprint libraries beyond fingerprint.js. If you wanted to implement fingerprinting on your site, what libraries can you use?

The second question is, which HTML5 APIs should we look for beyond the canvas API? A list would be good (maybe, it is very short).

Once you have identified the libraries and APIs, we can go ahead and implement the detection functionality.

We probably want to have some data layer (e.g., YAML or JSON) where we store the names of the libraries and HTML5 APIs.
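
As a sketch of what such a data layer and its lookup could look like, assuming a JSON-shaped object: the structure, the name `FINGERPRINT_KEYWORDS`, and the helper `matchKeywords` are all hypothetical. The keyword strings are taken from the code samples that appear later in this thread and are not a complete or vetted list.

```javascript
// Hypothetical data layer: library and API names mapped to telltale strings.
const FINGERPRINT_KEYWORDS = {
  libraries: {
    FingerprintJS: ["fingerprintjs2", "Fingerprint2.get"],
    Sift: ["cdn.sift.com/s.js"],
    Castle: ["_castle('setAppId'"],
  },
  apis: {
    Canvas: ["toDataURL"],
    AudioContext: ["createOscillator", "createDynamicsCompressor"],
  },
};

// Returns one entry per keyword found in `text` (a request URL or a script body).
function matchKeywords(text, data = FINGERPRINT_KEYWORDS) {
  const hits = [];
  for (const [category, entries] of Object.entries(data)) {
    for (const [name, keywords] of Object.entries(entries)) {
      for (const keyword of keywords) {
        if (text.includes(keyword)) hits.push({ category, name, keyword });
      }
    }
  }
  return hits;
}
```

Keeping the `libraries` and `apis` categories separate in the result lets the UI report a library match as strong evidence and an API match as only a risk of fingerprinting.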

websites generally encode the file names and associated URLs, so it is very difficult to identify which files may be using fingerprinting technologies

If this is about decrypting HTTPS, it is not a problem. We can do that as we are in the browser. In your research, you can use a fake root certificate.

danielgoldelman commented 3 years ago

Fingerprinting Libraries / Services

Note that all of the libraries/services listed below have fingerprinting capabilities, but the most important are fingerprint.js, nmap, addthis, sift, and the MediaMath fingerprinting Script. This is not a fully comprehensive list. There are many more examples of fingerprinting libraries and services available, but through my research, these are the most important.

APIs used for fingerprinting

Note that all of the listed APIs have fingerprinting capabilities, but not all of them are as widely used. The Canvas, WebGL, WebAudio/AudioContext, Font Recognition, and BatteryStatus APIs are the important ones to focus on. The Navigator and Window APIs are also very important, but are likely not going to be a good focus for our purposes.

My next step will be to provide code examples for the most important implementations of these libraries / services / APIs being used for fingerprinting. After that, we can start implementing the detection functionality.

SebastianZimmeck commented 3 years ago

My next step will be to provide code examples for the most important implementations of these libraries / services / APIs being used for fingerprinting.

Excellent, @danielgoldelman! I'd say, no need to spend a whole lot of time on the APIs. In many cases there may be legitimate reasons for using those. The more important part is the fingerprinting libraries. Can we identify those without false positives or with low false positive rates (false positive == mistakenly assuming a non-fingerprinting library is a fingerprint library)? False negatives may also be a problem.
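
To put numbers on this once there is a labeled set of scripts, both rates can be computed as in the sketch below. The function name `errorRates` and the `{ flagged, fingerprints }` record shape are hypothetical.

```javascript
// results: array of { flagged: boolean, fingerprints: boolean }, where
// `flagged` is the detector's verdict and `fingerprints` is the ground truth.
function errorRates(results) {
  let falsePositives = 0;
  let falseNegatives = 0;
  let negatives = 0;
  let positives = 0;
  for (const r of results) {
    if (r.fingerprints) {
      positives++;
      if (!r.flagged) falseNegatives++; // missed a real fingerprinting script
    } else {
      negatives++;
      if (r.flagged) falsePositives++; // flagged a harmless script
    }
  }
  return {
    falsePositiveRate: negatives ? falsePositives / negatives : 0,
    falseNegativeRate: positives ? falseNegatives / positives : 0,
  };
}
```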

SebastianZimmeck commented 3 years ago

@danielgoldelman, this could be relevant (related Hacker News thread; maybe there are knowledgeable people on there with insights).

danielgoldelman commented 3 years ago

Libraries

I definitely agree with not deeply investigating the use of certain APIs and their related methods. The most popular fingerprinting APIs have many legitimate uses, and thus are overwhelmingly prevalent on the web. In the interest of not having false positives, our time would better be spent investigating the libraries that have fingerprinting capabilities or actively state that they are fingerprinting.

The difficulty with identifying these libraries is that there are many different ways to implement them. FingerprintJS, the most popular library by far, has many different implementations. A few are listed below:

var e = function(e) {
    var t = {
        swfContainerId: "fingerprintjs2",
        swfPath: "flash/compiled/FontList.swf",
        detectScreenOrientation: !0,
        sortPluginsFor: [/palemoon/i]
    };
    this.options = this.extend(e, t), this.nativeForEach = Array.prototype.forEach, this.nativeMap = Array.prototype.map
};

function _0x46db02() {
    var _0x39dd11 = document[isl6_0x2b5b('0xe')](isl6_0x2b5b('0xf'));
    var _0x5927cb = _0x39dd11[isl6_0x2b5b('0x10')]('2d');
    var _0x479968 = isl6_0x2b5b('0x26');
    _0x5927cb[isl6_0x2b5b('0x27')] = isl6_0x2b5b('0x28');
    _0x5927cb['font'] = '14px\x20\x27Arial\x27';
    _0x5927cb[isl6_0x2b5b('0x27')] = isl6_0x2b5b('0x29');
    _0x5927cb[isl6_0x2b5b('0x2a')] = isl6_0x2b5b('0x2b');
    _0x5927cb[isl6_0x2b5b('0x2c')](0x7d, 0x1, 0x3e, 0x14);
    _0x5927cb[isl6_0x2b5b('0x2a')] = isl6_0x2b5b('0x2d');
    _0x5927cb[isl6_0x2b5b('0x2e')](_0x479968, 0x2, 0xf);
    _0x5927cb[isl6_0x2b5b('0x2a')] = isl6_0x2b5b('0x2f');
    _0x5927cb[isl6_0x2b5b('0x2e')](_0x479968, 0x4, 0x11);
    return _0x39dd11['toDataURL']();
}

function initFingerprint() {
    Fingerprint2.get(function(e) {
        fingerprint_hash = Fingerprint2.x64hash128(e.map(function(e) {
            return e.value
        }).join(), 31);
        var t = {};
        for (i = 0; i < e.length; i++) {
            var r = e[i].key;
            "canvas" !== r && "webgl" !== r && (null != lies[r] && (lies[r].value = e[i].value - 0), t[r] = e[i].value)
        }
        fingerprint_json = JSON.stringify(t)
    })
}
hasFingerprint() && window.requestIdleCallback ? requestIdleCallback(initFingerprint) : hasFingerprint() && setTimeout(initFingerprint, 500)

This site has many examples of sites using fingerprinting or cryptomining, and has a lot of code examples to look through.

The variety of ways in which fingerprint.js is set up and used by various sites makes identifying its use very difficult. The most commonly used fingerprint.js code (which still appears in fewer than 50% of fingerprint.js scripts) is this:

swfContainerId: "fingerprintjs2",
swfPath: "flash/compiled/FontList.swf",

Of course, if we see these lines of code, there is a 100% chance that the site is fingerprinting users.
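
Because minifiers strip whitespace and may swap quote styles, an exact string comparison against those two lines can miss them. A slightly more tolerant check could look like the following sketch, where `FPJS2_MARKERS` and `looksLikeFingerprintJs2` are hypothetical names. It still will not see through obfuscation like the `_0x46db02` sample above.

```javascript
// Patterns for the two option lines above, tolerant of whitespace and quote style.
const FPJS2_MARKERS = [
  /swfContainerId\s*:\s*["']fingerprintjs2["']/,
  /swfPath\s*:\s*["']flash\/compiled\/FontList\.swf["']/,
];

// True if the script body contains either marker.
function looksLikeFingerprintJs2(scriptText) {
  return FPJS2_MARKERS.some((re) => re.test(scriptText));
}
```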

The Sift fingerprinting library will include this code:

var _user_id = 'al_capone'; // Set to the user's ID, username, or email address, or '' if not yet known.
var _session_id = 'unique_session_id'; // Set to a unique session ID for the visitor's current browsing session.

var _sift = window._sift = window._sift || [];
_sift.push(['_setAccount', 'INSERT_BEACON_KEY_HERE']);
_sift.push(['_setUserId', _user_id]);
_sift.push(['_setSessionId', _session_id]);
_sift.push(['_trackPageview']);

(function() {
function ls() {
    var e = document.createElement('script');
    e.src = 'https://cdn.sift.com/s.js';
    document.body.appendChild(e);
}
if (window.attachEvent) {
    window.attachEvent('onload', ls);
} else {
    window.addEventListener('load', ls, false);
}
})();
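
Since this snippet loads its script from a fixed URL, the request itself can be matched in addition to the page source. A minimal sketch follows, assuming the loader URL shown above stays stable; `isSiftLoader` is a hypothetical helper name.

```javascript
// True if the request targets the Sift loader script from the snippet above.
function isSiftLoader(requestUrl) {
  let url;
  try {
    url = new URL(requestUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  // Compare host and path separately so the string appearing inside another
  // URL's query string does not count as a match.
  return url.hostname === "cdn.sift.com" && url.pathname === "/s.js";
}
```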

Castle uses many different code bases, all provided open source via their github. See Castle’s example HTTP requests, webhook example, and their npm package. Sites using Castle will include the following JavaScript code:

<script type="text/javascript" src='dist/c.js'></script>
<script type="text/javascript">
  _castle('setAppId', 'YOUR_CASTLE_APP_ID');
</script>

The below object is remarkably common and makes many calls to different libraries.

{"is_audio": true, "is_canvas": true, "is_webrtc": false, "is_canvas_font": false, "audio_api_calls": [{"value": "", "symbol": "OfflineAudioContext.createOscillator", "arguments": null, "operation": "call"}, {"value": "", "symbol": "OfflineAudioContext.createDynamicsCompressor", "arguments": null, "operation": "call"}, {"value": "{}", "symbol": "OfflineAudioContext.destination", "arguments": null, "operation": "get"}, {"value": "", "symbol": "OfflineAudioContext.startRendering", "arguments": null, "operation": "call"}, {"value": "FUNCTION", "symbol": "OfflineAudioContext.oncomplete", "arguments": null, "operation": "set"}], "canvas_api_calls": [{"value": "#f60", "symbol": "CanvasRenderingContext2D.fillStyle", "arguments": null, "operation": "set"}, {"value": "#069", "symbol": "CanvasRenderingContext2D.fillStyle", "arguments": null, "operation": "set"}, {"value": "11pt no-real-font-123", "symbol": "CanvasRenderingContext2D.font", "arguments": null, "operation": "set"}, {"value": "", "symbol": "CanvasRenderingContext2D.fillText", "arguments": "[\"Cwm fjordbank glyphs vext quiz, 😃\",2,15]", "operation": "call"}, {"value": "rgba(102, 204, 0, 0.2)", "symbol": "CanvasRenderingContext2D.fillStyle", "arguments": null, "operation": "set"}, {"value": "18pt Arial", "symbol": "CanvasRenderingContext2D.font", "arguments": null, "operation": "set"}, {"value": "", "symbol": "CanvasRenderingContext2D.fillText", "arguments": "[\"Cwm fjordbank glyphs vext quiz, 😃\",4,45]", "operation": "call"}, {"value": "rgb(255,0,255)", "symbol": "CanvasRenderingContext2D.fillStyle", "arguments": null, "operation": "set"}, {"value": "rgb(0,255,255)", "symbol": "CanvasRenderingContext2D.fillStyle", "arguments": null, "operation": "set"}, {"value": "rgb(255,255,0)", "symbol": "CanvasRenderingContext2D.fillStyle", "arguments": null, "operation": "set"}, {"value": "rgb(255,0,255)", "symbol": "CanvasRenderingContext2D.fillStyle", "arguments": null, "operation": "set"}, {"value": "", "symbol": 
"HTMLCanvasElement.toDataURL", "arguments": null, "operation": "call"}]}
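
An object of this shape can be reduced to a couple of risk flags. The sketch below assumes only the field names visible in the object above. The function name `summarizeApiCalls` and the rules it applies (text drawn and then read back for canvas; oscillator plus offline rendering for audio) are illustrative choices, not an established heuristic.

```javascript
// report: an object shaped like the one above, with audio_api_calls and
// canvas_api_calls arrays of { symbol, operation, ... } entries.
function summarizeApiCalls(report) {
  const calls = [
    ...(report.audio_api_calls || []),
    ...(report.canvas_api_calls || []),
  ];
  const symbols = new Set(calls.map((c) => c.symbol));
  return {
    // Text is drawn to a canvas and the pixels are then read back.
    canvasRisk:
      symbols.has("CanvasRenderingContext2D.fillText") &&
      symbols.has("HTMLCanvasElement.toDataURL"),
    // A tone is generated and rendered offline rather than played.
    audioRisk:
      symbols.has("OfflineAudioContext.createOscillator") &&
      symbols.has("OfflineAudioContext.startRendering"),
  };
}
```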

APIs

While it is true that analysing the use of APIs will only bring up circumstantial evidence of fingerprinting, I would argue that investigating them is still worthwhile. Below is a list of methods used by these APIs that are consistent throughout fingerprinting libraries. If these show up, I recommend that we notify a user that the site may be fingerprinting. I bring this up because I did not find a lot of ways to identify the fingerprinting libraries, outside of the blatant examples above and a few less notable libraries’ code. Most sites have obfuscated their fingerprinting code, so identifying the libraries takes a lot of machine learning. The APIs are significantly more common to see, of course, so this may still be worthwhile to investigate and identify to users of our plugin.

Canvas:

Fonts/Flash:

There are a few more examples of API methods that are specifically used, which I will continue to look into.

rgoldstein01 commented 3 years ago

@danielgoldelman has condensed his list of keywords. Still needs to be refined a bit. We should do this in our meeting today.

rgoldstein01 commented 3 years ago

I have added @danielgoldelman 's keywords to the search functionality.

rgoldstein01 commented 3 years ago

The next step will be testing but that is true for all of our features. I think we can close this issue for now but will wait for our discussion Wednesday @SebastianZimmeck @danielgoldelman

danielgoldelman commented 3 years ago

Agreed.

For later reference, the keywords are specific to the most popular libraries and the object included in my comment here (see just above the API section). This should assist with flagging certain use of the most popular fingerprinting libraries.

rgoldstein01 commented 3 years ago

One thing I did not do is add your comments, @danielgoldelman, to the importJson.js file, so if you could go in and do that, just to comment in your documentation, that would be great.

danielgoldelman commented 3 years ago

Review of the progress of this issue:

Future efforts into this functionality could be made using machine learning and other methods of detecting browser fingerprinting, but at this point we have both the list compiled through the work on this issue (which will give us certain proof of browser fingerprinting) and the Disconnect list. These together form a wide net that should catch most fingerprinting activity.

rgoldstein01 commented 3 years ago

This is good to close. Testing to come.