privacy-tech-lab / privacy-pioneer

Privacy browser extension for analyzing web traffic of visited websites

https://www.privacytechlab.org/

Explore analyzing web resources #7

Closed davebaraka closed 3 years ago

davebaraka commented 4 years ago

Web apps' resources (HTML, CSS, JS, etc.) are downloaded when accessed through a browser. Generally, as a user navigates through a website, more resources are downloaded locally and content is loaded dynamically.

The question is whether we can identify privacy practices from these resources. In this exploration, we'll see whether we can identify, from these web resources, if a web app is attempting to access certain permissions. Web permissions as defined by the W3C include:

  "geolocation",
  "notifications",
  "push",
  "midi",
  "camera",
  "microphone",
  "speaker-selection",
  "device-info",
  "background-fetch",
  "background-sync",
  "bluetooth",
  "persistent-storage",
  "ambient-light-sensor",
  "accelerometer",
  "gyroscope",
  "magnetometer",
  "clipboard-read",
  "clipboard-write",
  "display-capture",
  "nfc",

The challenge here is whether we can identify these permission APIs from obfuscated code. If we have access to the development environment, this may be less of an issue.

Another potential challenge to explore is how we can programmatically intercept/retrieve these resources. Currently, I can see two possible ways. The first is through a headless browser, such as Puppeteer, and the second is through the webRequest API of a browser extension. Retrieval can also be done manually using a browser's dev tools.
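
For the browser-extension route, a minimal sketch might look like the following. It assumes a WebExtensions-style `webRequest` object (`browser.webRequest` in Firefox, `chrome.webRequest` in Chrome) and the `webRequest` permission in the manifest. `watchScriptRequests` and `onScript` are hypothetical names, and the sketch only records script URLs, not response bodies.

```javascript
// Sketch: report every script a page requests, using the extension webRequest API.
// `webRequest` is passed in so the same function works with browser.webRequest
// or chrome.webRequest; `onScript` is a hypothetical callback.
function watchScriptRequests(webRequest, onScript) {
  const listener = (details) => {
    onScript({
      url: details.url,
      // Firefox reports documentUrl, Chrome reports initiator.
      page: details.documentUrl || details.initiator || null,
    });
  };
  webRequest.onBeforeRequest.addListener(listener, {
    urls: ["<all_urls>"],
    types: ["script"],
  });
  // Return a function that stops listening.
  return () => webRequest.onBeforeRequest.removeListener(listener);
}
```

In a background script this could be started with `watchScriptRequests(browser.webRequest, (s) => console.log(s.url))`.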

Another option to identify a web app's permission usage is through a headless browser. We can crawl the website and capture the side-effects of certain permissions. For instance, if a web app requests camera access, we can watch for an error (as a headless browser does not have a camera).

davebaraka commented 4 years ago

> In this exploration, we'll see if we can identify if a web app is attempting to access certain permissions from these web resources

> The challenge here is if we can identify these permission apis from obfuscated code.

I was able to identify permission APIs from obfuscated code. I did this by manually looking through the downloaded web resources of a website in the Chrome dev tools. I chose sites based on whether Chrome indicated they were asking for a certain permission, so I knew that these sites were requesting these permissions.

Below are snippets of code from downloaded javascript files.

Accessing location from Google Maps

API: navigator.geolocation.getCurrentPosition(success[, error[, options]])

Code snippets:

        _.F.Bf = function(a) {
            var b;
            this.Ca || (b = a.callback(function() {}, "mylocation.redrawImpl", void 0, "mls"));
            var c = this.$.get();
            c ? (this.V = !0,
            this.ka.add(c, a)) : this.V && (this.ka.remove(a),
            this.V = !1);
            this.ha.get() && (this.H.W[2] = this.Da,
            this.H.W[4] = !!_.Cp.navigator.geolocation);
            this.Ca = !0;
            this.Aa.render(b)
        }

            a.V.getCurrentPosition(function(h) {
                var l = {
                    Lo: f,
                    callback: g
                }
                  , m = void 0 === l ? {} : l;
                l = void 0 === m.Lo ? !1 : m.Lo;
                m = void 0 === m.callback ? void 0 : m.callback;
                if (a.H) {
                    var q = a.H;
                    a.H = null;
                    a.Aa.stop();
                    a.kc = WY;
                    _.sz(a, a.kc, q);
                    h.coords.accuracy && a.Ca && a.Ca.write(h);
                    h.coords && (_.Cz(a.T) && c ? jrf(a, h.coords, !1, q, {
                        Lo: l,
                        callback: m
                    }) : c || mrf(a, h.coords, Kpf, q));
                    q.done(Ppf)
                }
            }, function(h) {
                if (a.H) {
                    var l = a.H;
                    a.H = null;
                    a.Aa.stop();
                    var m = a.kc;
                    h.code == h.PERMISSION_DENIED ? a.kc = Lpf : h.code == h.POSITION_UNAVAILABLE ? a.kc = Opf : h.code == h.TIMEOUT && (a.kc = Npf);
                    a.ka.set(null, l);
                    a.kc != m && _.sz(a, a.kc, l);
                    g && g(!1, l);
                    l.done(Ppf)
                }
            }, {
                enableHighAccuracy: !0,
                timeout: 1E4,
                maximumAge: 3E5
            }),
            a.Aa.start(b))
        };

Accessing Notifications from Google Meet

API: Notification.requestPermission();

Code snippet:

        _.ted = function(a) {
            if (a.Aa || !window.Notification || "function" !== typeof Notification.requestPermission)
                return _.db("denied");
            if (!Notification.permission || "default" == Notification.permission) {
                var b = _.db(Notification.requestPermission());
                b.then(function(c) {
                    "granted" == c ? a.tb.logImpression(3549) : "denied" == c ? a.tb.logImpression(3550) : a.tb.logImpression(3624)
                });
                return b
            }
            return _.db(Notification.permission)
        }
        ;

Accessing Camera and Microphone from Google Meet

API: navigator.mediaDevices.getUserMedia({ audio: true, video: true });

Code snippets:

        _.wAc = function(a) {
            var b = _.lf();
            return b && b.mediaDevices && b.mediaDevices.getUserMedia ? _.db(b.mediaDevices.getUserMedia(a)) : _.Tf("Missing getUserMedia API.")
        }

            var a = {
                audio: !1,
                video: !1
            }

Accessing Sensors from Adidas (Sensors is an umbrella term for 'gyroscope', 'magnetometer', 'ambient-light-sensor' and more. Here, I'm a little less sure which sensor Adidas was specifically requesting...)

API: window.addEventListener("deviceorientation", handleOrientation, true);

Code snippet:

            DeviceMotionEvent: qt,
            DeviceOrientationEvent: qt,

We also know which domain these files are coming from. All in all, this could be one thing we could use as an indicator that a web app is accessing a certain permission.
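
Knowing the domain also lets us label each file as first or third party. The sketch below is a naive version of that check: `baseDomain` and `isThirdParty` are hypothetical helpers, and comparing the last two host labels is an assumption that breaks for suffixes such as `co.uk`, where a Public Suffix List lookup would be needed.

```javascript
// Naive registrable-domain guess: keep the last two host labels.
// Assumption: fine for hosts like "www.example.com", wrong for "co.uk"-style suffixes.
function baseDomain(url) {
  const labels = new URL(url).hostname.split(".");
  return labels.slice(-2).join(".");
}

// A resource is treated as third party when its base domain differs from the page's.
function isThirdParty(resourceUrl, pageUrl) {
  return baseDomain(resourceUrl) !== baseDomain(pageUrl);
}
```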

rgoldstein01 commented 4 years ago

This is great!

SebastianZimmeck commented 4 years ago

Indeed, nice work! Leaving this issue open for the time being in case there is more to discuss on this topic.

SebastianZimmeck commented 4 years ago

One point @davebaraka mentioned to look into is the reliability of statically analyzing the permissions. How reliable is it? Maybe picking a small test set of sites, checking them both manually and with @davebaraka's technique, and comparing the results would lead to some insight here.

davebaraka commented 4 years ago

Here are 25 sites that I analyzed using a static analysis technique similar to PFP. I looked for keywords in all the web resources (including HTML, CSS, etc.) retrieved from the main link. For each URL below, I looked for the following permissions: LOCATION, NOTIFICATIONS, PUSH, CAMERA, MICROPHONE, and CLIPBOARDREAD. It was difficult to find and invoke the permissions from the browser since some sites have hundreds of buttons/links. With that said, there are many false positives, but the permissions that were invoked by the browser were all detected from the resources.

Overall, the analysis could be better by only scanning javascript code and possibly using some taint analysis.
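
A rough sketch of what scanning only JavaScript could look like follows. The patterns in `PERMISSION_PATTERNS` are hypothetical keyword choices, not the ones used for the table below, and `detectPermissions` assumes each resource has already been fetched as `{ type, text }`. Camera and microphone both go through `getUserMedia`, so this sketch cannot tell them apart and reports them together.

```javascript
// Hypothetical keyword patterns, one per permission label.
const PERMISSION_PATTERNS = {
  LOCATION: /\bgeolocation\b|\bgetCurrentPosition\b/,
  NOTIFICATIONS: /\bNotification\.requestPermission\b/,
  PUSH: /\bpushManager\b/,
  CAMERA: /\bgetUserMedia\b/,
  MICROPHONE: /\bgetUserMedia\b/,
  CLIPBOARDREAD: /\bclipboard\.readText\b|\bexecCommand\(\s*["']paste["']/,
};

// Return the sorted permission labels whose pattern appears in any script resource.
function detectPermissions(resources) {
  const found = new Set();
  for (const { type, text } of resources) {
    if (type !== "script") continue; // skip HTML, CSS, images, ...
    for (const [permission, pattern] of Object.entries(PERMISSION_PATTERNS)) {
      if (pattern.test(text)) found.add(permission);
    }
  }
  return [...found].sort();
}
```

A keyword match only shows that the API name appears in the code, not that it is ever called, so false positives remain possible.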

Now it would be interesting to look at open-source projects and try to mimic a developer experience, as the code would not be 'compiled'. Then we could possibly have a better understanding of whether some of these permissions are coming from third-party libraries.

| URL | PERMISSIONS INVOKED BY BROWSER | PERMISSIONS DETECTED FROM RESOURCES |
| --- | --- | --- |
| https://www.youtube.com/ | `NOTIFICATIONS` | `LOCATION` `NOTIFICATIONS` `PUSH` `CAMERA` `MICROPHONE` `CLIPBOARDREAD` |
| https://www.apple.com/ | - | `CLIPBOARDREAD` |
| https://www.linkedin.com/ | - | - |
| https://www.linkedin.com/jobs/engineering-jobs-middletown-ct?trk=homepage-basic_suggested-search&position=1&pageNum=0 | - | - |
| https://www.amazon.com/ | - | `CLIPBOARDREAD` |
| https://github.com/ | - | `CLIPBOARDREAD` |
| https://www.netflix.com/ | - | `CLIPBOARDREAD` |
| https://www.msn.com/ | - | `LOCATION` |
| https://www.twitch.tv/ | - | `PUSH` `CLIPBOARDREAD` |
| https://www.tiktok.com/ | - | `CLIPBOARDREAD` |
| https://www.roblox.com/ | - | `CLIPBOARDREAD` |
| https://www.cnet.com/ | - | `LOCATION` `NOTIFICATIONS` `PUSH` `CLIPBOARDREAD` |
| https://twitter.com/ | - | `PUSH` `CLIPBOARDREAD` |
| https://www.adobe.com/ | - | - |
| https://www.wesleyan.edu/ | - | - |
| https://www.wired.com/story/rip-google-play-music-gone-too-soon/#intcid=_wired-right-rail_58890062-b4e2-472b-a98b-d87b8d31bd50_popular4-1 | - | `LOCATION` `CLIPBOARDREAD` |
| https://www.google.com/search?q=medium&oq=medium&aqs=chrome..69i57.1247j0j9&sourceid=chrome&ie=UTF-8 | `LOCATION` | `LOCATION` |
| https://medium.com/@PhillipStutts/election-analysis-exclusive-here-is-what-will-happen-on-nov-3-f6426c3d83e7 | - | `CLIPBOARDREAD` |
| https://www.pbs.org/newshour/politics/north-carolina-to-keep-4-sites-open-longer-delaying-results | - | `NOTIFICATIONS` `CLIPBOARDREAD` |
| https://www.rottentomatoes.com/tv/blood_of_zeus/s01 | - | `LOCATION` |
| https://www.vox.com/recode/2020/11/2/21541880/wikipedia-presidential-election-misinformation-social-media | - | `LOCATION` |
| https://www.walmart.com/ip/Pikmin-3-Deluxe-NINTENDO-GAMES-Nintendo-Switch/989530281 | - | `CAMERA` `MICROPHONE` `CLIPBOARDREAD` |
| https://www.reddit.com/ | `NOTIFICATIONS` | `NOTIFICATIONS` `PUSH` `CLIPBOARDREAD` |
| https://www.yahoo.com/ | - | `NOTIFICATIONS` `LOCATION` `CLIPBOARDREAD` |
| https://www.ebay.com/ |  | `NOTIFICATIONS` `CAMERA` `MICROPHONE` |

SebastianZimmeck commented 4 years ago

These are interesting results. It seems to me this technique, at least in its current form, is not very reliable. That said, it is good to know that.
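
One way to put a number on this is to treat the permissions invoked by the browser as ground truth and count how many detections they confirm. The sketch below does that for the three rows of the table above where the browser invoked a permission. `compare` and `summarize` are hypothetical helpers, and because manual triggering was incomplete, an unconfirmed detection is not necessarily a false positive.

```javascript
// Compare what the browser invoked with what the static scan detected.
function compare(invoked, detected) {
  const confirmed = detected.filter((p) => invoked.includes(p)).length;
  return {
    confirmed,
    unconfirmed: detected.length - confirmed,
    missed: invoked.length - confirmed,
  };
}

// Rows copied from the table above (only those with a browser-invoked permission).
const rows = [
  {
    site: "https://www.youtube.com/",
    invoked: ["NOTIFICATIONS"],
    detected: ["LOCATION", "NOTIFICATIONS", "PUSH", "CAMERA", "MICROPHONE", "CLIPBOARDREAD"],
  },
  { site: "https://www.google.com/search", invoked: ["LOCATION"], detected: ["LOCATION"] },
  {
    site: "https://www.reddit.com/",
    invoked: ["NOTIFICATIONS"],
    detected: ["NOTIFICATIONS", "PUSH", "CLIPBOARDREAD"],
  },
];

// Add up the per-site counts.
function summarize(siteRows) {
  const totals = { confirmed: 0, unconfirmed: 0, missed: 0 };
  for (const row of siteRows) {
    const result = compare(row.invoked, row.detected);
    for (const key of Object.keys(totals)) totals[key] += result[key];
  }
  return totals;
}

console.log(summarize(rows));
```

For these three rows it reports 3 confirmed detections, 7 unconfirmed ones, and none missed.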

> It was difficult to find and invoke the permissions from the browser since some sites have hundreds of buttons/links.

Maybe this problem could be addressed based on the assumption that the developer would follow a tutorial that asks something along the lines of "Now, navigate to all the pages on your site that trigger a permission." Ideally, we would not make such a requirement because it would require the developer to remember where the permissions are triggered. But if it is the only or clearly best way to make it work, we can make that assumption.

> Overall, the analysis could be better by only scanning javascript code and possibly using some taint analysis.

So, the current analysis is just based on HTML tags? Scanning JavaScript may indeed add additional insights. One point to consider here is that JavaScript is often loaded dynamically, which would necessitate a real browser (e.g., driven by Selenium) as opposed to just web scraping (e.g., using Beautiful Soup).

davebaraka commented 4 years ago

I looked at a few open-source websites, and similar to the results before, I detected the CLIPBOARD-READ permission in most of them. In these cases the detection was correct: I was able to trace the permission to the libraries that used it - https://github.com/sudodoki/copy-to-clipboard#readme and https://clipboardjs.com. One detection, however, was incorrect, because the method used to detect these permissions scanned all text from the resources, not just JavaScript.

There can definitely be some improvements to this technique, but with it we could show a developer which permissions a third-party library is potentially using in their app by analyzing a development project rather than a production product. Additionally, if we couple this with HTTP request interception, assuming we could detect from which file an HTTP request is being fired, we could draw some conclusions about what data is being shared. This ties in with Analysis Features and Techniques 2.

Thinking more about Analysis Features and Techniques 2, specifically "Identifying code that is associated with certain libraries or identifying third party names in code may be useful as well," and looking a bit into Facebook Pixel, one technique that comes to mind is observing the way an app behaves when a user performs actions. For instance, what HTTP requests or function calls are fired when a user clicks a button or scrolls down a page, and does this behavior correlate with tracking or collection of some data? Granular data from these events can enable powerful fingerprinting techniques for companies.

For Analysis Features and Techniques 3, it may be necessary to get a user's cookie consent and have a cookie banner on the site. Is there such a banner? Thinking about analyzing the behavior of an app, we could analyze how an app responds when a cookie is set and when a request to set a cookie is denied. What happens to the contents of the cookie when a user denies a request to allow cookies? Similarly, can we verify a website's compliance with the CCPA by checking that a submission from a ‘Do Not Sell’ form actually sends a request? Also, the Website Evidence Collector has some good starter code for identifying first- and third-party cookies.

To sum up, observing an app’s behavior while directing it through forms via injected JavaScript or crawling may prove to be a useful method in the analysis.

SebastianZimmeck commented 4 years ago

> For instance, what http requests or function calls are being fired when a user clicks a button or scrolls down on a page, and does this behavior correlate to tracking or collection of some data.

> Thinking about analyzing the behavior of an app, we could analyze to see how an app responds when setting a cookie and denying a request to set a cookie. What happens to the contents of the cookie when a user denies a request to allow cookies. Similarly, can we verify the compliance of a website with the CCPA such that a submission from a ‘Do Not Sell’ form actually sends a request.

These are very good directions, especially as it is my sense that we want to do as much dynamic analysis as possible to keep false positives low. So, less looking at code in files and more trying to observe what is actually happening (which still leaves room for false positives, e.g., we may misinterpret a certain HTTP request, but there will probably be fewer).

SebastianZimmeck commented 4 years ago

Per our discussion today, @davebaraka will look a bit more into dynamically detecting resource use. (This may come down to intercepting HTTP requests as well, though there may also be different techniques.)
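
As one possible direction for such dynamic detection, the permission APIs from the snippets earlier in this thread could be wrapped so that every call is logged at runtime. This is only a sketch: `wrapMethod` and `instrumentPermissionApis` are hypothetical names, the code assumes it runs inside the page before any site script, and a site could still reach the original functions in other ways.

```javascript
// Replace obj[method] with a wrapper that logs the call, then delegates to the original.
function wrapMethod(obj, method, label, log) {
  const original = obj[method];
  if (typeof original !== "function") return false; // API not available here
  obj[method] = function (...args) {
    // The stack trace hints at which file triggered the permission request.
    log({ permission: label, api: method, stack: new Error().stack });
    return original.apply(this, args);
  };
  return true;
}

// Wrap the geolocation, media and notification entry points on a window-like object.
function instrumentPermissionApis(win, log) {
  const nav = win.navigator || {};
  if (nav.geolocation) {
    wrapMethod(nav.geolocation, "getCurrentPosition", "geolocation", log);
    wrapMethod(nav.geolocation, "watchPosition", "geolocation", log);
  }
  if (nav.mediaDevices) {
    wrapMethod(nav.mediaDevices, "getUserMedia", "camera/microphone", log);
  }
  if (win.Notification) {
    wrapMethod(win.Notification, "requestPermission", "notifications", log);
  }
}
```

Unlike keyword scanning, this only reports permissions that are actually requested during the visit, so it trades false positives for the need to trigger the relevant code paths.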

davebaraka commented 3 years ago

I started to look into cookies a bit more, and there does not seem to be a clear way to get all the cookies in Selenium. I can get the first-party cookies and their data from Selenium after the page loads, but I’m not able to watch for changes. To avoid spending too much time trying to get this to work in Selenium, I explored it using Puppeteer. There, we know when a cookie is being written or read and can identify the data in that cookie. Similarly, we can watch for reads and writes to local storage and get the data from there as well. We can also figure out which files these functions are being called from and whether each is first or third party. This functionality is driven by the ability to override JavaScript functions: we inject code that intercepts the reads and writes.
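
A minimal sketch of that kind of override, under stated assumptions, is shown here. `instrumentCookieAccess` and `onAccess` are hypothetical names, not the code used in this exploration. It relies on `document.cookie` being an accessor property somewhere on the document's prototype chain, and it wraps that accessor so each read and write is reported together with a stack trace that hints at the calling file.

```javascript
// Wrap the `cookie` accessor of a document-like object so reads and writes are reported.
function instrumentCookieAccess(doc, onAccess) {
  // Find the accessor wherever it lives on the prototype chain.
  let proto = doc;
  let desc;
  while (proto && !(desc = Object.getOwnPropertyDescriptor(proto, "cookie"))) {
    proto = Object.getPrototypeOf(proto);
  }
  if (!desc || !desc.get || !desc.set) {
    throw new Error("no cookie accessor found");
  }
  Object.defineProperty(doc, "cookie", {
    configurable: true,
    get() {
      const value = desc.get.call(this);
      onAccess({ type: "read", value, stack: new Error().stack });
      return value;
    },
    set(value) {
      onAccess({ type: "write", value, stack: new Error().stack });
      desc.set.call(this, value);
    },
  });
}
```

With Puppeteer, code like this could be injected before any page script runs through `page.evaluateOnNewDocument`, with the events passed back to the crawler. The same wrapping idea would apply to the `localStorage` methods.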

Looking at the data of some of these cookies in Chrome, most of the data seems like nonsense, but a few keys stood out, such as ‘ad_id’, ‘ad_privacy’, ‘uuid’, and ‘geoData’. The same keys and values, specifically ‘uuid’, appeared in multiple cookies from different domains. Additionally, as I scrolled on some websites, more cookies were being set. There were also some notable analytics companies, such as Chartbeat and AppNexus, which stored a ‘uuid’.
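
Spotting identifiers that repeat across domains can be automated once the cookies are collected. The sketch below assumes a list of `{ domain, name, value }` records (a hypothetical shape for whatever the crawler captures) and reports every key/value pair seen under more than one domain; `findSharedCookies` is a made-up name.

```javascript
// Report cookie key/value pairs that appear under more than one domain.
function findSharedCookies(cookies) {
  const domainsByPair = new Map();
  for (const { domain, name, value } of cookies) {
    const pair = `${name}=${value}`;
    if (!domainsByPair.has(pair)) domainsByPair.set(pair, new Set());
    domainsByPair.get(pair).add(domain);
  }
  return [...domainsByPair]
    .filter(([, domains]) => domains.size > 1)
    .map(([pair, domains]) => ({ pair, domains: [...domains].sort() }));
}
```

A pair that shows up under several unrelated domains is a candidate cross-site identifier and worth a manual look.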

I also realized a bit late that I had an adblocker running, which was preventing many third party libraries from being set.

I started reading this article about fingerprinting and came across a fingerprinting js package - Fingerprint2.js. This leads me to question how could we detect a package like this from resources?

SebastianZimmeck commented 3 years ago

> This functionality is driven by the ability to override javascript functions.

One point here is also that cookies can be set by frontend and backend code in Javascript, PHP, ... So, ideally, we would be able to detect all of these.
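
Cookies set by backend code arrive in `Set-Cookie` response headers, which an override of `document.cookie` does not see. As a sketch of how those could be picked up, the helper below extracts cookie names from a header list in the `{ name, value }` shape that the extension `webRequest.onHeadersReceived` event provides. `cookieNamesFromHeaders` is a hypothetical name, and splitting on newlines is an assumption about how a browser may merge several `Set-Cookie` headers into one value.

```javascript
// Extract cookie names from response headers shaped like [{ name, value }, ...].
function cookieNamesFromHeaders(responseHeaders) {
  return responseHeaders
    .filter((h) => h.name.toLowerCase() === "set-cookie" && h.value)
    // Assumption: several cookies may be joined with newlines in one header value.
    .flatMap((h) => h.value.split("\n"))
    // The cookie name is everything before the first "=".
    .map((line) => line.split("=")[0].trim());
}
```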

> I also realized a bit late that I had an adblocker running, which was preventing many third party libraries from being set.

Haha, yeah, that would be good to turn off and try again if it makes any difference.

> This leads me to question how could we detect a package like this from resources?

That would be great. There are some known advertisers, such as BlueCava and AdTruth, that use(d) fingerprinting. So, maybe fingerprinting identification can be done by identifying the server of an ad network.