Ensure data capture maps to what is blocked by Tracking Protection

Right now, our capture code compares documentUrl to targetUrl for an HTTP response object to determine whether or not a third party request is being made.

This does not take into consideration, for example, that some third parties can also make their own third party requests, potentially in a chain.

We should consider how we are capturing this data, and if we are leaving out some requests, including storing request chains (i.e. we may want to store some of our keys like documentUrl as an array, rather than a single key-value pair).

@jonathanKingston , could you elaborate more on some of these less obvious ways that third party requests can be made, and the best way to check for them?

Many ad exchanges & networks use <script src="..."> and/or <iframe> elements that make 3rd-party requests.

E.g., a site includes rubicon project <script> tag, which adds an <iframe> containing their cookie-syncing code, which draws more <iframe> elements to their partners. (See line 184 of their cookie-syncing code) (copied from here to a gist for posterity.)

@biancadanforth if the first party has a service worker then it might decide to load scripts/images/css in the page which it could control in the serviceworker. This from my testing would be ignored by our code.

So for example:

twitter.com has a SW
twitter.com has google analytics embedded
after install of the SW any script loads to GA would go through the SW being ignored by our normal checks.

Foreign fetch when implemented would make this situation worse, with wider exposure. For now please ignore this though.

Moving a discussion from PR #115 to this issue:

@biancadanforth said:

I had to add some unglamorous fixes for some old but newly salient bugs in capture.setThirdParty. Let me know what you think. I'll let you decide whether the DNM flag should be removed.

Here's an explanation of the two capture bugs:

browser.tabs.get(tabId) throws an error (and exits the script) if the tabId is -1!

An error: Error: Invalid tab ID: -1 was showing up in the console and as soon as that happens, the WebExtension API continues to listen for and fire off events. It still calls processNextEvent, but it never enters the if (this.queue >= 1) statement with the switch statement and the call to process the next event. In other words, after this error, this.processingQueue is always true and ignored is always false, so it always just gets into the return if statement in processNextEvent.

Just before the error occurs, the previous processNextEvent event was from the webRequest.onResponseStarted event; here's the response object that caused the failure:

{
    documentUrl: “http://www.youtube.com/”,
    frameId: -1,
    fromCache: true,
    Method: “GET”,
    originUrl: “https://www.youtube.com”,
    parentFrameId: -1,
    requestId: “887”,
    statusCode: 200,
    statusLine: “HTTP/2.0 200 OK”,
    tabId: -1,
    timeStamp: 1502310649656,
    Type: “script”,
    Url: “https://www.youtube.com/sw.js”
}

So it’s tabId is -1.

It does get passed into sendThirdParty, and there, I try to do a browser.tabs.get(tabId) for a tabId of -1, and this is what throws an error.

In capture.js, I replaced:

 // capture third party requests
  async sendThirdParty(response) {
    const tab = await browser.tabs.get(response.tabId);
    const documentUrl = new URL(tab.url);
    const targetUrl = new URL(response.url);

    if (targetUrl.hostname !== documentUrl.hostname
&& this.shouldStore(tab)) {
        // ...

With this:

  // capture third party requests
  async sendThirdParty(response) {
    let tab;
    try {
      tab = await browser.tabs.get(response.tabId);
    } catch (err) {
      console.log(err.message, 'ahhhh');
    }
    const documentUrl = new URL(tab.url);
    const targetUrl = new URL(response.url);

    if (targetUrl.hostname !== documentUrl.hostname
      && this.shouldStore(tab)) {
        // ...

And sure enough, I get that error message as a console.log “Invalid tab ID: -1”!

The capture.shouldStore method does check for this, but I didn’t realize that browser.tabs.get(tabId), which is called before shoudStore actually throws an error when a tab’s tabId = -1! (This is for non-visible tabs).

I have now updated capture.sendThirdParty to check for this immediately and return if the tabId is -1.

Basically because we are now awaiting the return of capture.sendFirstParty and capture.sendThirdParty with the addition of async/await with IndexedDB, if one breaks and exits for whatever reason, the whole app stops working (the method never returns and we never process the next event in the queue). This wasn't the case before when subsequent calls to these methods didn’t depend on the success of the previous call...

The response object from webRequest.onResponseStarted can have a originUrl key with a value of undefined. I have no idea why. I added a ternary operator as a bandaid fix, but if you have any ideas why this is happening and what we could do instead, please share.

@jonathanKingston said:

Can you instead check if document url is correct in this case? https://www.youtube.com/sw.js could be loading lots of it's own requests.
this is odd, going to check.

@jonathanKingston said:

When we have tabId=-1 and no documentUrl, I don't think we can do anything here. (this seems to be super rare and might be related to prerendering or something)

Example of loading a sw:

"{
  "requestId": "1610",
  "url": "https://twitter.com/push_service_worker.js",
  "originUrl": "https://twitter.com/",
  "documentUrl": "https://twitter.com/",
  "method": "GET",
  "tabId": -1,
  "type": "script",
  "timeStamp": 1502443708292,
  "frameId": -1,
  "parentFrameId": -1,
  "fromCache": false,
  "ip": "104.244.42.1",
  "statusCode": 200,
  "statusLine": "HTTP/2.0 200 OK"
}"

Example of something a service worker loads:

"{
  "requestId": "1599",
  "url": "https://abs.twimg.com/favicons/favicon.ico",
  "originUrl": "https://twitter.com/mislav",
  "documentUrl": "https://twitter.com/mislav",
  "method": "GET",
  "tabId": -1,
  "type": "image",
  "timeStamp": 1502443707122,
  "frameId": -1,
  "parentFrameId": -1,
  "fromCache": true,
  "ip": "104.244.46.103",
  "statusCode": 200,
  "statusLine": "HTTP/2.0 200 OK"
}"

So we want to make sure twimg.com is a third party of twitter.com in this case. (I think however this is on the allowlist but you get the point).

@biancadanforth said:

Jonathan, here's what I understand from our conversation - please let me know if this is a correct summary:

The response.documentUrl can be undefined, though the tab.url, which is what we're using currently for data.documentUrl is never undefined. We should perform some tests to understand why documentUrl is undefined at times (as well as the other keys), and what determines its value. What are the value for these keys (documentUrl, originUrl, targetUrl) in these cases:

an iframe within a parent frame? (does it always point to the top level frame)
An iframe within an iframe
An iframe that loads a script
an iframe that loads an image
A service worker is used
If there's a chain of iframes, should we store these values as an array on documentUrl?

To find this out, I could test in the wild on real websites, or create a fake HTML page (perhaps kept in a repo) and try each specific case and see the result.

As the code currently stands, we ignore all requests for which tabId is -1. However, as Jonathan noted above, sites like Twitter could be loading third parties through Service Workers and these would be ignored. In his example, the Service Worker is making a request to abs.twimg.com, which may or may not be on our allowlist. We should make sure that even in this case, we are capturing these requests.

@biancadanforth said:

Jonathan , I have 3 questions for you on this:

Can you give me some more specifics on how I could implement the Service Worker for testing capture?

Quoting you from slack:

you should probably be able to load an image in the main tab or an iframe and it will go through the worker if you are capturing network requests. You can also make fetch requests too if the site supports CORS

So you're saying set up a worker like Archibald does in the tutorial and load a third party image that it caches?

Also, how do I look for Service Worker requests from the capture code in the wild? What are the signatures in the response object returned by webRequest.onResponseStarted that would let me identify a service worker request?
Finally, when I chain iframe elements, only the first element loads (you can try this out in my experiment):
```
<iframe src="https://www.google.com/">
<iframe src="www.npr.org">
<iframe src="https://www.reddit.com/">
</iframe>
</iframe>
</iframe>
```
In general, I think the main point here is to catch requests made by service workers. To do this, we can't automatically filter out a request based on it's tabId. The original reason why we filtered out requests with tabId -1 was to ignore devTools tabs. However, when I load devTools without applying any filters, webRequests.onResponseStarted does not capture any of those devTools requests (regardless of tabId value).

@biancadanforth said:

Jonathan, I've updated the capture code to no longer ignore non-visible tabs such as those from Service Workers. I also performed some experiments as you recommended regarding what values the response object returned by webRequest.onResponseStarted has for less common requests. You can find that experiment here; though I had some questions about it and some limitations; which I ask you about here.

@jonathanKingston said:

This is invalid code, when we discussed chaining iframes you can't just artificially make the relationships like this to stuff origins loading into other origins. This would break the webs security model too.

More like example.html

<iframe src="https://localhost:80/frame.html"></iframe>

Which would load frame1.html:

<iframe src="https://localhost:81/frame1.html"></iframe>

Which would load frame2.html:

<iframe src="https://localhost:82/frame2.html"></iframe>

@jonathanKingston said:

Anything with a -1 tab id is likely going to be suspicious however I don't think we have any other way to distinguish these.

@jonathanKingston said:

Yeah exactly that, we should be able to make a fake site that loads a worker which has a document which loads other things. Either load from within the worker or make the script load it, however you might want to detect for when the worker has loaded navigator.serviceWorker.ready before you load the images/scripts from a third party so you can guarantee the worker loads it. You can't always guarantee just by loading a service worker that it will be active and on first load it most likely isn't loaded.

If you wait for worker load you can make some dummy js createElement and inject into the body.

So as Jakes example shows you can use the fetch event to capture what a HTML document is doing and this then acts as a proxy for the website (allowing the site to mess with the request or cache, do offline etc)

@jonathanKingston said:

The original reason why we filtered out requests with tabId -1 was to ignore devTools tabs

Maybe this has changed since in Firefox I can't remember how we replicated this. We also wanted to filter out pre rendering which I still am seeing sometimes. I think there are other edge cases however most of these seem to have no document url.

@jonathanKingston said:

One thing I wanted to ensure with this research is that we are fully understanding what the properties the browser will be sending to the extension through these requests.

So it would be good to document what properties we get for each case that we listed in the meeting and perhaps update the MDN documentation if it isn't clear (which last time I checked it wasn't 100% clear on what the properties represent).

So for example we should get from this a certainty of how we will store all of the cases, we actually need to account for the cases where third parties load more third parties at some point so we can outline to the users of lightbeam what would be blocked by tracking protection (the graph will be bigger than just third parties on the list)

@jonathanKingston , here are my test results from completing my experiments:

TL;DR: Cases 1 - 3 are captured with the current logic. Case 4 (nested iframes) is not captured. Case 5 (Service Worker) is captured when the SW pulls from the network; though it is unclear if it is captured when the SW pulls a third party resource from its cache (couldn't get this to work - COR error).

Note: The requests logged here are only what is currently captured by capture.js logic.

Case 1: An iframe loads an HTML document

<iframe src="https://skillcrush.com"></iframe>

Request#/key	`response.documentUrl`	`response.originUrl`	`response.targetUrl`
1	biancadanforth.github.io	biancadanforth.github.io	skillcrush.com
2	skillcrush.com	skillcrush.com	dozens of third party sites

Conclusion: response.documentUrl can point to iframe documents, and third party requests made by iframes are captured with our current implementation. response.originUrl mirrors response.documentUrl.

Case 2: An iframe loads a script

<iframe src="https://raw.githubusercontent.com/mozilla/localForage/master/dist/localforage.min.js">
</iframe>

Request#/key	`response.documentUrl`	`response.originUrl`	`response.targetUrl`
1	biancadanforth.github.io	biancadanforth.github.io	raw.githubusercontent.com

Conclusion Loading a third party script in an iframe is treated as a third party request with the parent frame as the documentUrl and originUrl, and is captured with our current implementation.

Case 3: An iframe loads an image

<iframe src="https://lh3.googleusercontent.com/LqqPrw2sgzIW28qOm7X3tNBC5CgSTF5PBUyOQ_VUJgejbkkq6rUyPsCbsMfmYMuvAmdg_w7Dw5AE09dKVpICrSU=s0">
</iframe>

Request#/key	`response.documentUrl`	`response.originUrl`	`response.targetUrl`
1	biancadanforth.github.io	biancadanforth.github.io	lh3.googleusercontent.com

Conclusion: Loading a third party image in an iframe is treated as a third party request with the parent frame as the documentUrl and originUrl, and is captured with our current implementation.

Case 4: A chain of iframes, each loading a separate and distinct third party resource (an HTML document in this case).

Note: The page can be found and inspected here.

<!-- parent frame, scenario-4.html @biancadanforth.com -->
<!doctype html>
<html lang="en">
  <head>
      <!-- stuff --> 
  </head>
  <body>
    <iframe src="https://biancadanforth.github.io/web-request-test/iframe-chain/page.html"></iframe>
  </body>
</html>

<!-- iframe1, page.html @biancadanforth.github.io -->
<!doctype html>
<html lang="en">
  <head>
      <!-- stuff -->
  </head>
  <body>
    <iframe src="https://skillcrush.com"></iframe>
  </body>
</html>

Request#/key	`response.documentUrl`	`response.originUrl`	`response.targetUrl`
1	biancadanforth.com	biancadanforth.com	biancadanforth.github.io
2	biancadanforth.github.io	biancadanforth.github.io	skillcrush.com
3	skillcrush.com	skillcrush.com	dozens of third party requests

Conclusion: The nested iframes' third party requests are picked up by webRequest.onResponseStarted, and the current logic does capture nested iframe requests.

Case 5: A service worker is used to load a third party resource.

I set up my own service worker and hosted the fake website over HTTPS on my GitHub website. This can be found at this location; the code for the page can be found here. A back-up that uses a service worker is this app.

Request#/key	`response.documentUrl`	`response.originUrl`	`response.targetUrl`
1	biancadanforth.github.io	biancadanforth.github.io	assets-cdn.github.com

Conclusion: The third party request (to load an image from GitHub) made by the service worker is captured with the current logic. I have not been able to get a third party resource cached in the Service Worker to test this additional scenario, as I get a COR error.

@biancadanforth the nested iframe case is an issue, we need to make sure that we attribute third party requests to their top level frame first. (We could think about capturing in another property in the storage for the requests requesting document which likely would build a completely different graph) Capturing based on loading document would also solve the issue we are trying to solve here that we want to show what the decedents of the graph were loaded by tracking scripts.

@biancadanforth it would be good to create a page which links to each test if you could that would help anyone else needing to debug this rather than comment bits out.

@jonathanKingston I cleaned up the experiment.

To move each scenario to its own page, I had to change some of the sites I use (to avoid Mixed Content warnings), since I host most of the pages on GitHub which is HTTPS. I've updated the data posted above -- the results are the same, however. As noted, the data in the tables refer to only what is currently captured.

TL;DR - Another update: Case 4, chained iframes, actually are captured correctly for third parties making third party requests.

I realized that the iframe chain case (Case 4) wasn't set up properly. In my initial tests, I was linking:

biancadanforth.com has an iframe linking to biancadanforth.github.io
biancadanforth.github.io has an iframe linking to biancadanforth.github.io
biancadanforth.github.io has an iframe linking to biancadanforth.github.io

So of course it would only capture the first of 4 requests. When I update the iframe src attributes to:

biancadanforth.com has an iframe linking to biancadanforth.github.io
biancadanforth.github.io has an iframe linking to skillcrush.com (the chain has to stop here, because I don't control skillcrush.com to add an iframe to its page)

Then I do capture the chained iframe request from biancadanforth.github.io to skillcrush.com, as well as all of the third party requests (dozens) that skillcrush.com makes... I have update the table above for Case 4 to this effect.

This means all 5 cases are captured by the current logic -- does that mean issue closed, @jonathanKingston ? :D (probably not)

Please don't close this for now (especially as the task here was to allow us to distinguish the third parties loading third parties when blocking, that is the actual work here not the research).

This to me also proves that we are using the wrong key for capture. origin will be correct for third party JS or CSS files where as document will only work for frames.

Can you also capture tabId none in this research too? That way we can tell if we can establish what the top level frame is. For example if workers then load URLs that don't allow us to track what the first party was that might be an issue later.

In our current graph we try to make it shallow to make the first party the top level frame and all third parties connected to that, we also want to be able to draw/filter third party nodes that link to third parties in a more realistic diagram.

mozilla / lightbeam-we

Ensure data capture maps to what is blocked by Tracking Protection #142