Open biancadanforth opened 7 years ago
Many ad exchanges & networks use <script src="..."> and/or <iframe> elements that make 3rd-party requests.
E.g., a site includes the Rubicon Project <script> tag, which adds an <iframe> containing their cookie-syncing code, which draws more <iframe> elements to their partners. (See line 184 of their cookie-syncing code, copied from here to a gist for posterity.)
@biancadanforth if the first party has a service worker, it might decide to load scripts/images/CSS in the page, which it could control in the service worker. From my testing, these requests would be ignored by our code.
So for example:
Foreign fetch, when implemented, would make this situation worse, with wider exposure. For now please ignore this though.
Moving a discussion from PR #115 to this issue:
@biancadanforth said:
I had to add some unglamorous fixes for some old but newly salient bugs in capture.setThirdParty. Let me know what you think. I'll let you decide whether the DNM flag should be removed.
Here's an explanation of the two capture bugs:
browser.tabs.get(tabId) throws an error (and exits the script) if the tabId is -1!
An error, Error: Invalid tab ID: -1, was showing up in the console. As soon as that happens, the WebExtension API continues to listen for and fire off events. It still calls processNextEvent, but it never enters the if (this.queue >= 1) statement with the switch statement and the call to process the next event. In other words, after this error, this.processingQueue is always true and ignored is always false, so it always just gets into the return if statement in processNextEvent.
Just before the error occurs, the previous processNextEvent event was from the webRequest.onResponseStarted event; here's the response object that caused the failure:
{
  documentUrl: "http://www.youtube.com/",
  frameId: -1,
  fromCache: true,
  method: "GET",
  originUrl: "https://www.youtube.com",
  parentFrameId: -1,
  requestId: "887",
  statusCode: 200,
  statusLine: "HTTP/2.0 200 OK",
  tabId: -1,
  timeStamp: 1502310649656,
  type: "script",
  url: "https://www.youtube.com/sw.js"
}
So its tabId is -1.
It does get passed into sendThirdParty, and there I try to do a browser.tabs.get(tabId) for a tabId of -1, and this is what throws an error.
In capture.js, I replaced:
// capture third party requests
async sendThirdParty(response) {
  const tab = await browser.tabs.get(response.tabId);
  const documentUrl = new URL(tab.url);
  const targetUrl = new URL(response.url);
  if (targetUrl.hostname !== documentUrl.hostname
    && this.shouldStore(tab)) {
    // ...
With this:
// capture third party requests
async sendThirdParty(response) {
  let tab;
  try {
    tab = await browser.tabs.get(response.tabId);
  } catch (err) {
    console.log(err.message, 'ahhhh');
  }
  const documentUrl = new URL(tab.url);
  const targetUrl = new URL(response.url);
  if (targetUrl.hostname !== documentUrl.hostname
    && this.shouldStore(tab)) {
    // ...
And sure enough, I get that error message as a console.log: “Invalid tab ID: -1”!
The capture.shouldStore method does check for this, but I didn't realize that browser.tabs.get(tabId), which is called before shouldStore, actually throws an error when a tab's tabId is -1! (This is for non-visible tabs.)
I have now updated capture.sendThirdParty to check for this immediately and return if the tabId is -1.
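A minimal sketch of that early return, not the actual capture.js change. It assumes the tab lookup is passed in as a parameter so the function can run outside the browser: `getTab` is a hypothetical stand-in for browser.tabs.get, and the returned object shape is made up for illustration.

```javascript
// Sketch: bail out before looking up a tab that does not exist.
// `getTab` is a hypothetical stand-in for browser.tabs.get.
async function sendThirdParty(response, getTab) {
  // Requests not tied to a tab report tabId -1, and the lookup rejects for it.
  if (response.tabId === -1) {
    return null;
  }
  let tab;
  try {
    tab = await getTab(response.tabId);
  } catch (err) {
    // The tab may have closed between the request and the lookup.
    return null;
  }
  const documentUrl = new URL(tab.url);
  const targetUrl = new URL(response.url);
  if (targetUrl.hostname === documentUrl.hostname) {
    return null; // same host, not a third party request
  }
  // Illustrative return value only; the real method stores the request.
  return { document: documentUrl.hostname, target: targetUrl.hostname };
}
```

Returning early on both paths means the awaiting caller always gets a settled promise, so the event queue can move on.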
Basically, because we are now awaiting the return of capture.sendFirstParty and capture.sendThirdParty with the addition of async/await with IndexedDB, if one breaks and exits for whatever reason, the whole app stops working (the method never returns and we never process the next event in the queue). This wasn't the case before, when subsequent calls to these methods didn't depend on the success of the previous call...
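One way to keep a single failing event from stalling the queue is to reset the flag in a finally block and catch errors per event. The sketch below borrows the names processNextEvent, queue and processingQueue from the description above; the class itself, `handler`, `addToQueue` and `errors` are hypothetical and do not mirror the real capture.js.

```javascript
// Sketch: a serial event queue that survives a failing handler.
class EventQueue {
  constructor(handler) {
    this.queue = [];
    this.processingQueue = false;
    this.handler = handler; // hypothetical async function called once per event
    this.errors = [];       // hypothetical: failures are collected here
  }

  addToQueue(event) {
    this.queue.push(event);
    return this.processNextEvent();
  }

  async processNextEvent() {
    if (this.processingQueue) {
      return; // another call is already draining the queue
    }
    this.processingQueue = true;
    try {
      while (this.queue.length >= 1) {
        const event = this.queue.shift();
        try {
          await this.handler(event);
        } catch (err) {
          // Record and move on so one bad event cannot block the rest.
          this.errors.push(err);
        }
      }
    } finally {
      // Always release the flag, even if something unexpected throws.
      this.processingQueue = false;
    }
  }
}
```

With this shape, an "Invalid tab ID" rejection costs one event rather than every event after it.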
The response object from webRequest.onResponseStarted can have an originUrl key with a value of undefined. I have no idea why. I added a ternary operator as a band-aid fix, but if you have any ideas why this is happening and what we could do instead, please share.
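The thread does not show the ternary itself, so the following is only a guess at the general shape of such a guard. `hostnameOrNull` and `requestOrigin` are hypothetical helpers, and falling back to documentUrl is an assumption, not something confirmed here.

```javascript
// Sketch: parse a URL that may be missing instead of letting `new URL` throw.
function hostnameOrNull(urlString) {
  if (typeof urlString !== 'string' || urlString === '') {
    return null;
  }
  try {
    return new URL(urlString).hostname;
  } catch (err) {
    return null; // not a parseable absolute URL
  }
}

// Assumption: use documentUrl when originUrl is undefined.
function requestOrigin(response) {
  const origin = hostnameOrNull(response.originUrl);
  return origin !== null ? origin : hostnameOrNull(response.documentUrl);
}
```

A null result leaves the decision (skip, or attribute to the tab URL) to the caller.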
@jonathanKingston said:
Example of loading a sw:
"{
"requestId": "1610",
"url": "https://twitter.com/push_service_worker.js",
"originUrl": "https://twitter.com/",
"documentUrl": "https://twitter.com/",
"method": "GET",
"tabId": -1,
"type": "script",
"timeStamp": 1502443708292,
"frameId": -1,
"parentFrameId": -1,
"fromCache": false,
"ip": "104.244.42.1",
"statusCode": 200,
"statusLine": "HTTP/2.0 200 OK"
}"
Example of something a service worker loads:
"{
"requestId": "1599",
"url": "https://abs.twimg.com/favicons/favicon.ico",
"originUrl": "https://twitter.com/mislav",
"documentUrl": "https://twitter.com/mislav",
"method": "GET",
"tabId": -1,
"type": "image",
"timeStamp": 1502443707122,
"frameId": -1,
"parentFrameId": -1,
"fromCache": true,
"ip": "104.244.46.103",
"statusCode": 200,
"statusLine": "HTTP/2.0 200 OK"
}"
So we want to make sure twimg.com is treated as a third party of twitter.com in this case. (I think this one is on the allowlist, however, but you get the point.)
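As a rough illustration, the two objects above can be classified without any tab lookup by comparing the hostname of originUrl with the hostname of url. `isThirdParty` is a hypothetical helper, and it uses the same plain hostname comparison as the capture code quoted earlier in this thread.

```javascript
// Sketch: classify a webRequest details object without needing a tab.
function isThirdParty(details) {
  const origin = new URL(details.originUrl).hostname;
  const target = new URL(details.url).hostname;
  return origin !== target;
}

// The service worker script itself is first party.
const workerScript = {
  url: 'https://twitter.com/push_service_worker.js',
  originUrl: 'https://twitter.com/',
  tabId: -1,
};

// The favicon the worker loads comes from a different host.
const workerLoad = {
  url: 'https://abs.twimg.com/favicons/favicon.ico',
  originUrl: 'https://twitter.com/mislav',
  tabId: -1,
};
```

Because the comparison is on full hostnames, subdomains of the same site also count as third parties; comparing registrable domains would be stricter, but that is outside this sketch.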
@biancadanforth said:
Jonathan, here's what I understand from our conversation - please let me know if this is a correct summary:
The response.documentUrl can be undefined, though the tab.url, which is what we're currently using for data.documentUrl, is never undefined. We should perform some tests to understand why documentUrl is undefined at times (as well as the other keys), and what determines its value. What are the values for these keys (documentUrl, originUrl, targetUrl) in these cases:
documentUrl?
To find this out, I could test in the wild on real websites, or create a fake HTML page (perhaps kept in a repo) and try each specific case and see the result.
As the code currently stands, we ignore all requests for which tabId is -1. However, as Jonathan noted above, sites like Twitter could be loading third parties through Service Workers and these would be ignored. In his example, the Service Worker is making a request to abs.twimg.com, which may or may not be on our allowlist. We should make sure that even in this case, we are capturing these requests.
@biancadanforth said:
Jonathan, I have 3 questions for you on this:
Quoting you from slack:
you should probably be able to load an image in the main tab or an iframe and it will go through the worker if you are capturing network requests. You can also make fetch requests too if the site supports CORS
So you're saying set up a worker like Archibald does in the tutorial and load a third party image that it caches?
Also, how do I look for Service Worker requests from the capture code in the wild? What are the signatures in the response object returned by webRequest.onResponseStarted that would let me identify a service worker request?
Finally, when I chain iframe elements, only the first element loads (you can try this out in my experiment):
<iframe src="https://www.google.com/">
<iframe src="www.npr.org">
<iframe src="https://www.reddit.com/">
</iframe>
</iframe>
</iframe>
In general, I think the main point here is to catch requests made by service workers. To do this, we can't automatically filter out a request based on its tabId. The original reason why we filtered out requests with tabId -1 was to ignore devTools tabs. However, when I load devTools without applying any filters, webRequest.onResponseStarted does not capture any of those devTools requests (regardless of tabId value).
@biancadanforth said:
Jonathan, I've updated the capture code to no longer ignore non-visible tabs such as those from Service Workers. I also performed some experiments, as you recommended, on what values the response object returned by webRequest.onResponseStarted has for less common requests. You can find that experiment here, though I had some questions about it and some limitations, which I ask you about here.
@jonathanKingston said:
More like example.html
<iframe src="https://localhost:80/frame.html"></iframe>
Which would load frame1.html:
<iframe src="https://localhost:81/frame1.html"></iframe>
Which would load frame2.html:
<iframe src="https://localhost:82/frame2.html"></iframe>
@jonathanKingston said:
If you wait for the worker to load, you can use some dummy JS to createElement and inject the element into the body.
So, as Jake's example shows, you can use the fetch event to capture what an HTML document is doing, and the worker then acts as a proxy for the website (allowing the site to mess with the request or cache, work offline, etc.).
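For reference, a worker acting as that kind of proxy might look roughly like the sketch below, which answers from the cache first and goes to the network otherwise. This is not the code from Jake's tutorial: `cacheFirst` and `CACHE_NAME` are hypothetical, and the cache and fetch function are passed in so the strategy can be exercised outside a worker.

```javascript
// Sketch: cache-first strategy for a service worker fetch handler.
const CACHE_NAME = 'demo-cache-v1'; // hypothetical cache name

async function cacheFirst(request, cache, fetchFn) {
  const cached = await cache.match(request);
  if (cached) {
    return cached; // answered from the cache
  }
  const response = await fetchFn(request);
  // A response body can only be read once, so store a clone.
  await cache.put(request, response.clone());
  return response;
}

// Register only inside a real service worker; skipped elsewhere (e.g. Node).
if (typeof ServiceWorkerGlobalScope !== 'undefined' &&
    self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener('fetch', (event) => {
    event.respondWith(
      caches.open(CACHE_NAME).then((cache) =>
        cacheFirst(event.request, cache, (req) => fetch(req)))
    );
  });
}
```

Any image or script the page requests passes through this handler once the worker controls the page, which is why such loads are interesting for capture.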
@jonathanKingston said:
The original reason why we filtered out requests with tabId -1 was to ignore devTools tabs
Maybe this has changed since in Firefox; I can't remember how we replicated this. We also wanted to filter out pre-rendering, which I am still seeing sometimes. I think there are other edge cases; however, most of these seem to have no document URL.
@jonathanKingston said:
One thing I wanted to ensure with this research is that we fully understand what properties the browser will be sending to the extension through these requests.
So it would be good to document what properties we get for each case that we listed in the meeting, and perhaps update the MDN documentation if it isn't clear (last time I checked, it wasn't 100% clear on what the properties represent).
So, for example, we should get from this a certainty of how we will store all of the cases. We actually need to account for the cases where third parties load more third parties at some point, so we can outline to the users of Lightbeam what would be blocked by tracking protection (the graph will be bigger than just the third parties on the list).
@jonathanKingston, here are my test results from completing my experiments:
TL;DR: Cases 1 - 3 are captured with the current logic. Case 4 (nested iframes) is not captured. Case 5 (Service Worker) is captured when the SW pulls from the network, though it is unclear if it is captured when the SW pulls a third party resource from its cache (I couldn't get this to work: CORS error).
Note: The requests logged here are only what is currently captured by capture.js logic.
Case 1: An iframe loads an HTML document
<iframe src="https://skillcrush.com"></iframe>
Request#/key | response.documentUrl | response.originUrl | response.targetUrl |
---|---|---|---|
1 | biancadanforth.github.io | biancadanforth.github.io | skillcrush.com |
2 | skillcrush.com | skillcrush.com | dozens of third party sites |
Conclusion:
response.documentUrl can point to iframe documents, and third party requests made by iframes are captured with our current implementation. response.originUrl mirrors response.documentUrl.
Case 2: An iframe loads a script
<iframe src="https://raw.githubusercontent.com/mozilla/localForage/master/dist/localforage.min.js">
</iframe>
Request#/key | response.documentUrl | response.originUrl | response.targetUrl |
---|---|---|---|
1 | biancadanforth.github.io | biancadanforth.github.io | raw.githubusercontent.com |
Conclusion:
Loading a third party script in an iframe is treated as a third party request with the parent frame as the documentUrl and originUrl, and is captured with our current implementation.
Case 3: An iframe loads an image
<iframe src="https://lh3.googleusercontent.com/LqqPrw2sgzIW28qOm7X3tNBC5CgSTF5PBUyOQ_VUJgejbkkq6rUyPsCbsMfmYMuvAmdg_w7Dw5AE09dKVpICrSU=s0">
</iframe>
Request#/key | response.documentUrl | response.originUrl | response.targetUrl |
---|---|---|---|
1 | biancadanforth.github.io | biancadanforth.github.io | lh3.googleusercontent.com |
Conclusion:
Loading a third party image in an iframe is treated as a third party request with the parent frame as the documentUrl and originUrl, and is captured with our current implementation.
Case 4: A chain of iframes, each loading a separate and distinct third party resource (an HTML document in this case).
Note: The page can be found and inspected here.
<!-- parent frame, scenario-4.html @biancadanforth.com -->
<!doctype html>
<html lang="en">
<head>
<!-- stuff -->
</head>
<body>
<iframe src="https://biancadanforth.github.io/web-request-test/iframe-chain/page.html"></iframe>
</body>
</html>
<!-- iframe1, page.html @biancadanforth.github.io -->
<!doctype html>
<html lang="en">
<head>
<!-- stuff -->
</head>
<body>
<iframe src="https://skillcrush.com"></iframe>
</body>
</html>
Request#/key | response.documentUrl | response.originUrl | response.targetUrl |
---|---|---|---|
1 | biancadanforth.com | biancadanforth.com | biancadanforth.github.io |
2 | biancadanforth.github.io | biancadanforth.github.io | skillcrush.com |
3 | skillcrush.com | skillcrush.com | dozens of third party requests |
Conclusion:
The nested iframes' third party requests are picked up by webRequest.onResponseStarted, and the current logic does capture nested iframe requests.
Case 5: A service worker is used to load a third party resource.
I set up my own service worker and hosted the fake website over HTTPS on my GitHub website. This can be found at this location; the code for the page can be found here. A back-up that uses a service worker is this app.
Request#/key | response.documentUrl | response.originUrl | response.targetUrl |
---|---|---|---|
1 | biancadanforth.github.io | biancadanforth.github.io | assets-cdn.github.com |
Conclusion: The third party request (to load an image from GitHub) made by the service worker is captured with the current logic. I have not been able to get a third party resource cached in the Service Worker to test this additional scenario, as I get a CORS error.
@biancadanforth the nested iframe case is an issue: we need to make sure that we attribute third party requests to their top level frame first. (We could think about capturing the requesting document for each request in another property in the storage, which would likely build a completely different graph.) Capturing based on the loading document would also solve the issue we are trying to solve here: we want to show which descendants in the graph were loaded by tracking scripts.
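To make the two attributions concrete, here is a hedged sketch of a stored record that keeps both the top level first party and the requesting document. Every name in it (`toRecord`, `firstParty`, `requestedBy`, `viaThirdParty`, and the `topLevelUrl` argument) is hypothetical, not the existing storage schema; how the top level URL is obtained is left to the caller.

```javascript
// Sketch: keep both attributions for one request.
// `topLevelUrl` is assumed to be known by the caller (e.g. the tab's URL).
function toRecord(details, topLevelUrl) {
  const firstParty = new URL(topLevelUrl).hostname;
  const target = new URL(details.url).hostname;
  // Fall back through the available keys, since some can be undefined.
  const requesterUrl = details.documentUrl || details.originUrl || topLevelUrl;
  const requestedBy = new URL(requesterUrl).hostname;
  return {
    firstParty,                                // node the shallow graph hangs off
    requestedBy,                               // document that made the request
    target,
    isThirdParty: target !== firstParty,
    viaThirdParty: requestedBy !== firstParty, // a third party loaded this
  };
}
```

For row 2 of Case 4, this would attribute skillcrush.com to biancadanforth.com as first party while still recording that biancadanforth.github.io requested it.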
@biancadanforth it would be good to create a page which links to each test, if you could. That would help anyone else needing to debug this, rather than commenting bits out.
@jonathanKingston I cleaned up the experiment.
To move each scenario to its own page, I had to change some of the sites I use (to avoid Mixed Content warnings), since I host most of the pages on GitHub which is HTTPS. I've updated the data posted above -- the results are the same, however. As noted, the data in the tables refer to only what is currently captured.
TL;DR - Another update: Case 4, chained iframes, is actually captured correctly for third parties making third party requests.
I realized that the iframe chain case (Case 4) wasn't set up properly. In my initial tests, I was linking:
So of course it would only capture the first of 4 requests. When I update the iframe src attributes to:
Then I do capture the chained iframe request from biancadanforth.github.io to skillcrush.com, as well as all of the third party requests (dozens) that skillcrush.com makes... I have updated the table above for Case 4 to this effect.
This means all 5 cases are captured by the current logic -- does that mean issue closed, @jonathanKingston ? :D (probably not)
Please don't close this for now (especially as the task here was to allow us to distinguish the third parties loading third parties when blocking; that is the actual work here, not the research).
This to me also proves that we are using the wrong key for capture. origin will be correct for third party JS or CSS files, whereas document will only work for frames.
Can you also capture tabId none in this research too? That way we can tell if we can establish what the top level frame is. For example, if workers then load URLs that don't allow us to track what the first party was, that might be an issue later.
In our current graph we try to keep it shallow, making the first party the top level frame with all third parties connected to that. We also want to be able to draw/filter third party nodes that link to third parties in a more realistic diagram.
Right now, our capture code compares documentUrl to targetUrl for an HTTP response object to determine whether or not a third party request is being made. This does not take into consideration, for example, that some third parties can also make their own third party requests, potentially in a chain.
We should consider how we are capturing this data, and if we are leaving out some requests, including storing request chains (i.e. we may want to store some of our keys like documentUrl as an array, rather than a single key-value pair).
@jonathanKingston, could you elaborate more on some of these less obvious ways that third party requests can be made, and the best way to check for them?
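One possible reading of "store as an array" is to derive the chain from per-request requester links rather than storing it directly. The sketch below assumes hypothetical records with `requestedBy` and `target` hostnames, and assumes each target has a single requester, which real pages will not always satisfy.

```javascript
// Sketch: rebuild a request chain from stored requester -> target pairs.
// Simplification: if a target has several requesters, the last one wins.
function requestChain(records, target) {
  const requesterOf = new Map(records.map((r) => [r.target, r.requestedBy]));
  const chain = [target];
  const seen = new Set(chain);
  let current = target;
  while (requesterOf.has(current)) {
    current = requesterOf.get(current);
    if (seen.has(current)) {
      break; // guard against cycles between hosts
    }
    seen.add(current);
    chain.unshift(current); // requesters go in front of what they requested
  }
  return chain;
}
```

Applied to the Case 4 rows, the chain for skillcrush.com would start at biancadanforth.com, pass through biancadanforth.github.io, and end at skillcrush.com.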