privacy-tech-lab / gpc-optmeowt

Privacy browser extension for opting out from web tracking via GPC
https://www.privacytechlab.org
MIT License

Implement the IAB CCPA Compliance Framework #17

Closed SebastianZimmeck closed 4 years ago

SebastianZimmeck commented 4 years ago

Depending a bit on how things are developing on the policy end (i.e., whether we find some support for our signal, ideally in terms of standardization), we should consider also implementing the IAB CCPA Compliance Framework (should it turn out that there is not a whole lot of support for our signal). Their US Privacy String follows a similar idea as our signal, is technology-agnostic, and can be sent via a browser extension. It would be binding for companies participating in the IAB.

Here are some further references:

SebastianZimmeck commented 4 years ago

I think we should do this actually. Per today's final CCPA Regs, it is clear that technological solutions for opting out from the sale of personal information will be developed. Members of the IAB will probably implement the IAB approach. So, if our extension could handle this as well, that would be great.

Probably, the first order of business should be to find some example sites that are participating in the IAB approach. The Privacy String is technology-agnostic. So, how are people implementing this in practice? Ideally, it would be a header-based solution as well.

SebastianZimmeck commented 4 years ago

This also relates to the enhancement of the IAB CCPA framework.

SebastianZimmeck commented 4 years ago

I thought about this some more. We should actually implement the Do Not Sell opt out functionality per the IAB CCPA framework. This will be the first real functioning opt out for all sites that make use of IAB CCPA framework. Here is the plan:

The IAB Tech Lab U.S. Privacy String consists of four characters and is very similar to what we are actually doing. Here is an example:

Example 2 meets the following conditions:

Version 1 of the US Privacy string is being used. (1)
The digital property has NOT provided explicit user notice. (N)
The user has made a choice to opt out of sale. (Y)
The digital property is not operating under the Limited Service Provider Agreement. (N)

1NYN
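To make the four positions concrete, here is a minimal sketch of a parser for the string. The function name and the field labels are hypothetical, and the sketch assumes each of the last three characters is Y, N, or a hyphen for "not applicable".

```javascript
// Sketch: decode a US Privacy String such as "1NYN" into named fields.
// parseUsPrivacy and its field names are illustrative labels only.
function parseUsPrivacy(str) {
  // One version digit, then three characters that are each Y, N, or "-".
  const match = /^(\d)([YN-])([YN-])([YN-])$/i.exec(str);
  if (match === null) {
    return null; // not a well-formed string
  }
  return {
    version: Number(match[1]),
    noticeGiven: match[2].toUpperCase(),
    optOutSale: match[3].toUpperCase(),
    lspaCovered: match[4].toUpperCase(),
  };
}

console.log(parseUsPrivacy("1NYN"));
```

For the example above, the third field (optOutSale) is the one that carries the user's opt-out choice.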

How the four-character string is transmitted to the server is, in principle, technology-agnostic. However, the IAB recommends storage in a first-party cookie:

The recommendation is to store the String in a first-party cookie named "usprivacy" where the API library can read it and write to it. In case storing on a 1st party cookie is not possible or practical (such as on mobile native or if cookies are disabled), a different storage method should be adopted.

What I am hoping is that our extension can identify the usprivacy cookie and rewrite its value, most importantly, with a Y as the third of the four-character string. Here is the EditThisCookie browser extension that manages to get access to the cookies of the currently visited website. The user can then edit its values. (As an aside, let's not copy any code directly from the extension. It comes under the GNU GPL license and would "infect" our code. However, we can use the same APIs to identify the cookie and rewrite its value. Just not directly copy the EditThisCookie code.)

Here is a screenshot of using the EditThisCookie extension on https://psychologist.onl/

[Screenshot: EditThisCookie showing the cookies set on https://psychologist.onl/]

It would be great if we could automatically rewrite cookie values like this to, say, 1NYN.
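The rewrite itself could be as small as the sketch below. The helper name withOptOut is hypothetical, and falling back to "1NYN" for a malformed value is an assumption, not something the IAB specification prescribes.

```javascript
// Sketch: return a copy of a US Privacy String with the opt-out flag set.
// The "1NYN" fallback for malformed input is an assumption.
function withOptOut(value) {
  const wellFormed = /^\d[YN-]{3}$/i.test(value);
  if (!wellFormed) {
    return "1NYN";
  }
  // Keep the version, notice, and LSPA characters; force the third one to Y.
  return value.slice(0, 2) + "Y" + value.slice(3);
}

console.log(withOptOut("1YNN")); // "1YYN"
```

Only the third character changes, so whatever the site recorded about notice and the LSPA is left as the site wrote it.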

Now, this is probably a somewhat tricky task, especially, synching this with the whitelist (let's not worry about that for the time being). We may want to split this into multiple issues. For the time being, I am leaving this together, though.

kalicki1 commented 4 years ago

I started working on implementing the IAB CCPA guidelines for opting-out in our extension and managed to get a rudimentary cookie-modifier functioning. I haven't uploaded the code yet just because it is heavily based on how AccuWeather has implemented the framework (they have an 'opt-out' link in the site footer that links to a us_privacy cookie for site visitors to use), in order to rule out as many variables in the implementation process as possible. The rough JS is pasted below. I originally had this function run every time after our original extension modified a given site's request headers, so this function definitely runs too many times for its intended purpose in my current implementation.

chrome.cookies.get({
      name: 'us_privacy', // TODO: make this lookup case-insensitive
      url: 'https://www.accuweather.com/'
    },
    function (cookie) {
      if (cookie !== null) {
        // Work on a copy so the object returned by the API is not mutated
        let new_cookie = Object.assign({}, cookie)
        new_cookie.value = '1YYN'
        new_cookie.url = 'https://www.accuweather.com/'
        // Leave the domain unset so the cookie is host-only (see notes below)
        new_cookie.domain = null
        // cookies.get returns hostOnly and session, but cookies.set
        // does not accept them, so drop both before writing
        delete new_cookie.hostOnly
        delete new_cookie.session
        chrome.cookies.set(new_cookie, function (details) {
          console.log("Found and updated cookie value.")
        })
      } else {
        console.log("COOKIE NULL")
      }
    })

A few notes

So the leading dot in chrome don't reflect whether or not a leading dot was used from the server, but whether or not that cookie had a "Domain=something" in its definition from the server. (And if it had, the cookie will also be sent to sub-domains).

Basically, it says that a domain value of null causes the browser to set the cookie's hostOnly value to true. According to the Chrome API documentation, this means that hostOnly is

True if the cookie is a host-only cookie (i.e. a request's host must exactly match the domain of the cookie).
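The difference between a host-only cookie and a domain cookie can be sketched as a small matching function. This is a simplified illustration, not Chrome's implementation: it ignores paths, the Secure flag, and public-suffix rules, and the function name and the example.com hosts are hypothetical.

```javascript
// Sketch: does a cookie apply to a request for the given host?
// Simplified; real browsers also check path, Secure, and public suffixes.
function cookieAppliesToHost(cookie, host) {
  const domain = cookie.domain.replace(/^\./, "").toLowerCase();
  const h = host.toLowerCase();
  if (cookie.hostOnly) {
    return h === domain; // host-only: exact match required
  }
  // Domain cookie: the domain itself and any of its subdomains match.
  return h === domain || h.endsWith("." + domain);
}

const hostOnlyCookie = { domain: "www.example.com", hostOnly: true };
const domainCookie = { domain: ".example.com", hostOnly: false };
console.log(cookieAppliesToHost(hostOnlyCookie, "shop.example.com")); // false
console.log(cookieAppliesToHost(domainCookie, "shop.example.com")); // true
```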

To me it looks like this is an arbitrary choice on AccuWeather's part, though there could be some other reasons they set the domain this way that we don't know about. When I implement this I can just make sure to make a note of how a given site handles its cookie domain somewhere so that we only ever have one copy of the us_privacy cookie. However, this could get confusing if we are the ones setting a cookie and not reading one already added to the site's storage, so I think we will have to add a function that checks for multiple signals and handles them somehow.

Overall

In general, the notes above are for future reference so my thought process on how to implement the IAB proposal is documented somewhere.

@SebastianZimmeck, I will start to generalize this implementation and break it up into more fleshed out and manageable chunks over the next few days. Let me know what you think about what I have found.

SebastianZimmeck commented 4 years ago

heavily based on how AccuWeather has implemented the framework

If it is not straight copied, it is OK in terms of copyright. You can certainly use the same APIs and the code can look similar.

I originally had this function run every time after our original extension modified a given site's request headers, so this function definitely runs too many times for its intended purpose in my current implementation.

If it is not a drag on performance, that would be OK. Maybe it is possible to identify requests that carry a cookie. The function would only need to be run for those.

The points you describe with the hostOnly and other settings seem tricky. Perhaps, we can discuss this more tomorrow.

I will start to generalize this implementation and break it up into more fleshed out and manageable chunks over the next few days. Let me know what you think about what I have found.

Sounds good. Generally, this looks promising to me.

kalicki1 commented 4 years ago

heavily based on how AccuWeather has implemented the framework

Maybe I misspoke when I said this. What I meant is that since AccuWeather does set and use the us_privacy first-party cookie, I tested out my code on the site to see if I could set up some sort of "communication," even if they didn't respond. In other words, could I see that they created a cookie on my browser and then could I modify that specific cookie.

The rest does look promising, hopefully this is the right way forward!

SebastianZimmeck commented 4 years ago

A few more sites that can be used for testing as they are using the IAB CCPA Compliance Framework.

Here is a site that lists various other domains having the usprivacy cookie. It does not seem to be always correct, though.

kalicki1 commented 4 years ago

The recent commit I pushed has functionality for sending a chosen cookie to every single site visited by a user. So far, there is no functionality to check whether a cookie is already on the site and, if there is one, to parse it and respond accordingly. I will continue to work on this alongside my suggestion below as well as issue #42.

@SebastianZimmeck, after reading your updates to issue #42, I was thinking that we could use the idea of storing known ad networks' cookie profiles in a JSON file and extend it slightly to allow us to store known variations of the us_privacy signal, such as us_privacy, usprivacy, etc. This way, when we check if a site follows the IAB protocol, we can respond to the variation of the cookie the site has already implemented. I am mentioning this because the sites you listed above seem to be setting their own usprivacy cookie, while I have been developing according to AccuWeather's preference for us_privacy.

SebastianZimmeck commented 4 years ago

we could use the idea of storing known ad networks' cookie profiles to a JSON and extend it slightly to allow us to store known variations of the us_privacy signal, such as us_privacy, usprivacy, etc.

Good idea. Per the IAB specification, it should be us_privacy (bottom of the page). However, it certainly may be the case that some implementers use a slightly different format. Some possibilities are usprivacy, us-privacy, and us_privacy.

At the moment, I am thinking that it may be the best if you are creating two different JSON specs; one with the variations of the us_privacy string and one for the concrete ad network server URLs to visit (per issue #42). These are slightly different things (though, both could use the JSON spec idea, indeed).
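A minimal sketch of the first of those two specs follows. The array stands in for the contents of a JSON file; its name, the helper name findIabCookies, and the sample cookies are hypothetical. In the extension, the input array could come from chrome.cookies.getAll.

```javascript
// Hypothetical list of known cookie-name variations (could live in a JSON file).
const IAB_COOKIE_NAMES = ["usprivacy", "us_privacy", "us-privacy"];

// Return the cookies whose name matches a known variation, ignoring case.
function findIabCookies(cookies) {
  return cookies.filter((c) => IAB_COOKIE_NAMES.includes(c.name.toLowerCase()));
}

const sample = [
  { name: "US_Privacy", value: "1YNN" },
  { name: "session_id", value: "abc" },
];
console.log(findIabCookies(sample).map((c) => c.name)); // [ 'US_Privacy' ]
```

Lowercasing the name before the lookup also covers the case-sensitivity concern noted in the earlier cookie-modifier code.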

kalicki1 commented 4 years ago

In the newest code update, I added:

What we need to work on next is handling the case where multiple cookies exist in the browser for some reason. Our extension avoids creating new cookies on AccuWeather; however, this is not true for system.jobboard.io. When there is no cookie set and the page is loaded, the extension seems to get the opportunity to set its own before the site does, and the site apparently does not recognize the one we set by then. This can be seen here:

[Screenshot: two IAB cookies set on system.jobboard.io]

In this case, two different IAB cookies are set, one by us and one by system.jobboard.io. Since our cookie update process runs multiple times per site load as of right now (it runs every time a header is modified), we can resolve this with a check for multiple cookies and then delete the one containing our default settings in a subsequent script call. I feel that this kind of a check for multiple cookies can get quite complex however. Maybe we can find a way to avoid this in the first place altogether.

Despite this needed fix, it seems like the core functionality is in place! I will continue to test this while I work on other parts of the extension, since it seems there will be a few minor bugs that will need to get worked out.

SebastianZimmeck commented 4 years ago

In this case, two different IAB cookies are set, one by us and one by system.jobboard.io.

So, our cookie is set on the domain, jobboard.io, and the site sets the cookie on the subdomain, system.jobboard.io? Are the two different domains the problem? In other words, would there be only one cookie written if we also wrote the cookie to the subdomain?

Since our cookie update process runs multiple times per site load as of right now (it runs every time a header is modified), we can resolve this with a check for multiple cookies and then delete the one containing our default settings in a subsequent script call. I feel that this kind of a check for multiple cookies can get quite complex however. Maybe we can find a way to avoid this in the first place altogether.

As I see it at the moment, I do not think that it is a big problem that there are cookies in the domain and subdomain(s). Especially, if the site relies on setting and reading the cookie from multiple (sub)domains (not sure, is that the case?), it may be even necessary to have multiple cookies. If we can figure out exactly where the cookies are set and read, we can delete it. If we are not quite sure, I think it is OK to have multiple cookies. What would be important, though, is to have consistent values for these cookies.

kalicki1 commented 4 years ago

So, our cookie is set on the domain, jobboard.io, and the site sets the cookie on the subdomain, system.jobboard.io?

No, our cookie is the one set to the subdomain system.jobboard.io. When a new cookie needs to be made from scratch, our extension abstains from setting a specific domain. Chrome fills out this information itself based on the current URL, which gives us system.jobboard.io, in this case a specific subdomain. However, it looks like the site sets its own IAB cookie to the domain .jobboard.io and not the subdomain Chrome assigned our cookie.

Fundamentally, yes, it looks like this is the problem. You need to keep the name and the domain the same to overwrite a cookie. This is how our extension overwrites the cookie if jobboard.io places its own cookie first, something we do not have an issue with at the moment. The extension recognizes the cookie is there and then saves the site-assigned domain from that cookie at this point in the code.
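The overwrite rule can be illustrated with a toy cookie jar keyed by name, domain, and path. This is a sketch of the general rule rather than Chrome's actual storage, and the helper names and cookie values are hypothetical.

```javascript
// Sketch: cookies are distinguished by name, domain, and path.
function cookieKey(c) {
  return [c.name, c.domain.toLowerCase(), c.path || "/"].join("|");
}

function setCookie(jar, c) {
  jar.set(cookieKey(c), c); // same key overwrites, a different key adds
  return jar;
}

const jar = new Map();
// Written without an explicit domain, so it ends up on the subdomain.
setCookie(jar, { name: "usprivacy", domain: "system.jobboard.io", value: "1NYN" });
// The site's own cookie uses the parent domain, so nothing is overwritten.
setCookie(jar, { name: "usprivacy", domain: ".jobboard.io", value: "1YNN" });
console.log(jar.size); // 2
// Reusing the site-assigned domain replaces the site's cookie in place.
setCookie(jar, { name: "usprivacy", domain: ".jobboard.io", value: "1YYN" });
console.log(jar.size); // 2
```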

It is important to note that in this case, the site-assigned IAB cookie's name is different than the one we assigned. This fact alone necessitates some kind of check for multiple cookies aside from the cookie domain issue.

Especially, if the site relies on setting and reading the cookie from multiple (sub)domains (not sure, is that the case?)

This doesn't seem to be the case. Multiple cookies are set because whoever runs the given website doesn't check to see if another variation of the cookie exists (ours if we set our cookie first), albeit with slightly different parameters than the ones they chose to give it. Since the site won't handle it, our extension needs to be vigilant in such cases and make sure to use the same settings of the site-assigned cookie. This is at least my thinking at the moment.

it may be even necessary to have multiple cookies

My only concern with this is that I believe the IAB protocol mentions only one cookie should be used by a site and a user to mutually exchange the opt-out information. This leads me to think that if a user doesn't modify a site-provided IAB cookie, most site owners will not check for other variations of the cookie in the same way we do. Though this could be open to interpretation, I think I would prefer to keep only one IAB cookie per site for this reason.

kalicki1 commented 4 years ago

Here are some thoughts I have on a few rough ideas we could implement.

Making a function to 'guess' what domain to use

If it comes down to it, we could also create some sort of function to 'guess' the best domain to use when setting a us_privacy cookie if one does not exist. We could collect information on all the other cookies set by a site, count the number of times each domain shows up, and select the most frequently occurring one as the domain for our us_privacy cookie. This could possibly increase the chances a site will recognize our cookie, though we have no way of knowing for sure.
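A minimal sketch of that guess, assuming the input is the list of cookies already present for the site (for example, from chrome.cookies.getAll). The function name and the example domains are hypothetical, and ties go to whichever domain was seen first.

```javascript
// Sketch: pick the domain used most often by a site's existing cookies.
function guessCookieDomain(cookies) {
  const counts = new Map();
  for (const c of cookies) {
    counts.set(c.domain, (counts.get(c.domain) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [domain, count] of counts) {
    if (count > bestCount) {
      best = domain;
      bestCount = count;
    }
  }
  return best; // null when the site has set no cookies
}

console.log(guessCookieDomain([
  { domain: ".example.com" },
  { domain: "www.example.com" },
  { domain: ".example.com" },
])); // ".example.com"
```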

Creating cookies for each version of the us_privacy signal

Though I would really prefer to not do this, I think we do have the option to set a cookie for each variation of the IAB signal that exists (us_privacy, us-privacy, usprivacy). This way, we have three identical copies of the signal, each with a different name, in case a given site only checks for one. Until the IAB spec is clarified or many sites clearly adopt one or another, we will not know which default name to use.
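As a rough sketch of this second idea, the helper below builds one cookie description per name variation. The helper name is hypothetical; each returned object uses only the url, name, and value fields, which chrome.cookies.set accepts, so the caller could pass them on one by one.

```javascript
// Sketch: one cookie description per known name variation, same value in each.
function buildOptOutCookies(url, value) {
  const names = ["us_privacy", "us-privacy", "usprivacy"];
  return names.map((name) => ({ url, name, value }));
}

const details = buildOptOutCookies("https://example.com/", "1NYN");
console.log(details.map((d) => d.name)); // [ 'us_privacy', 'us-privacy', 'usprivacy' ]
```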

kalicki1 commented 4 years ago

Recent commit regarding multiple IAB cookies

The recent commit here removes the default cookie placed by the extension if, when the extension is called again, it recognizes multiple IAB cookies on a given URL. It does not guarantee that there will not be multiple cookies at all, but does solve the specific issue with loading our own cookie before a site gets to load their own as discussed above with jobboard.io. Since this solution only deletes one cookie, if there are three or more IAB cookies for some reason, the current URL will still have more than one cookie after this patch runs.

The big picture is that, if no cookie exists, we will place one. In the case of jobboard.io, they always end up placing their own cookie immediately after we place ours, though with different enough settings that it doesn't override the one we placed. Hence we end up with multiple cookies. This patch doesn't prevent this from occurring in the first place, but rather resolves it in a subsequent pass of the extension.
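The clean-up pass described above might look roughly like the sketch below. It is not the code from the commit: recognizing the extension's own cookie by a fixed name and value is an assumption, as are the names OUR_DEFAULT and pickCookieToRemove. The returned cookie could then be deleted with chrome.cookies.remove.

```javascript
// Hypothetical description of the cookie the extension places by default.
const OUR_DEFAULT = { name: "us_privacy", value: "1NYN" };

// Sketch: given the IAB cookies found for a URL, pick our default for removal.
// Returns null when there is nothing to clean up.
function pickCookieToRemove(iabCookies) {
  if (iabCookies.length < 2) {
    return null; // zero or one cookie: no duplicate to resolve
  }
  const ours = iabCookies.find(
    (c) => c.name === OUR_DEFAULT.name && c.value === OUR_DEFAULT.value
  );
  return ours || null;
}

const toRemove = pickCookieToRemove([
  { name: "us_privacy", value: "1NYN", domain: "system.jobboard.io" },
  { name: "usprivacy", value: "1YNN", domain: ".jobboard.io" },
]);
console.log(toRemove.domain); // "system.jobboard.io"
```

Like the patch it illustrates, this removes at most one cookie per pass, so three or more IAB cookies would still leave duplicates behind.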

SebastianZimmeck commented 4 years ago

As discussed, @kalicki1 will continue with his testing (and possibly open new issues and close this one as the concrete work becomes more clear). In principle, there are two approaches:

  1. Write cookies for different domains and with different names; one of them will be correct
  2. Identify which domain and name is the correct one

kalicki1 commented 4 years ago

Since I want to move development along in other areas of the extension and not spend too much time focused only on this issue, I will open a pull request to bring the changes made so far on this IAB CCPA implementation into the master branch. I will do the same with issue #42 to test the cookie-based code side by side and find ways to simplify the code base if possible.

If major issues surface or revisions need to take place regarding this IAB spec implementation, I will open new, focused issues that address them specifically. Seeing as the major goal of this issue is now complete, we can close this issue as well. We can continue to use this issue as a reference to problems we resolved in the past if new issues in the IAB framework come up.