privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Develop new methodology for CCPA and Do Not Sell/Share/Targeting/Profiling applicability #59

Closed: SebastianZimmeck closed this issue 1 week ago

SebastianZimmeck commented 1 year ago

We have difficulties with our Do Not Sell link identification.

  1. We have difficulties improving the accuracy of the Do Not Sell link identification.
  2. Also, there are inherent limitations, i.e., we completely miss sites that are subject to the CCPA's Do Not Sell requirement but do not post a Do Not Sell link at all.
  3. Do Not Sell link identification may even get less relevant if laws are moving in the direction of "if you accept GPC, you do not need to post a Do Not Sell link." Given this situation, could we do something better?

@OliverWang13, can you take the lead on exploring the following idea (with the help of @katehausladen, as needed)?

A two-part approach would be to first identify how many visitors a site has and, second, whether it integrates ad networks that qualify as buyers of data. In more detail:

1. Is the CCPA applicable?

First, here is the law for applicability of the CCPA (and other state laws are/will be similar). For the CCPA to be applicable, we need a "business." Here are the requirements per CCPA 1798.140(d):

"Business" means:

(1) A sole proprietorship, partnership, limited liability company, corporation, association, or other legal entity that is organized or operated for the profit or financial benefit of its shareholders or other owners, that collects consumers’ personal information, or on the behalf of which such information is collected and that alone, or jointly with others, determines the purposes and means of the processing of consumers’ personal information, that does business in the State of California, and that satisfies one or more of the following thresholds:

(A) As of January 1 of the calendar year, had annual gross revenues in excess of twenty-five million dollars ($25,000,000) in the preceding calendar year, as adjusted pursuant to paragraph (5) of subdivision (a) of Section 1798.185.

(B) Alone or in combination, annually buys, sells, or shares the personal information of 100,000 or more consumers or households.

(C) Derives 50 percent or more of its annual revenues from selling or sharing consumers’ personal information.

2. Is the CCPA's Do Not Sell requirement applicable?

Second, once we know the CCPA is applicable, is the site selling? Per the Sephora Complaint:

[I]f companies make consumer personal information available to third parties and receive a benefit from the arrangement—such as in the form of ads targeting specific consumers—they are deemed to be "selling" consumer personal information under the law.

In other words, targeting ads based on location or other consumer data collected by a third party on a site constitutes a sale by the first party.

3. Misc

OliverWang13 commented 1 year ago

I've been exploring these ideas a bit.

1.

For finding sites with a significant portion of California users, I am unsure whether any good service exists. If we had access to all of a site's Google Analytics (which is likely impossible), we could find it there. There are also web analysis services like Similarweb or Alexa that could tell us a lot of this information, but I have not seen whether you can filter by specific region. For Similarweb with a student account, you cannot, but I sent them a message asking whether that is available with a paid version and will see what they say.

Using the first Google numbers that pop up, California makes up about 11% of the US population, so we could estimate that a site with somewhat over 1,000,000 US users has 100,000 California users. In reality, I think a lot of large sites will implement these changes just due to the chance that they could have a significant number of California users at some point.
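
For a sense of scale, the estimate above can be written out (the 11% share and the 100,000-consumer threshold are from the comment; the helper names are mine):

```javascript
// Rough sketch: back out how much total US traffic a site needs before the
// CCPA's 100,000-consumer threshold plausibly applies, using California's
// ~11% share of the US population.
const CA_SHARE = 0.11;          // approximate CA share of US population
const CCPA_THRESHOLD = 100000;  // consumers/households, CCPA 1798.140(d)(1)(B)

// Estimated California users for a given US visitor count.
function estimatedCaUsers(usVisitors) {
  return Math.round(usVisitors * CA_SHARE);
}

// Minimum US visitors before the estimate crosses the CCPA threshold.
const minUsVisitors = Math.ceil(CCPA_THRESHOLD / CA_SHARE); // ~909,091
```

So "over 1,000,000 US users" comfortably clears the estimated threshold, with some margin for error in the 11% figure.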

I am not really sure what public information surrounding companies exists and I will have to look into this a bit more.

2.

I think we could detect ad networks by using our current DNSL finder methods of analyzing the network requests, but it could be difficult to tell whether the site is using targeted ads or not. In theory, this should not be incredibly difficult, but it could be more complicated than I expect. Possibly, we could also investigate the ad networks themselves, see whether targeted ads are standard practice for them, and then see which sites use which networks. Or, perhaps, are targeted ads really the industry standard? I also noticed a visual ad notice on some sites that we could possibly search for. For example, on NYT:

(Screenshot: NYT ad notice, 2023-07-24)

Do Not Sell link identification may even get less relevant if laws are moving in the direction of "if you accept GPC, you do not need to post a Do Not Sell link." Given this situation, could we do something better?

To this point, we could possibly try identifying the GPC code itself, if that is something that can be found in the network requests. This would mostly be possible if it is done in a uniform manner. Otherwise, it could suffice just to see which sites appear to respect GPC (the usp string changes to 1-Y-).
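
As a rough sketch of that last check (the helper name is hypothetical; the four-character layout follows the IAB US Privacy string format): the third character flips to Y when the user is opted out of sale, which is what the 1-Y- value above indicates.

```javascript
// Sketch: interpret the IAB US Privacy (usp) string. The format is four
// characters: version, explicit-notice, opt-out-of-sale, LSPA coverage.
// A third character of "Y" (as in "1-Y-") indicates the user has opted
// out of sale, i.e., the site appears to have honored the GPC signal.
function uspIndicatesOptOut(uspString) {
  if (typeof uspString !== "string" || uspString.length !== 4) return false;
  return uspString.charAt(2) === "Y";
}
```

A crawl could record the usp string before and after sending GPC and flag sites where this function flips from false to true.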

SebastianZimmeck commented 1 year ago

I think we could be able to detect ad networks by using our current DNSL finder methods of analyzing the network requests but it could be difficult to tell whether the site is using targeted ads or not.

Instead of reinventing the wheel, could we not integrate an open-source ad/tracking blocker/transparency extension (or part of it)? Alternatively, it may be easier to instead do a separate crawl with one of these extensions (i.e., create a Firefox instance with such an extension installed and crawl with it turned on). As it happens, Privacy Pioneer is an extension that records trackers. :) Try it out.

Possibly, we could also investigate the ad networks themselves and see if it is standard practice to use targeted ads, and then see which sites use which.

We could compile a list of ad networks that target ads based on their documentation. Also, @katehausladen's Facebook method may help. If we see a site we crawled later on Facebook, we know that there is some targeting going on. Probably not all ad networks target ads, but I would think most do.

To this point, we could possibly try identifying the GPC code itself, if that is something that could be found in the network requests.

I am not sure. We would probably find GPC-related code. For example, we would find OneTrust's GPC code if a site integrates it. But that would lead to essentially the same result as builtwith.com's approach: we know there is the potential that GPC may be turned on, but we do not know for sure. Maybe it can be done, though. Take a look at a few sites that have GPC integrated and check what the web requests look like. Is there anything in those to tell for sure whether a site integrates GPC?

SebastianZimmeck commented 1 year ago

Just a comment:

OliverWang13 commented 1 year ago

I tried out Privacy Pioneer, and it definitely seems like a tool we could use to see whether a site is subject to the CCPA, likely in conjunction with some other tools as well (in addition, it seems very well made).

So far, I have manually visited a few sites and looked for GPC-related code. On each site, I inspect the page, hit Command+Option+F (on a Mac), and search for code that ever checks the navigator.globalPrivacyControl property. For now, I cannot tell whether the presence of that code is a sure sign that a site respects GPC; I would not be surprised if it was not indicative of that at all. Either way, we could be able to glean some sort of information from the presence of this code. Perhaps that the site is subject to the CCPA? But if this code were missing, we still would not be able to tell whether a site should be subject to the CCPA. I will take a look at Kate's analysis results and see if there are patterns relating respecting GPC and including the GPC code.
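
The manual search could be approximated in code roughly like this (the regex is my guess at useful patterns, not a vetted detector; navigator.globalPrivacyControl and the Sec-GPC header are the two standard GPC surfaces):

```javascript
// Sketch: flag script text that references the GPC JavaScript property
// (navigator.globalPrivacyControl) or the GPC request header (Sec-GPC).
// Presence only shows the site *checks* GPC, not that it honors it.
const GPC_CODE_PATTERN = /navigator\.globalPrivacyControl|Sec-GPC/;

function referencesGpc(scriptText) {
  return GPC_CODE_PATTERN.test(scriptText);
}
```

Run over every script body a crawl collects, this would automate the Command+Option+F step, with the same caveat that a hit is only a weak signal.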

katehausladen commented 1 year ago

I also think using Privacy Pioneer would be a good idea. I think it records all the signals that would indicate a site shows up in the Facebook data. I went ahead and integrated it into our crawler just to see how it works. I haven't done any larger-scale testing yet, but it seems to integrate nicely. @OliverWang13, if you want to try it out, that would be great.

I have our crawler downloading the JSON data once at the end as well as before the driver restarts when an error occurs. I put the Privacy Pioneer .xpi in the commit above; I got it by going here, right-clicking "Add to Firefox", and using "Save Link As..." to save it as an .xpi file. We should only need to update the .xpi file whenever there's a new release of Privacy Pioneer.

SebastianZimmeck commented 1 year ago

I went ahead and integrated it into our crawler

Nice!

OliverWang13 commented 1 year ago

I tested @katehausladen's code, and it seemed to work pretty well, but the crawler got stuck on a few sites and would not proceed, so I had to quit it, remove the already-analyzed sites from the list, and restart the crawler. At first, I thought it was just because my computer was slow and possibly overheating, but I restarted it, closed all nonessential applications, and it still got stuck. To make sure, I then ran a normal crawl on the main branch, and everything worked fine. @katehausladen, did you see behavior like this as well? I was just crawling a random subset of our sites list.

Some of the sites I was getting stuck on:

katehausladen commented 1 year ago

I actually did see similar behavior. I thought it was due to a new Firefox version, but if the main branch is working fine for you, then that's not the case. Maybe we can try running a crawl with only Privacy Pioneer (i.e., without installing our extension) and see how that goes. Or try commenting out the Privacy Pioneer code in the branch to see if I somehow broke our extension there.

SebastianZimmeck commented 1 year ago

Not sure if related, but some of the analysis that Privacy Pioneer does is based on BERT, a large language model, which is naturally computationally intensive.

SebastianZimmeck commented 1 year ago

Especially if the slowdown occurs more on sites with heavy traffic and less on sites with little, that could be the reason. In that case, there is not much that can be done.

OliverWang13 commented 1 year ago

I looked into detecting GPC-related code to see if it would be worthwhile. I crawled 100 sites from our validation list and then manually checked whether they had GPC code to see if any patterns emerged. The results are as follows:

At first glance, I do not think there is anything extremely useful to be found by searching for GPC code; we would likely end up with a list extremely similar or identical to the one that BuiltWith uses. For more in-depth results, the sheet I used to document the findings is here.

SebastianZimmeck commented 1 year ago

Good work!

katehausladen commented 1 year ago

In issue #64, Oliver requested two main changes to PP:

  1. Implement a simple SQL database structure to make it more robust for large-scale crawling
  2. Reduce the amount of information reported (and possibly reduce the amount of information collected if it does not mess with the core functionality in a negative way)

I'll explain in terms of our code (particularly my last commit) why he suggested these changes.

When crawling with PP and our extension at the same time, our success rate drops to ~90% (from ~98% when running our extension only). This seems primarily due to the increased frequency of the error "WebDriverError: Failed to decode response from marionette". I think this error basically means that Firefox has failed because it is overwhelmed.

Oliver thought that maybe if we decrease the amount of data collected, then PP would stop overwhelming Firefox.

The other problem with the marionette error is that it impacts our data collection from PP. We get the PP data by downloading it (1) before restarting the crawler after an error has occurred and (2) when all sites have been visited. When the marionette error happens, the driver automatically kills the session. Because the session no longer exists, we can't download the PP data, which means we would lose the PP data for all sites between the last start of the crawler and this error. This is why we would want a database if possible.

SebastianZimmeck commented 1 year ago

Implement a simple SQL database structure to make it more robust for large-scale crawling

As discussed, in principle, I see no concern here of switching from JSON to a local SQL database.

Reduce the amount of information reported (and possibly reduce the amount of information collected if it does not mess with the core functionality in a negative way)

As discussed, it is probably hard to reduce functionality as we need it for PP. But possibly we only need certain data points, depending on what @OliverWang13 finds. So, exporting just those could be an option. Alternatively, having a switch in the code that turns on a "performance mode" collecting less data could be an option.

@danielgoldelman, @jjeancharles, @JoeChampeau, and @JustinCasler, please feel free to chime in. As additional background info, this is not impacting what we are currently doing but is a longer-term issue. The overall point here is to leverage PP for GPC purposes, which, as @katehausladen and @OliverWang13 point out, is currently running into some issues. @katehausladen and @OliverWang13 will continue exploring and testing.

katehausladen commented 1 year ago

I added a table to our DB for privacy pioneer data collection. The PP extension in the commit (myextension-pp.xpi) posts PP data to the pp_analysis table in addition to storing the data the way it originally did.

To create the table, use:

```sql
CREATE TABLE pp_analysis (
  id INTEGER PRIMARY KEY AUTO_INCREMENT,
  timestamp_ BIGINT,
  permission VARCHAR(255),
  rootUrl VARCHAR(255),
  snippet VARCHAR(4000),
  requestUrl VARCHAR(4000),
  typ VARCHAR(255),
  index_ VARCHAR(255),
  parentCompany VARCHAR(255),
  watchlistHash VARCHAR(255),
  extraDetail VARCHAR(4000),
  cookie BOOLEAN
);
```

I think we should separate crawling with PP and OptMeowt because Selenium/Firefox seem to fail less when they're separated. The code is largely the same for crawling with PP and OptMeowt, so I kept one crawling file and REST API. Now, by default, the crawler and REST API will crawl with OptMeowt, but you can specify PP instead using command-line arguments. To use the PP version, run `node local-crawler.js privacy-pioneer` for the crawler and `node index.js privacy-pioneer` for the REST API. To use the regular OptMeowt version, run `node local-crawler.js` for the crawler and `node index.js` (or `node index.js debug`) for the REST API as usual.

We can discuss what data we actually need to store and if we want to filter it before (by strategically placing our post requests in PP) or after it is posted to the database.

katehausladen commented 1 year ago

The updates to PP are in the sql-database branch. Basically, it just adds a function that posts data using axios and calls it in a few places. At first glance, PP seems to run the same whether or not the axios requests actually go through, but I'll do more in-depth testing on this. If performance changes (or even just to be safe), we should add some kind of toggle switch to turn the axios post requests on/off.
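
The posting path can be sketched roughly as follows. The helper names, endpoint, and payload shape are my assumptions, not the branch's actual code (which calls axios inside PP); only the column names come from the pp_analysis table.

```javascript
// Sketch: build a row matching the pp_analysis columns, then POST it to
// the crawler's REST API. Gated behind a toggle like the one discussed
// above so repacked builds are inert by default.
const SQL_DB_ENABLED = false; // stand-in for the proposed toggle switch

function buildPpRow(evidence) {
  return {
    timestamp_: evidence.timestamp,
    permission: evidence.permission,
    rootUrl: evidence.rootUrl,
    snippet: evidence.snippet,
    requestUrl: evidence.requestUrl,
    typ: evidence.typ,
    index_: String(evidence.index),      // stored as VARCHAR in the table
    parentCompany: evidence.parentCompany,
    watchlistHash: evidence.watchlistHash,
    extraDetail: evidence.extraDetail,
    cookie: Boolean(evidence.cookie),
  };
}

async function postEvidence(evidence) {
  if (!SQL_DB_ENABLED) return; // no-op unless the toggle is on
  // e.g., axios.post("http://localhost:8080/pp_analysis", buildPpRow(evidence));
}
```

The key design point is that a failed POST should never break PP's own storage path, so the request is fire-and-forget on top of the existing behavior.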

SebastianZimmeck commented 1 year ago

I added a table to our DB for privacy pioneer data collection. The PP extension in the commit (myextension-pp.xpi) posts PP data to the pp_analysis table in addition to storing the data the way it originally did.

Excellent!

I think we should separate crawling with PP and OptMeowt because selenium/firefox seem to fail less when they're separated.

Sounds good!

To use the PP version, run node local-crawler.js privacy-pioneer for the crawler and node index.js privacy-pioneer for the rest api.

Excellent! Let's add this info to the documentation once we bring in the changes to main as you describe.

katehausladen commented 1 year ago

I added a toggle so that axios post requests are not sent by default. It's just a variable called `sql_db` in `addEvidence.js`, set to `false`. If we want to repack Privacy Pioneer (which only needs to happen when there are major updates to PP), we can just change `sql_db` to `true` before packing. If `sql_db = false`, then the only difference to PP is that axios is now in its `package.json`; all other added code won't do anything.

SebastianZimmeck commented 1 year ago

Excellent! I'd say, let's bring that to the PP main branch. If you could prepare a PR to that end, that would be great, @katehausladen. (The team is working on a paper right now, so response may take two weeks :)

SebastianZimmeck commented 1 year ago

Last time, we discussed possibly using some machine learning methods to understand whether a site is subject to (1) the CCPA (or another privacy law) and/or (2) a Do Not Sell requirement.

One idea we discussed is to apply a classifier to the web traffic. Maybe the types of third parties, amount of traffic, or other properties could enable the above inferences.

One other thought: Is there a generative AI API (at no or low cost) that we could query and that would provide answers to the above? We would connect to the API, get the answer, and then do our usual business (checking privacy flags, sending the GPC signal, re-checking privacy flags).

Maybe there are also other ways to use a generative AI API here.

OliverWang13 commented 1 year ago

The link to the sheet is here. In the "selling" column, a few entries have question marks because I was unsure whether those sites are selling. I will go through and re-evaluate those shortly. Let me know if there are any questions about the sheet.

katehausladen commented 12 months ago

I looked a bit into using ChatGPT and Bard to determine whether sites were selling, based on the site's privacy policy and the CCPA definition of selling. There might be a way this could work, namely if we give it the text from the privacy policies in the prompt and then ask. We will not be able to just provide site URLs or links to privacy policies and have it determine whether that site is selling.

ChatGPT: The main problem with ChatGPT is that it does not have access to the internet. That means it cannot access privacy policies and is drawing conclusions based on its training data from September 2021 or before. Some prompts and responses: In this first one, it seems overly focused on what the main purpose of the site is.

(Screenshot: ChatGPT prompt and response, 2023-09-08)

Also, every subsequent time I tried to ask the same prompt, it returned this response:

(Screenshot: ChatGPT follow-up response, 2023-09-08)

Bard: Bard has a similar problem. It does have internet access, but it still cannot access privacy policies in real time via Google Search. Here are the prompts I used.

(Screenshot: Bard prompt and response, 2023-09-08)

This looks promising, but when I checked its "sources" for those phrases, they weren't there. So I asked it to cite its sources, which it could not do.

(Screenshot: Bard failing to cite sources, 2023-09-08)

I'll continue looking into this to see what happens when I feed it text from a privacy policy and ask it to decide.

katehausladen commented 12 months ago

I tested the idea of feeding Bard/ChatGPT privacy policies and asking whether the site sells. Accuracy and consistency of answers are the two main issues. I don't think that using either of these AIs will be a viable option for determining whether sites are subject to the CCPA.

More details: Bard: When I fed Bard the privacy policy of Cigna, found here, this is what it returned (Cigna does sell):

(Screenshot: Bard response about Cigna, 2023-09-10)

I checked the privacy policy for the source sentence, and it didn’t exist. I pointed this out to Bard, and it corrected itself.

(Screenshot: Bard correcting itself, 2023-09-10)

With these kinds of incorrect outputs, we definitely cannot rely on Bard.

ChatGPT: Attempt 1: Give it a full privacy policy and ask it to determine whether the site sells. Problems: It would often return inconclusive answers, even when it was clear that the site was selling. Then I'd argue with it, and it would correct itself based on my argument. Example failure (I used U-Haul, which does sell): Prompt:

(Screenshot: U-Haul full privacy policy prompt)

Response, plus me arguing:

(Screenshot: U-Haul response and follow-up)

Attempt 2: Give it the specific sentence from the privacy policy that indicates selling and ask it to determine whether the site sells. Problems: This also failed. It would often give wrong or inconclusive answers. Also, even when I explicitly said to return a yes or no answer, it would often write me a paragraph (not shown in the example below). Example failure:

(Screenshot: point-blank ask)

Attempt 3: Ask it to identify and return any sentences in the following text that pertain to selling data under the CCPA definition of selling information (I hoped that maybe it could still help us parse privacy policies). Problems: It would often miss sentences that indicate selling or find none at all. Example failure: Prompt:

(Screenshot: Attempt 3 prompt, 2023-09-10)

Response:

(Screenshot: Attempt 3 response, 2023-09-10)

SebastianZimmeck commented 12 months ago

These are some friendly chatbots, for sure, but a bit clueless. :)

OK, I think we call it a day here.

OliverWang13 commented 12 months ago

My testing with ChatGPT produced similar results. In addition, having gone through many privacy policies by hand, I find there are often contradictions, making it difficult to conclude whether a site is selling based on the privacy policy, even for a human.

From the work I did analyzing the Privacy Pioneer network requests, it looks like the vast majority of sites are selling data. The number of advertising networks also correlates directly with whether a site is selling, although there are outliers at times. Looking through the Disconnect list, their methodology for determining whether a site sells seems reliable and leads me to believe that the Privacy Pioneer "advertising" label is accurate and possibly valuable. Of course, the bigger issue still is that some sites may use ad networks but with an option that does not use contextual advertising, which the Disconnect list does not seem to mention. I believe that such sites are a small minority, so using the Disconnect list and Privacy Pioneer would rarely give us a false positive for whether a site sells.

Another method I could explore is navigating to a home page, automatically finding the privacy policy, and then searching for keywords using regex to try to conclude whether a site is selling. Possible phrases could be "targeted advertising," "CCPA," or "opt out of sale." This could be a little challenging due to the confusing nature of some privacy policies but could work quite well without relying on any outside extensions.
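
A minimal sketch of that regex idea, assuming we already have the policy text extracted (the phrase list is illustrative, seeded from the examples in the comment, not a validated set):

```javascript
// Sketch: count how many selling-related phrase patterns appear in a
// privacy policy's text. A higher count suggests the site discusses
// selling/sharing in the CCPA sense; zero is weak evidence either way.
const SELLING_PHRASES = [
  /targeted advertising/i,
  /\bCCPA\b/,
  /opt[- ]?out of (the )?sale/i,
  /sell(ing)? (your )?personal (information|data)/i,
];

function findSellingIndicators(policyText) {
  return SELLING_PHRASES.filter((re) => re.test(policyText)).length;
}
```

As noted above, contradictory policy language limits how far a pure keyword approach can go; this would be a cheap first-pass filter, not a classifier.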

SebastianZimmeck commented 12 months ago

In addition, having gone through many privacy policies by hand, there are often contradictions, making it difficult to conclude whether a site is selling based on the privacy policy, even for a human.

The vast majority are fairly clear, not in the sense that they explicitly say "we are selling your data," but if they are doing so, there are just a handful of variations, e.g., "we are not selling, but what we do may fall under what the CCPA calls 'selling,'" etc. If a policy talks about selling, an expert (or machine learning model) will recognize it most of the time.

Looking through the disconnect list, their methodology for finding whether a site sells

Do they have such a methodology? They only have an "advertising" category, right? And that could be contextual, behavioral, first-party, ...? So, are you saying that "advertising" == "selling," usually?

Of course, the bigger issue still is that some sites may use ad networks but may be using an option that does not use contextual advertising,

"Behavioral," right? Contextual would not require selling.

Another method I could explore is navigating to a home page, automatically finding the privacy policy, and then searching for keywords using regex to try

A rule-based approach is not the right tool here. I'd rather train our own machine learning model. If you want, feel free to explore. But before we go all in on such an endeavor, we should confirm that there are no existing solutions that do the job, especially the Disconnect list.

Can you provide some concrete results, @OliverWang13? Say, 20 random ad networks from the advertising Disconnect list.

  1. Which ones are selling?
  2. Which ones have a setting that does non-selling advertising?
  3. How can those settings be identified? Is there some flag? Can we leverage what CMPs are doing?
  4. Which setting is on by default, the selling or the non-selling one?

OliverWang13 commented 12 months ago

So, are you saying that "advertising" == "selling," usually?

From the 100 sites that I looked at, that usually seems to be the case. Disconnect also has labels like FingerprintingGeneral and FingerprintingInvasive that point toward the tracking that ad networks perform. They also cite the privacy policies of these networks to show the exact language used.

"Behavioral," right? Contextual would not require selling.

Yes, sorry, I misspoke.

Can you provide some concrete results, @OliverWang13?

I'll get on this.

SebastianZimmeck commented 12 months ago

Excellent!

Feel free to write the results in this issue or to create a separate document in our Google Drive.

SebastianZimmeck commented 12 months ago

Also, if you can copy in your write-up the link to the privacy policy and the language that identifies selling/not selling from the policy, that would be helpful.

SebastianZimmeck commented 12 months ago

Here is an overview of the current state laws that we need to map to what we see on the websites (e.g., in web traffic or in privacy policies). If you check the links, there are definitions of "targeted advertising" that I am omitting here for the time being. Let's also not worry about applicability for now.

| State Law | Applicability | Opt out Right | Privacy Preference Signal |
| --- | --- | --- | --- |
| California | "... business in the State of California, and that satisfies one or more of the following thresholds: (A) As of January 1 of the calendar year, had annual gross revenues in excess of twenty-five million dollars ($25,000,000) in the preceding calendar year, as adjusted pursuant to paragraph (5) of subdivision (a) of Section 1798.185. (B) Alone or in combination, annually buys, sells, or shares the personal information of 100,000 or more consumers or households. (C) Derives 50 percent or more of its annual revenues from selling or sharing consumers’ personal information." | "(a) A consumer shall have the right, at any time, to direct a business that sells or shares personal information about the consumer to third parties not to sell or share the consumer’s personal information." | "A consumer may authorize another person to opt-out of the sale or sharing of the consumer’s personal information and to limit the use of the consumer’s sensitive personal information on the consumer’s behalf, including through an opt-out preference signal, ..." |
| Colorado | "THIS PART 13 APPLIES TO A CONTROLLER THAT: (a) CONDUCTS BUSINESS IN COLORADO OR PRODUCES OR DELIVERS COMMERCIAL PRODUCTS OR SERVICES THAT ARE INTENTIONALLY TARGETED TO RESIDENTS OF COLORADO; AND (b) SATISFIES ONE OR BOTH OF THE FOLLOWING THRESHOLDS: (I) CONTROLS OR PROCESSES THE PERSONAL DATA OF ONE HUNDRED THOUSAND CONSUMERS OR MORE DURING A CALENDAR YEAR; OR (II) DERIVES REVENUE OR RECEIVES A DISCOUNT ON THE PRICE OF GOODS OR SERVICES FROM THE SALE OF PERSONAL DATA AND PROCESSES OR CONTROLS THE PERSONAL DATA OF TWENTY-FIVE THOUSAND CONSUMERS OR MORE." | "A CONSUMER HAS THE RIGHT TO OPT OUT OF THE PROCESSING OF PERSONAL DATA CONCERNING THE CONSUMER FOR PURPOSES OF: (A) TARGETED ADVERTISING; (B) THE SALE OF PERSONAL DATA; OR (C) PROFILING IN FURTHERANCE OF DECISIONS THAT PRODUCE LEGAL OR SIMILARLY SIGNIFICANT EFFECTS CONCERNING A CONSUMER." | "A CONTROLLER THAT PROCESSES PERSONAL DATA FOR PURPOSES OF TARGETED ADVERTISING OR THE SALE OF PERSONAL DATA SHALL ALLOW CONSUMERS TO EXERCISE THE RIGHT TO OPT OUT OF THE PROCESSING OF PERSONAL DATA CONCERNING THE CONSUMER FOR PURPOSES OF TARGETED ADVERTISING OR THE SALE OF PERSONAL DATA PURSUANT TO SUBSECTIONS (1)(a)(I)(A) AND (1)(a)(I)(B) OF THIS SECTION BY CONTROLLERS THROUGH A USER-SELECTED UNIVERSAL OPT-OUT MECHANISM THAT MEETS THE TECHNICAL SPECIFICATIONS ESTABLISHED BY THE ATTORNEY GENERAL PURSUANT TO SECTION 6-1-1313." |
| Connecticut | "The provisions of sections 1 to 11, inclusive, of this act apply to persons that conduct business in this state or persons that produce products or services that are targeted to residents of this state and that during the preceding calendar year: (1) Controlled or processed the personal data of not less than one hundred thousand consumers, excluding personal data controlled or processed solely for the purpose of completing a payment transaction; or (2) controlled or processed the personal data of not less than twenty-five thousand consumers and derived more than twenty-five per cent of their gross revenue from the sale of personal data." | "A consumer shall have the right to: ... (5) opt out of the processing of the personal data for purposes of (A) targeted advertising, (B) the sale of personal data, except as provided in subsection (b) of section 6 of this act, or (C) profiling in furtherance of solely automated decisions that produce legal or similarly significant effects concerning the consumer." | "(ii) Not later than January 1, 2025, allowing a consumer to opt out of any processing of the consumer's personal data for the purposes of targeted advertising, or any sale of such personal data, through an opt out preference signal sent, with such consumer's consent, by a platform, technology or mechanism to the controller indicating such consumer's intent to opt out of any such processing or sale" |
| Montana | "The provisions of [sections 1 through 12] apply to persons that conduct business in this state or persons that produce products or services that are targeted to residents of this state and: (1) control or process the personal data of not less than 50,000 consumers, excluding personal data controlled or processed solely for the purpose of completing a payment transaction; or (2) control or process the personal data of not less than 25,000 consumers and derive more than 25% of gross revenue from the sale of personal data." | "A consumer must have the right to: (e) opt out of the processing of the consumer's personal data for the purposes of: (i) targeted advertising; (ii) the sale of the consumer's personal data, except as provided in [section 7(2)]; or (iii) profiling in furtherance of solely automated decisions that produce legal or similarly significant effects concerning the consumer." | "Opt-out methods must: (b) by no later than January 1, 2025, allow a consumer to opt out of any processing of the consumer's personal data for the purposes of targeted advertising, or any sale of such personal data through an opt-out preference signal sent with the consumer's consent, to the controller by a platform, technology, or mechanism ..." |
| Oregon | "Sections 1 to 9 of this 2023 Act apply to any person that conducts business in this state, or that provides products or services to residents of this state, and that during a calendar year, controls or processes: (a) The personal data of 100,000 or more consumers, other than personal data controlled or processed solely for the purpose of completing a payment transaction; or (b) The personal data of 25,000 or more consumers, while deriving 25 percent or more of the person’s annual gross revenue from selling personal data" | "a consumer may: Opt out from a controller’s processing of personal data of the consumer that the controller processes for any of the following purposes: (A) Targeted advertising; (B) Selling the personal data; or (C) Profiling the consumer in furtherance of decisions that produce legal effects or effects of similar significance." | "Allow a consumer or authorized agent to send a signal to the controller that indicates the consumer’s preference to opt out of the sale of personal data or targeted advertising under section 3 (1)(d) of this 2023 Act by means of a platform, technology or mechanism ..." |
| Texas | "This chapter applies only to a person that: (1) conducts business in this state or produces a product or service consumed by residents of this state; (2) processes or engages in the sale of personal data; and (3) is not a small business as defined by the United States Small Business Administration, ..." | "opt out of the processing of the personal data for purposes of: (A) targeted advertising; (B) the sale of personal data; or (C) profiling in furtherance of a decision that produces a legal or similarly significant effect concerning the consumer." | "A consumer may designate an authorized agent using a technology, including a link to an Internet website, an Internet browser setting or extension, or a global setting on an electronic device, that allows the consumer to indicate the consumer’s intent to opt out of the processing." |
| Delaware | "This chapter applies to persons that conduct business in the State or persons that produce products or services that are targeted to residents of the State and that during the preceding calendar year did any of the following: (1) Controlled or processed the personal data of not less than 35,000 consumers, excluding personal data controlled or processed solely for the purpose of completing a payment transaction. (2) Controlled or processed the personal data of not less than 10,000 consumers and derived more than 20 percent of their gross revenue from the sale of personal data." | "(a) A consumer has the right to do all of the following:... (6) Opt out of the processing of the personal data for purposes of any of the following: a. Targeted advertising. b. The sale of personal data, except as provided in subsection (b) of § 12D-106 of this chapter. c. Profiling in furtherance of solely automated decisions that produce legal or similarly significant effects concerning the consumer." | "The consumer may designate such authorized agent by way of, among other things, a platform, technology, or mechanism, including an Internet link or a browser setting, browser extension, or global device setting, indicating such consumer’s intent to opt out of such processing. For the purposes of such designation, the platform, technology, or mechanism may function as the agent for purposes of conveying the consumer’s decision to opt-out." |
SebastianZimmeck commented 12 months ago

From the GPC spec as it currently stands:

The Colorado Privacy Act (CPA) gives consumers the legal right to opt out of both the sale of their information as well as the use of their data for cross-site targeted advertising, including through the use of “universal opt-out mechanisms that clearly communicate a consumer’s affirmative, freely given, and unambiguous choice to opt out.” Under the CPA, the GPC signal will be intended to communicate a request to opt out of both the sale of their personal information and the use of their personal information for targeted advertising.

Similarly, the Connecticut Data Privacy Act (CDPA) gives consumers separate opt-out rights for data sales and targeted advertising, including through an “authorized agent by way of, among other things, a technology, including, but not limited to, an Internet link or a browser setting, browser extension or global device setting.” Under the CDPA, the GPC signal will be intended to communicate a request to opt out of both the sale of their personal information and the use of their personal information for targeted advertising.

SebastianZimmeck commented 11 months ago

As discussed, @sophieeng and @OliverWang13 (and possibly, @Jocelyn0830, if she has time) will pick random sites from Disconnect's services.json, around 20 from each category (advertising, analytics, ...), and check whether they are involved in sell/share/targeting transactions as described above by the various laws.

The idea is that, if most services on the Disconnect list are involved in such transactions, we can replace our Do Not Sell link identification methodology with the Disconnect list. In other words, if a service is on the list, we mark a first party site on which we find the service (e.g., via Privacy Pioneer) as a seller/targeter/sharer. Evidence for services engaging in such transactions may be found in privacy policies, developer documentation, and elsewhere ...
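
To make the matching concrete, here is a minimal sketch of that idea in JavaScript. The domain entries and function names are hypothetical stand-ins; in practice the domain set would be extracted from Disconnect's services.json.

```javascript
// Sketch (not the project's actual code): flag a first party site as a likely
// seller/sharer/targeter if any of its observed third-party request hosts
// appears in a set of Disconnect-listed domains.
const disconnectDomains = new Set([
  "doubleclick.net", // hypothetical entries for illustration
  "scorecardresearch.com",
]);

function registrableSuffixMatch(hostname, domainSet) {
  // Walk up the label hierarchy so "ads.doubleclick.net" matches "doubleclick.net".
  const labels = hostname.split(".");
  for (let i = 0; i < labels.length - 1; i++) {
    if (domainSet.has(labels.slice(i).join("."))) return true;
  }
  return false;
}

function isLikelySellerTargeterSharer(thirdPartyHosts, domainSet = disconnectDomains) {
  return thirdPartyHosts.some((h) => registrableSuffixMatch(h, domainSet));
}
```

For instance, a site whose crawl shows a request to `ads.doubleclick.net` would be marked, while a site with only unlisted third parties would not.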

@OliverWang13 will set up a Google Sheet, shared with everyone, where we will keep track of the results. The Sheet should include links to the third-party services, evidence (e.g., quotes from privacy policies), and links to the evidence.

OliverWang13 commented 11 months ago

@sophieeng (and possibly, @Jocelyn0830), here is the Google Sheets link: https://docs.google.com/spreadsheets/d/1Qolfm1Akr84yUoNf_jhxnL_1KPcsXWxNbVY82Lse9dk/edit?usp=sharing

I will take rows 1-10, @sophieeng will take 11-20, @Jocelyn0830 if you have time you can take 21-30.

The "Selling" column is meant to indicate whether a site is selling (and using/sharing) personal information. The "Proof" column holds the privacy policy or documentation language that leads you to your conclusion.

"Setting not to sell" is whether the ad network offers an option to use only contextual advertising, i.e., advertising that does not use cross-contextual personal information. "Proof" is again the language that leads to that conclusion.

"By Default Selling" is whether that ad network has cross-contextual advertising turned on by default. "Proof" is as before.

"Sign of not selling" is whether a site has some sort of tag or indication of when it is selling or not. "What is it" describes what the sign is, and "Proof" is as usual.

I also added a "Relevant links" column for links to the documentation or privacy policies from which you cite your quotes.

Some of these things are very difficult to find, especially if the documentation is behind a login or unavailable. In those cases, it is alright to leave a section inconclusive so that someone else can take a look.

SebastianZimmeck commented 11 months ago

A good plan!

@OliverWang13, a few comments and questions:

SebastianZimmeck commented 11 months ago

Two important points for the analysis:

@OliverWang13, @sophieeng, (and @Jocelyn0830), can you look up how "profiling," "targeting" etc. in the laws above are defined and post it here? That will add some additional clarity as to what we have to look out for.

Edit: It is probably easiest to add "profiling" etc. columns to the table above.

SebastianZimmeck commented 11 months ago

@katehausladen (and @OliverWang13 and @Jocelyn0830), I am wondering now whether we actually need the Privacy Pioneer integration (@katehausladen's PR).

If we do not go beyond matching entries from the Disconnect list, we may not need Privacy Pioneer (unless we want to use it because it is convenient, for example). Additional Privacy Pioneer functionalities beyond the Disconnect list are, for example, identifying whether location data is shared with third parties. Here are all features listed.

A simpler alternative could be to directly work with Firefox's built-in Enhanced Tracking Protection. In particular, the urlClassification object of the webRequest.onHeadersReceived API could be useful, as it returns the classification of a third-party site based on the Disconnect list.
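
As a rough sketch of what that could look like: the helper below filters the `thirdParty` flags of a `urlClassification` object. The flag names (`tracking_ad`, `tracking_analytics`, etc.) are taken from MDN's webRequest documentation and should be verified against the browser version in use; the listener wiring is shown only in comments since it requires an extension context.

```javascript
// Sketch: extract the third-party tracking flags we care about from a
// urlClassification object (shape per MDN: { firstParty: [...], thirdParty: [...] }).
function thirdPartyTrackingFlags(
  urlClassification,
  interesting = ["tracking_ad", "tracking_analytics", "tracking_social", "fingerprinting"]
) {
  const flags = (urlClassification && urlClassification.thirdParty) || [];
  return flags.filter((f) => interesting.includes(f));
}

// In an extension background script, this would be wired up roughly like:
// browser.webRequest.onHeadersReceived.addListener(
//   (details) => {
//     const flags = thirdPartyTrackingFlags(details.urlClassification);
//     if (flags.length > 0) { /* record details.url and flags */ }
//   },
//   { urls: ["<all_urls>"] }
// );
```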

Could you look into that, @katehausladen? (Obviously, this only works if we decide to go forward with the Disconnect list in general, which depends on the analysis.)

OliverWang13 commented 11 months ago

The google sheet has been moved into the shared drive.

We will be selecting the sites by counting the total in each category and using a random number generator to assign different sites to each person.

We are also thinking we should look at sites from each category but analyze a different number of sites based on their likely importance. Currently, we think Advertising and Analytics are the two most important, so we will each analyze ten sites for those and five sites for each of the rest. For reference, the 9 categories are:

We are also adding columns to see whether the networks we are investigating are also listed under other labels. These will be called Crosslisted and Crosslisted Category. We are changing some column wording but will document the meaning in a methodology tab.

@sophieeng and I will also look through the laws for definitions of "profiling" and "targeting" and we will repost the table with the new columns.

sophieeng commented 11 months ago

I did not include the parts of the targeted advertising definitions detailing what targeted advertising does not include. If you go to a targeted advertising definition excerpt in this table, the full statement of what is not included is typically right under it.

| State Law | Targeted Advertising | Profiling |
| --- | --- | --- |
| California | I could not find a definition of "targeted advertising" in this California state law. | "“Profiling” means any form of automated processing of personal information, as further defined by regulations pursuant to paragraph (16) of subdivision (a) of Section 1798.185, to evaluate certain personal aspects relating to a natural person and in particular to analyze or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behavior, location, or movements." |
| Colorado | "(25) "TARGETED ADVERTISING": (a) MEANS DISPLAYING TO A CONSUMER AN ADVERTISEMENT THAT IS SELECTED BASED ON PERSONAL DATA OBTAINED OR INFERRED OVER TIME FROM THE CONSUMER'S ACTIVITIES ACROSS NONAFFILIATED WEBSITES, APPLICATIONS, OR ONLINE SERVICES TO PREDICT CONSUMER PREFERENCES OR INTERESTS;" | "(20) "PROFILING" MEANS ANY FORM OF AUTOMATED PROCESSING OF PERSONAL DATA TO EVALUATE, ANALYZE, OR PREDICT PERSONAL ASPECTS CONCERNING AN IDENTIFIED OR IDENTIFIABLE INDIVIDUAL'S ECONOMIC SITUATION, HEALTH, PERSONAL PREFERENCES, INTERESTS, RELIABILITY, BEHAVIOR, LOCATION, OR MOVEMENTS." |
| Connecticut | "(28) "Targeted advertising" means displaying advertisements to a consumer where the advertisement is selected based on personal data obtained or inferred from that consumer's activities over time and across nonaffiliated Internet web sites or online applications to predict such consumer's preferences or interests." | "(22) "Profiling" means any form of automated processing performed on personal data to evaluate, analyze or predict personal aspects related to an identified or identifiable individual's economic situation, health, personal preferences, interests, reliability, behavior, location or movements." |
| Montana | "(25) (a) "Targeted advertising" means displaying advertisements to a consumer in which the advertisement is selected based on personal data obtained or inferred from that consumer's activities over time and across nonaffiliated internet websites or online applications to predict the consumer's preferences or interests." | "(19) "Profiling" means any form of automated processing performed on personal data to evaluate, analyze, or predict personal aspects related to an identified or identifiable individual's economic situation, health, personal preferences, interests, reliability, behavior, location, or movements." |
| Oregon | "(19)(a) “Targeted advertising” means advertising that is selected for display to a consumer on the basis of personal data obtained from the consumer’s activities over time and across one or more unaffiliated websites or online applications and is used to predict the consumer’s preferences or interests." | "(16) “Profiling” means an automated processing of personal data for the purpose of evaluating, analyzing or predicting an identified or identifiable consumer’s economic circumstances, health, personal preferences, interests, reliability, behavior, location or movements." |
| Texas | "(31) "Targeted advertising" means displaying to a consumer an advertisement that is selected based on personal data obtained from that consumer’s activities over time and across nonaffiliated websites or online applications to predict the consumer’s preferences or interests." | "(24) "Profiling" means any form of solely automated processing performed on personal data to evaluate, analyze, or predict personal aspects related to an identified or identifiable individual’s economic situation, health, personal preferences, interests, reliability, behavior, location, or movements." |
| Delaware | "(33) “Targeted advertising” means displaying advertisements to a consumer where the advertisement is selected based on personal data obtained or inferred from that consumer’s activities over time and across nonaffiliated Internet web sites or online applications to predict such consumer’s preferences or interests." | "(25) “Profiling” means any form of automated processing performed on personal data to evaluate, analyze, or predict personal aspects related to an identified or identifiable individual’s economic situation, health, demographic characteristics, personal preferences, interests, reliability, behavior, location, or movements." |
katehausladen commented 11 months ago

A simpler alternative could be to directly work with Firefox's built-in Enhanced Tracking Protection. In particular, the urlClassification object of the webRequest.onHeadersReceived API could be useful as it returns the classification of a third party site based on the Disconnect list.

Assuming we use the Disconnect list, I would definitely be in favor of using this instead of Privacy Pioneer; it's a lot easier than running 2 separate crawls.

I can filter web requests for those that are tagged by Firefox (i.e., those in the Disconnect list). In the image below, domain is the site I am visiting, a is the request URL, and b is the urlClassification object (i.e., the first-/third-party tags of the request).

Screenshot 2023-09-17 at 9 48 51 PM

What information would we want to store? Just a boolean of whether the site has any requests that indicate selling? Or would we want to have the exact tags/request urls somewhere for reference?

SebastianZimmeck commented 11 months ago

Great, @sophieeng and @katehausladen!

What information would we want to store? Just a boolean of whether the site has any requests that indicate selling? Or would we want to have the exact tags/request urls somewhere for reference?

Generally, whatever information we can get to make the determination that the request fits a definition of targeted advertising, profiling, selling, or sharing.

SebastianZimmeck commented 11 months ago

@sophieeng, can you add the info from the new Delaware Personal Data Privacy Act to our tables above (1, 2)?

OliverWang13 commented 11 months ago

We will have to create new columns and re-examine our sites for signs of specific targeted advertising, profiling, selling, and sharing. Please refer to the tables above to learn about what constitutes each category.

Another important point is that we might want to rule out some categories of the Disconnect list. The definition of each of their categories is found here. From a quick skim, we can definitely eliminate Anti-Fraud (not even found in their services document), Cryptomining, Session Replay (also not found in services.json), Content, and probably Email + Email Aggressive. I am not sure whether Fingerprinting + Fingerprinting Invasive warrant more investigation (any thoughts, @SebastianZimmeck?). Advertising, Analytics, and Social should definitely still be investigated.

SebastianZimmeck commented 11 months ago

Sounds good, @OliverWang13!

fingerprinting + fingerprinting invasive warrant more investigation (Any thoughts, @SebastianZimmeck?)

Yes, they warrant investigation. Though, Advertising, Analytics, and Social seem more important to me. So, I would start with those.

katehausladen commented 11 months ago

Since we likely won't go forward with Privacy Pioneer, I made a separate branch to integrate the urlClassification object. I decided to store the urlClassification information in an object: the keys are the tags (i.e., tracking, fingerprinting; full list here), and the values are lists of URLs that have the tag. I trimmed the URLs so that they do not contain the "https://www." prefix or the path. This is stored in the urlClassification column.
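
A minimal sketch of that storage shape (helper names are illustrative, not the branch's actual code): given a list of tagged requests, it groups trimmed hostnames by classification flag.

```javascript
// Sketch: build the { flag: [trimmed hostnames] } object described above.
// "https://www." and the path are stripped, keeping only the hostname.
function trimUrl(url) {
  return new URL(url).hostname.replace(/^www\./, "");
}

function buildClassificationColumn(requests) {
  // requests: [{ url, flags: [...] }, ...]
  const out = {};
  for (const { url, flags } of requests) {
    const host = trimUrl(url);
    for (const flag of flags) {
      if (!out[flag]) out[flag] = [];
      if (!out[flag].includes(host)) out[flag].push(host);
    }
  }
  return out;
}
```

The resulting object can then be serialized (e.g., with `JSON.stringify`) into the urlClassification database column.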

This is what this looks like in practice:

Screenshot 2023-09-26 at 12 45 44 PM

The new SQL query to make the database table with the urlClassification column is:

```sql
CREATE TABLE entries (
  id INTEGER PRIMARY KEY AUTO_INCREMENT,
  site_id INTEGER,
  domain varchar(255),
  dns_link BOOLEAN,
  sent_gpc BOOLEAN,
  uspapi_before_gpc varchar(255),
  uspapi_after_gpc varchar(255),
  usp_cookies_before_gpc varchar(255),
  usp_cookies_after_gpc varchar(255),
  OptanonConsent_before_gpc varchar(800),
  OptanonConsent_after_gpc varchar(800),
  urlClassification varchar(5000)
);
```

OliverWang13 commented 11 months ago

I have finished summing together a lot of the important values that we found in the sheets.

52/67 (77.6%) of all of the sites that we investigated sell/buy data.

Of the 67, 43 can be found in the Advertising section. Of the 43, 38 of them sell/buy data (88.4%). If you are interested in more specifics, you can find the rest here in the results sheet.

@SebastianZimmeck, if you would like a larger sample size, @sophieeng and I could investigate a few more advertising sites. However, an 88% rate across 43 sites is not bad.

SebastianZimmeck commented 11 months ago

Overall, using the Disconnect list categories seems to make sense.

We cannot claim for an individual site that it is selling/sharing/targeting, but we can say that about 80% of sites in the aggregate are doing so.

The question is now how to pick from the Disconnect list to increase the true positives as much as possible while minimizing the false positives.

We need a 95% confidence interval or some similar statistical measure to make a claim that the aggregate results are reliable to that extent.
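
For example, a simple normal-approximation interval for the 52/67 figure reported above could be computed as follows (a sketch only; a Wilson interval would be more robust at this sample size):

```javascript
// Normal-approximation 95% confidence interval for a proportion.
function proportionCI(successes, n, z = 1.96) {
  const p = successes / n;
  const half = z * Math.sqrt((p * (1 - p)) / n);
  return [p - half, p + half];
}

const [lo, hi] = proportionCI(52, 67);
// lo ≈ 0.676, hi ≈ 0.876 — i.e., under this approximation, roughly 68%–88%
// of listed services sell/buy data, at 95% confidence.
```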

One other option we can still consider if we are not satisfied with this approach is a manually curated list of sites that are selling/sharing/targeting. That way we get to 100%.

@sophieeng and @OliverWang13 will look into this with @katehausladen taking an overall view here.

SebastianZimmeck commented 11 months ago

It seems to me we should do a two-part analysis approach:

  1. Approximation of whether a site sells/shares/tracks/profiles via the Disconnect list
  2. Deterministic analysis of the sites that we manually curate for our test set. For those sites we know for sure whether they sell/share/track/profile because we have checked them manually. We might as well leverage that knowledge in our analysis

For 2 we can be sure for individual sites. For 1 we can make claims in the aggregate for the set of sites (considering that only about 80% perform the practices in question).
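
The two-part decision could be sketched like this (site names and helper names are purely illustrative): consult the manually curated set first, and fall back to the Disconnect-based approximation otherwise.

```javascript
// Sketch of the two-part approach: deterministic answers for curated sites,
// an approximate answer (with its basis labeled) for everything else.
const curated = new Map([
  ["known-seller.example", true],  // hypothetical: manually verified to sell/share
  ["known-clean.example", false],  // hypothetical: manually verified not to
]);

function sellsOrShares(site, disconnectApprox) {
  if (curated.has(site)) {
    return { verdict: curated.get(site), basis: "manual" };
  }
  return { verdict: disconnectApprox(site), basis: "approximate" };
}
```

Labeling the basis of each verdict keeps the aggregate claims (part 1) separate from the per-site claims (part 2) in later analysis.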

katehausladen commented 9 months ago

Since the analysis is done, I went ahead and changed the extension to reflect our new methodology.

Changes:

1. The extension now only logs requests from the categories we selected (Advertising, Fingerprinting-Gen, and Social).
2. The extension no longer looks for Do Not Sell links (DNSLs).

The new database creation command for the issue-59-3 branch is:

```sql
CREATE TABLE entries (
  id INTEGER PRIMARY KEY AUTO_INCREMENT,
  site_id INTEGER,
  domain varchar(255),
  sent_gpc BOOLEAN,
  uspapi_before_gpc varchar(255),
  uspapi_after_gpc varchar(255),
  usp_cookies_before_gpc varchar(255),
  usp_cookies_after_gpc varchar(255),
  OptanonConsent_before_gpc varchar(800),
  OptanonConsent_after_gpc varchar(800),
  urlClassification varchar(5000)
);
```

SebastianZimmeck commented 9 months ago

The new database creation command for the issue-59-3 branch is ...

Can you include that in the readme?

katehausladen commented 9 months ago

Since we've settled on this method and merged the code, I'm closing this issue. I'll update the readme to reflect the changes (issue #65).