privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Clarify and possibly fix crawler data scheme and issues #42

Closed SebastianZimmeck closed 1 year ago

SebastianZimmeck commented 1 year ago

In her COMP 360 project, which covers our work here, @katehausladen mentioned various smaller issues with the crawler. I am opening this issue as an omnibus issue stub. @katehausladen will take the lead on this issue, elaborate further (and/or open new issues as necessary), and discuss with @Jocelyn0830.

katehausladen commented 1 year ago

Here are some of the things I found:

katehausladen commented 1 year ago

What is the extension supposed to put in the opt out columns if the US Privacy String (USPS) is the same before and after the GPC signal is sent? It is inconsistent.

(Screenshots: inconsistent opt out column values when the USPS is unchanged before and after GPC)

This does not seem to be a problem when the value of the USPS changes after the GPC signal is sent.
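For illustration, here is a hedged sketch of the two cases (the column names uspapi_before_gpc/uspapi_after_gpc are assumptions, not the crawler's verified schema; the values follow the IAB US Privacy String format, where the third character is the opt-out-of-sale flag):

```javascript
// Illustrative only; column names are assumed, not the crawler's verified schema.

// Unambiguous case: a site that honors GPC flips the opt-out flag N -> Y.
const honoringSite = { uspapi_before_gpc: "1YNN", uspapi_after_gpc: "1YYN" };

// Ambiguous case from the screenshots: the string is unchanged, and it is
// unclear what the extension should put in the opt out columns.
const unchangedSite = { uspapi_before_gpc: "1YNN", uspapi_after_gpc: "1YNN" };

console.log(honoringSite, unchangedSite);
```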

SebastianZimmeck commented 1 year ago

Our extension was failing to send a GPC signal to more sites than in the paper data

What is unclear to me is how that is even possible. Isn't sending a GPC signal entirely in our control? Is this maybe just incorrectly recorded while we were actually sending a GPC signal?

It may be useful to re-crawl the 504 domains that used to have a Do Not Sell link (DNSL) and then spot-check some of them if necessary.

It would also be useful to (spot-)check, for sites that do not have a DNSL in the current crawl, whether they used to have one in the previous crawl, and to what extent there could be an analysis failure on our end now.

SebastianZimmeck commented 1 year ago

@katehausladen will continue to take the lead here (and the issue management). The questions for each of the three points are:

Some in-depth analysis of sites that failed, trying to logically suss out where the issue comes from, may help (e.g., using the browser developer tools, a web proxy like mitmproxy, ...).
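One quick check in the developer tools console (this reads the JS-visible DOM signal defined by the GPC spec; the Sec-GPC request header itself has to be verified in the Network tab or through the proxy):

```javascript
// Run in the browser console on a crawled page. If the signal is set,
// navigator.globalPrivacyControl should be true.
console.log(navigator.globalPrivacyControl);
```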

In addition to @Jocelyn0830, @OliverWang13 and @sophieeng will also help out. In that regard, is the setup clear as described in the readme? If not, what do we need to change?

katehausladen commented 1 year ago

3rd point is resolved: we decided that we will just filter out duplicate domains when we are analyzing the crawl data.

DNSLs: I compared 3 crawls: the paper crawl, the crawl by the Mac minis discussed above, and a crawl I did a few days ago on my Mac. I made a spreadsheet that classifies sites based on which crawls found a DNSL on that site. I manually looked at 40 sites that had a DNSL in the paper but did not have a DNSL in either recent crawl. These sites fall into a few main groups:

  1. 21/40 The DNSL does still exist. Note that for 3 of these sites, the DNSL appears in a banner that takes 1-2 seconds to load (the site is checking whether the visitor has a CA IP address). Maybe this delay causes a timing issue so that our extension does not find the link after it appears. For the other 18 sites, we'll have to figure out why it's not being found. (picture of delayed banner below)

    (Screenshot: delayed banner containing the DNSL)
  2. 10/40 The link is now named some variation of “Your Privacy Choices”/“Your Privacy Options”/“Manage Your Privacy Choices”. If this terminology is accepted as equivalent to a DNSL, we should expand the regular expression to include these variations (see the regex sketch after this list). As a side note, some of these sites included the icon Lorrie Cranor and her students developed.

  3. 3/40 The DNSL does exist, but you have to do something to see it. These fall into 2 subcategories:

    • There is a “Customize My Ad Experience” link (pic 1), and if you click it, it becomes a “Do Not Sell or Share My Information” link (pic 2). I’m not sure if this is considered compliant. This happens for all sites that are “A Raptive Partner Site.” (Screenshots: the link before and after clicking)

    • For bk.com: if you click the menu icon (top left, pic 1), the site redirects to bk.com/account, and the DNSL is there (pic 2). Since it’s on a different path, the regex does not find it. When you close the menu, the site redirects back to bk.com. Again, I’m not sure if this is technically compliant.

    (Screenshots: the bk.com menu and the DNSL on bk.com/account)
  4. 3/40 The DNSL does not exist anymore (or at least I could not find it).
  5. 3/40 The site does not exist anymore or won’t load with the VPN. Obviously there’s not much we can do about this.
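A hedged sketch of what the expanded regex could look like (illustrative only; the crawler's actual pattern in the repo differs, and the exact alternations here are assumptions):

```javascript
// Illustrative DNSL regex covering the "Your Privacy Choices" variants;
// not the crawler's actual pattern.
const dnslRegex = new RegExp(
  [
    "do\\s+not\\s+sell(\\s+or\\s+share)?\\s+my\\s+(personal\\s+)?(information|info|data)",
    "(manage\\s+)?your\\s+privacy\\s+(choices|options)",
  ].join("|"),
  "i"
);

// Quick checks against link texts seen in the manual review:
console.log(dnslRegex.test("Do Not Sell or Share My Personal Information")); // true
console.log(dnslRegex.test("Your Privacy Choices"));                         // true
console.log(dnslRegex.test("Manage Your Privacy Choices"));                  // true
```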
SebastianZimmeck commented 1 year ago

3rd point is resolved: we decided that we will just filter out duplicate domains when we are analyzing the crawl data.

Excellent! (I added checkmarks above.)

I compared 3 crawls: the paper crawl, the crawl by the Mac minis discussed above, and a crawl I did a few days ago on my Mac.

That is a great analysis, @katehausladen! It strikes me that, at least partially, the issue is on our end and can be resolved, for example, by modifying the regex for identifying Do Not Sell links. @OliverWang13, as you have worked extensively on the regex, can you also look into how it and the rest of our code should be modified to iron out the issues @katehausladen describes?

@katehausladen (and everyone), I copied your Google Sheet into our GPC Web Google Drive. So let's use the version there. (And let's store any documents that we do not have on GitHub in the GPC Web Google Drive. Feel free to create documents and directories there as you see fit.)

katehausladen commented 1 year ago

The sent_gpc issue (point 1) is tentatively resolved.

The GPC signal not being sent was an issue with our extension, not a recording problem. For sites where sent_gpc = 0 (i.e., the GPC signal was not sent), the function that adds the GPC header was never executed. After looking more at the code, I think the problem was that the function that triggers sending the headers is async but was not called with await. So, execution would continue and skip sending the header. I've only tested 1260 sites, but it appears that this issue is fixed.
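A minimal sketch of the bug pattern (all names here are illustrative, not the extension's actual identifiers):

```javascript
// Stand-in for the async step that registers the "Sec-GPC: 1" header logic.
async function enableGpc() {
  await new Promise((resolve) => setTimeout(resolve, 50));
  console.log("GPC header logic registered");
}

async function loadPage(url) {
  console.log(`navigating to ${url}`);
}

// Buggy pattern: enableGpc() returns a Promise that is never awaited, so
// navigation can happen before the header logic runs and sent_gpc ends up 0.
async function crawlSiteBuggy(url) {
  enableGpc(); // Promise dropped; no ordering guarantee
  await loadPage(url);
}

// Fixed pattern: awaiting enableGpc() guarantees the header is in place
// before the page loads.
async function crawlSiteFixed(url) {
  await enableGpc();
  await loadPage(url);
}

crawlSiteFixed("https://example.com");
```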

I think something like this may be contributing to our problem with success rate, so I'm going to look into that in the next few days.

katehausladen commented 1 year ago

The latest crawl with the regex updates found DNSLs on 807 of the 1670 sites crawled (~48.3%). This is nearly identical to the percentage found in the paper. I can do a more in-depth comparison of the sites in the paper vs. this crawl.

katehausladen commented 1 year ago

Overall, the crawler is doing a much better job identifying DNSLs, so I'm closing this issue.