privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Clarify and possibly fix crawler data scheme and issues #42

Closed SebastianZimmeck closed 1 year ago

SebastianZimmeck commented 1 year ago

In her COMP 360 project, which covers our work here, @katehausladen mentioned various smaller issues with the crawler. I am opening this issue as an omnibus issue stub. @katehausladen will take the lead on this issue, elaborate further (and/or open new issues as necessary), and discuss with @Jocelyn0830.

katehausladen commented 1 year ago

Here are some of the things I found:

katehausladen commented 1 year ago

What is the extension supposed to put in the opt out columns if the US Privacy String (USPS) is the same before and after the GPC signal is sent? It is inconsistent.

(Screenshots: inconsistent opt out column values when the USPS is unchanged before and after GPC)

This does not seem to be a problem when the value of the USPS changes after the GPC signal is sent.
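For illustration, here is a hedged sketch of the two cases (the column names uspapi_before_gpc/uspapi_after_gpc are assumptions, not the crawler's verified schema; the values follow the IAB US Privacy String format, where the third character is the opt-out-of-sale flag):

```javascript
// Illustrative only; column names are assumed, not the crawler's verified schema.

// Unambiguous case: a site that honors GPC flips the opt-out flag N -> Y.
const honoringSite = { uspapi_before_gpc: "1YNN", uspapi_after_gpc: "1YYN" };

// Ambiguous case from the screenshots: the string is unchanged, and it is
// unclear what the extension should put in the opt out columns.
const unchangedSite = { uspapi_before_gpc: "1YNN", uspapi_after_gpc: "1YNN" };

console.log(honoringSite, unchangedSite);
```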

SebastianZimmeck commented 1 year ago

Our extension was failing to send a GPC signal to more sites than in the paper data

What is unclear to me is how that is even possible. Isn't sending a GPC signal entirely in our control? Is this maybe just incorrectly recorded while we were actually sending a GPC signal?

It may be useful to re-crawl the 504 domains that used to have a Do Not Sell link (DNSL) and then spot-check some of them if necessary.

It would also be useful to (spot-)check, for sites that do not have a DNSL in the current crawl, whether they used to have one in the previous crawl, and to what extent there could be an analysis failure on our end now.

SebastianZimmeck commented 1 year ago

@katehausladen will continue to take the lead here (and the issue management). The questions for each of the three points are:

Some in-depth analysis of sites that failed, trying to logically suss out where the issue comes from, may help (e.g., using the browser developer tools, a web proxy like mitmproxy, ...).
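One quick check in the developer tools console (this reads the JS-visible DOM signal defined by the GPC spec; the Sec-GPC request header itself has to be verified in the Network tab or through the proxy):

```javascript
// Run in the browser console on a crawled page. If the signal is set,
// navigator.globalPrivacyControl should be true.
console.log(navigator.globalPrivacyControl);
```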

In addition to @Jocelyn0830, @OliverWang13 and @sophieeng will also help out. In that regard, is the setup clear as described in the readme? If not, what do we need to change?

katehausladen commented 1 year ago

3rd point is resolved: we decided that we will just filter out duplicate domains when we are analyzing the crawl data.

DNSLs: I compared 3 crawls: the paper crawl, the crawl by the Mac minis discussed above, and a crawl I did a few days ago on my Mac. I made a spreadsheet that classifies sites based on which crawls found a DNSL on that site. I manually looked at 40 sites that had a DNSL in the paper but did not have a DNSL in either recent crawl. These sites fall into a few main groups:

  1. 21/40 The DNSL does still exist. Note that for 3 of these sites, the DNSL appears in a banner that takes 1-2 seconds to load (the site is checking whether the visitor has a CA IP address). Maybe this delay causes a timing issue so that our extension does not find the link after it appears. For the other 18 sites, we'll have to figure out why it's not being found. (picture of delayed banner below)

    (Screenshot: delayed banner containing the DNSL)
  2. 10/40 The link is now named some variation of “Your Privacy Choices”/“Your Privacy Options”/“Manage Your Privacy Choices”. If this terminology is accepted as equivalent to a DNSL, we should expand the regular expression to include these variations (see the regex sketch after this list). As a side note, some of these sites included the icon Lorrie Cranor and her students developed.

  3. 3/40 The DNSL does exist, but you have to do something to see it. These fall into 2 subcategories:

    • There is a “Customize My Ad Experience” link (pic 1), and if you click it, it becomes a “Do Not Sell or Share My Information” link (pic 2). I’m not sure if this is considered compliant. This happens for all sites that are “A Raptive Partner Site.” (Screenshots: the link before and after clicking)

    • For bk.com: if you click the menu icon (top left, pic 1), the site redirects to bk.com/account, and the DNSL is there (pic 2). Since it’s on a different path, the regex does not find it. When you close the menu, the site redirects back to bk.com. Again, I’m not sure if this is technically compliant.

    (Screenshots: the bk.com menu and the DNSL on bk.com/account)
  4. 3/40 The DNSL does not exist anymore (or at least I could not find it).
  5. 3/40 The site does not exist anymore or won’t load with the VPN. Obviously there’s not much we can do about this.
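A hedged sketch of what the expanded regex could look like (illustrative only; the crawler's actual pattern in the repo differs, and the exact alternations here are assumptions):

```javascript
// Illustrative DNSL regex covering the "Your Privacy Choices" variants;
// not the crawler's actual pattern.
const dnslRegex = new RegExp(
  [
    "do\\s+not\\s+sell(\\s+or\\s+share)?\\s+my\\s+(personal\\s+)?(information|info|data)",
    "(manage\\s+)?your\\s+privacy\\s+(choices|options)",
  ].join("|"),
  "i"
);

// Quick checks against link texts seen in the manual review:
console.log(dnslRegex.test("Do Not Sell or Share My Personal Information")); // true
console.log(dnslRegex.test("Your Privacy Choices"));                         // true
console.log(dnslRegex.test("Manage Your Privacy Choices"));                  // true
```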
SebastianZimmeck commented 1 year ago

3rd point is resolved: we decided that we will just filter out duplicate domains when we are analyzing the crawl data.

Excellent! (I added checkmarks above.)

I compared 3 crawls: the paper crawl, the crawl by the Mac minis discussed above, and a crawl I did a few days ago on my Mac.

That is a great analysis, @katehausladen! It strikes me that, at least partially, the issue is on our end and can be resolved, for example, by modifying the regex for identifying Do Not Sell links. @OliverWang13, as you have worked extensively on the regex, can you also look into how it and the rest of our code should be modified to iron out the issues @katehausladen describes?

@katehausladen (and everyone), I copied your Google Sheet into our GPC Web Google Drive. So let's use the version there. (And let's store any documents that we do not have on GitHub in the GPC Web Google Drive. Feel free to create documents and directories there as you see fit.)

katehausladen commented 1 year ago

The sent_gpc issue (point 1) is tentatively resolved.

The GPC signal not being sent was an issue with our extension, not a recording problem. For sites where sent_gpc = 0 (i.e., the GPC signal was not sent), the function that adds the GPC header was never executed. After looking more at the code, I think the problem was that the function that triggers sending the headers is async but was not called with await. So, execution would continue and skip sending the header. I've only tested 1260 sites, but it appears that this issue is fixed.
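A minimal sketch of the bug pattern (all names here are illustrative, not the extension's actual identifiers):

```javascript
// Stand-in for the async step that registers the "Sec-GPC: 1" header logic.
async function enableGpc() {
  await new Promise((resolve) => setTimeout(resolve, 50));
  console.log("GPC header logic registered");
}

async function loadPage(url) {
  console.log(`navigating to ${url}`);
}

// Buggy pattern: enableGpc() returns a Promise that is never awaited, so
// navigation can happen before the header logic runs and sent_gpc ends up 0.
async function crawlSiteBuggy(url) {
  enableGpc(); // Promise dropped; no ordering guarantee
  await loadPage(url);
}

// Fixed pattern: awaiting enableGpc() guarantees the header is in place
// before the page loads.
async function crawlSiteFixed(url) {
  await enableGpc();
  await loadPage(url);
}

crawlSiteFixed("https://example.com");
```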

I think something like this may be contributing to our problem with success rate, so I'm going to look into that in the next few days.

katehausladen commented 1 year ago

The latest crawl with the regex updates found DNSLs on 807 of the 1670 sites crawled (~48.3%). This is nearly identical to the percentage found in the paper. I can do a more in-depth comparison of the sites in the paper vs. this crawl.

katehausladen commented 1 year ago

Overall, the crawler is doing a much better job identifying DNSLs, so I'm closing this issue.