privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Analyze CCPA compliance of 1,000 sites or so #12

Closed SebastianZimmeck closed 2 years ago

SebastianZimmeck commented 2 years ago

Once we have finalized the current implementation tasks, it would be a nice paper contribution to actually crawl some sites. I am imagining running OptMeowt in analysis mode on 1,000 sites or so: "small big data" as a proof of concept of our data analysis. We already know the performance of the analysis mode, so this would not require any ground truth analysis, just running OptMeowt in analysis mode and recording the results.

Maybe we could use the IAB members directory to find sites with an IAB US Privacy String/USPAPI implementation. I am not sure if the Tranco list is of any help.

I do not think it is necessary to build any batch analysis functionality into OptMeowt, as that would add another layer of complexity. However, two other options are (1) manually visiting different sites or (2) implementing some external driver. On the latter, here is what I wrote in a previous issue that we did not pursue at the time.

  1. Use an external script to drive a Selenium instance with OptMeowt installed (in our script and data repo)

The crawling functionality could also be external to OptMeowt. I played around with this option using a basic setup of Selenium for Firefox and installing Firefox extensions in Selenium. There are also some setup steps, such as changing the path variable. Essentially, OptMeowt would stay as is and we provide an external script and setup instructions to do a crawl.

  • Advantages: No additional complexity in OptMeowt's codebase. Potentially easier to fix problems, as analysis and crawling logic are clearly separated. Probably also quicker, especially with the headless crawling option. A little more elegant in headless mode, as there is no visible opening and closing of tabs, navigating, etc.
  • Disadvantages: Probably more easily detectable as automated crawling, with all the problems that come with it. On the other hand, we are hitting different websites and only briefly, so maybe not a big deal. Setup is more complicated; in particular, we would need to ensure that OptMeowt is in Analysis mode and not Protection mode.
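
To make the external-driver option a bit more concrete, here is a minimal Python sketch of what such a script might look like. It is not the script from the previous issue. The extension path `optmeowt.xpi`, the helper names, and the CSV layout are all hypothetical, and the sketch does not switch OptMeowt into Analysis mode, which would still have to be handled separately. It only uses Selenium calls that exist for Firefox (`install_addon`, `execute_script`), and it records whether a page exposes a `__uspapi` function as a stand-in for the real analysis results.

```python
import csv
import os
from typing import Callable, Iterable, TextIO

# Hypothetical path to a packaged OptMeowt build; not taken from this thread.
EXTENSION_PATH = "optmeowt.xpi"


def make_firefox_driver(extension_path: str = EXTENSION_PATH, headless: bool = True):
    """Start Firefox through Selenium and load the extension as a temporary add-on."""
    # Imported lazily so the crawl loop below can be exercised without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    if headless:
        options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    driver.set_page_load_timeout(30)
    # temporary=True lets Firefox load an unsigned build for this session only.
    driver.install_addon(os.path.abspath(extension_path), temporary=True)
    return driver


def selenium_visitor(driver) -> Callable[[str], bool]:
    """Build a visit function that loads a site and checks for a USPAPI function."""
    def visit(site: str) -> bool:
        driver.get("https://" + site)
        return bool(driver.execute_script(
            "return typeof window.__uspapi === 'function';"))
    return visit


def crawl(sites: Iterable[str], visit: Callable[[str], object], out: TextIO) -> int:
    """Visit each site once, write one CSV row per site, return the success count."""
    writer = csv.writer(out)
    writer.writerow(["site", "status", "detail"])
    ok = 0
    for site in sites:
        try:
            detail = visit(site)
            status = "ok"
            ok += 1
        except Exception as exc:  # one broken site should not stop the crawl
            status, detail = "error", str(exc)
        writer.writerow([site, status, detail])
    return ok
```

A crawl would then be roughly `driver = make_firefox_driver()` followed by `crawl(sites, selenium_visitor(driver), open("results.csv", "w", newline=""))` and `driver.quit()`. Keeping the visit function separate from the loop mirrors the point above that analysis and crawling logic stay clearly separated.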

Not sure exactly how to tackle it. Let's discuss next meeting ...

Jocelyn0830 commented 2 years ago

Below is a spotcheck of 20 US sites on the builtwith TCF list (https://pro.builtwith.com/report/list/e813bdf2-dcc3-466d-88ec-25b20b3c2450) to find out how many have the IAB US Privacy String/USPAPI. Here are the sites:

Out of 20 sites tested, 15 sites have the IAB US Privacy String/USPAPI.
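
For reference, a site that implements the USPAPI returns a US Privacy String such as `1YNN`. Below is a minimal Python sketch of how such a string could be decoded, assuming the four-character layout of the IAB US Privacy String (version, notice given, opt-out of sale, LSPA covered, with `Y`, `N`, or `-` in the last three positions). The class and function names are hypothetical, uppercasing the input is an assumption of this sketch, and this is not necessarily how the spotcheck above was done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class USPrivacy:
    version: int
    notice_given: str   # "Y", "N", or "-" (not applicable)
    opt_out_sale: str   # "Y" means the user opted out of sale
    lspa_covered: str


def parse_us_privacy(value: str) -> USPrivacy:
    """Decode a four-character US Privacy String such as '1YNN'."""
    value = value.strip().upper()  # assumption: treat input case-insensitively
    if len(value) != 4 or not value[0].isdigit():
        raise ValueError(f"not a US Privacy String: {value!r}")
    for flag in value[1:]:
        if flag not in ("Y", "N", "-"):
            raise ValueError(f"unexpected flag {flag!r} in {value!r}")
    return USPrivacy(int(value[0]), value[1], value[2], value[3])
```

With this, `parse_us_privacy("1YNN").opt_out_sale` is `"N"`, meaning the string reports that the user has not opted out of sale.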

SebastianZimmeck commented 2 years ago

OK, thanks, @Jocelyn0830. That looks promising!

I looked more into this. @Jocelyn0830, can you do the following?

I may also contact the builtwith.com people and ask if they would be willing to give us the whole list or a bigger part for research purposes. Otherwise, their basic plan is $300, and we do not have a budget for that.

SebastianZimmeck commented 2 years ago

I just realized it is even possible to zoom in on California. So, let's transcribe all the different California sites first and see how many we get.

SebastianZimmeck commented 2 years ago

Also, the very high traffic volume ones are important to get because they are most likely subject to the California Consumer Privacy Act. So, please include those as well, @Jocelyn0830.

Jocelyn0830 commented 2 years ago

I transcribed all the California sites and US sites with very high traffic volume into the Google sheet. In total, I got 224 sites, of which 65 are California sites and 189 have very high traffic volume.

Below is a spotcheck of 25 sites in the Google sheet to find out how many have the IAB US Privacy String/USPAPI:

Out of 25 sites tested, 19 sites have the IAB US Privacy String/USPAPI.
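
The two spotchecks give 15/20 = 75% and 19/25 = 76%. Since both samples are small, a rough interval around those shares may be useful. The Python sketch below uses a Wilson score interval, which is a choice made for this sketch and not something used in this thread; the function name is hypothetical, and the samples are treated as if they were random draws, which the spotchecks may not have been.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Counts reported in the two spotchecks in this thread.
for successes, trials in [(15, 20), (19, 25)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials} = {successes / trials:.0%}, "
          f"interval roughly {low:.0%} to {high:.0%}")
```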

SebastianZimmeck commented 2 years ago

Nice!

Jocelyn0830 commented 2 years ago

The Google sheet is updated with 730 unique sites transcribed from the TCF US list.
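
If the list keeps growing, duplicates could also be removed with a small script. This is only a sketch under assumptions: the function names are hypothetical, and it treats entries as duplicates when their hostnames match after lowercasing and dropping a leading `www.`, which may or may not be how uniqueness was judged for the sheet.

```python
from urllib.parse import urlsplit


def normalize_site(entry: str) -> str:
    """Reduce a URL or bare domain to a lowercase hostname without 'www.'."""
    entry = entry.strip().lower()
    if "://" not in entry:
        entry = "//" + entry  # lets urlsplit treat a bare domain as a host
    host = urlsplit(entry).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def unique_sites(entries):
    """Return normalized hostnames in first-seen order, skipping repeats."""
    seen = set()
    result = []
    for entry in entries:
        host = normalize_site(entry)
        if host and host not in seen:
            seen.add(host)
            result.append(host)
    return result
```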

SebastianZimmeck commented 2 years ago

Excellent work, @Jocelyn0830!

SebastianZimmeck commented 2 years ago

This issue is superseded by #16.