privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Compare Crawler + VM to Local Data #49

Closed dadak-dom closed 1 month ago

dadak-dom commented 4 months ago

I figured that it made more sense to separate this task from testing the crawler (#9) since they're fairly different when it comes to methodology (making sure the extension works vs. showing that the environment in which the extension is running is not disruptive). This will make it easier for us to track our progress on this issue instead of putting everything into #9.

As a reminder, this task is essentially to provide justification for our crawling methodology (crawler + VM to replicate regular user experience). The goal is to show that including these two variables doesn't significantly impact the data that we collect, i.e., that it's a fair representation of what a regular user with Privacy Pioneer would collect. While @atlasharry is finalizing the test list, I am making progress on this front. @atlasharry has also volunteered to collect data in Korea in addition to any results that I get.

dadak-dom commented 4 months ago

I've finished the scripts for comparing local data vs. VM data vs. VM + Crawler data. The Google Colab can be found here in our Google Drive. Instructions for running the script are at the link, so when the time comes, @atlasharry should be able to just upload his data, run the script, and get some results. I will probably also do my own data collection, but only after the ground truth analysis for #9, as discussed last Friday.

Here's a quick rundown of why this script was made, and what it's meant to do:

[Three attached images]

SebastianZimmeck commented 4 months ago

Nice work, @dadak-dom!

The core measures we will be using are precision, recall, and F1 score.

SebastianZimmeck commented 4 months ago

As we discussed, we will proceed per @dadak-dom's two step analysis approach:

1. Privacy Pioneer + VM + Crawler "internal" analysis

In other words, does Privacy Pioneer work properly when adding a VM and the Crawler? This analysis is similar to just a straight Privacy Pioneer analysis except that we add VM + Crawler.

2. Privacy Pioneer + VM + Crawler vs Privacy Pioneer analysis

This is the second step, in which we check how much, if any, of the analysis becomes incorrect when we add a VM and the Crawler. To account for the natural fluctuation of site loads, @dadak-dom will run Privacy Pioneer + VM + Crawler and Privacy Pioneer alone three times each. If an analysis instance/result shows up in two (or three) of the three runs, we count it. Otherwise, it might be natural fluctuation. We will also get a sense of the rate of fluctuation by comparing the differences among the three runs within each set (i.e., within the three runs of Privacy Pioneer + VM + Crawler and within the three runs of Privacy Pioneer). @atlasharry will repeat this test once he is in Seoul.
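A minimal sketch of the two-out-of-three rule and the fluctuation check described in this step. It assumes each run is exported as a set of hypothetical `(site, label)` pairs; the function names are illustrative, and the mean pairwise Jaccard distance is only one possible way to quantify fluctuation, not necessarily what the Colab script does.

```python
from collections import Counter
from itertools import combinations


def stable_findings(runs, min_runs=2):
    """Return the findings that appear in at least `min_runs` of the runs.

    Each run is an iterable of hashable findings, e.g. hypothetical
    (site, label) tuples. Duplicates within a single run count once.
    """
    counts = Counter(finding for run in runs for finding in set(run))
    return {finding for finding, n in counts.items() if n >= min_runs}


def fluctuation_rate(runs):
    """Mean Jaccard distance over all pairs of runs.

    0.0 means every run produced identical findings; 1.0 means no two
    runs share any finding. This metric is an assumption for illustration.
    """
    distances = []
    for a, b in combinations(runs, 2):
        a, b = set(a), set(b)
        union = a | b
        # Two empty runs are treated as identical (distance 0).
        distances.append(1 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances) if distances else 0.0
```

Calling `stable_findings` on the three Privacy Pioneer + VM + Crawler runs, and again on the three Privacy Pioneer runs, would give the two filtered sets to compare; `fluctuation_rate` on each set of three gives a rough sense of how much results vary from run to run.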

3. Misc

For both steps, we should calculate precision, recall, and F1. We should only calculate these scores from the positive instances and not the negative instances (i.e., not use weighted scores).
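As a rough illustration of scoring from positive instances only, the sketch below computes precision, recall, and F1 from two sets of findings. The function name and the representation of findings as hashable items are assumptions; for step 2, the reference set would be the Privacy Pioneer results and the predicted set the Privacy Pioneer + VM + Crawler results.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall, and F1 computed over positive instances only.

    `predicted` and `reference` are iterables of hashable findings.
    True negatives never enter the calculation, so the scores are not
    weighted by the (typically much larger) negative class.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # found and expected
    fp = len(predicted - reference)   # found but not expected
    fn = len(reference - predicted)   # expected but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Returning 0.0 when a denominator is zero is a convention chosen here for convenience; a run with no positives at all may deserve separate handling.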

As @dadak-dom mentioned, this may take some time, but it is time well spent and will pay off later. @atlasharry will help @dadak-dom with the analysis.

We can also do some spot checks after we have done the crawls.

Relates to #9.

SebastianZimmeck commented 1 month ago

I am closing this issue to continue all testing discussions in #9.