Compare Crawler + VM to Local Data

dadak-dom commented 4 months ago

I figured that it made more sense to separate this task from testing the crawler (#9) since they're fairly different when it comes to methodology (making sure the extension works vs. showing that the environment in which the extension is running is not disruptive). This will make it easier for us to track our progress on this issue instead of putting everything into #9.

As a reminder, this task is essentially to provide justification for our crawling methodology (crawler + VM to replicate regular user experience). The goal is to show that including these two variables doesn't significantly impact the data that we collect, i.e., that it's a fair representation of what a regular user with Privacy Pioneer would collect. While @atlasharry is finalizing the test list, I am making progress on this front. @atlasharry has also volunteered to collect data in Korea in addition to any results that I get.

dadak-dom commented 4 months ago

I've finished the scripts for the comparisons that we'll be making when comparing local data vs. VM data vs. VM + Crawler data. The Google Colab can be found here in our Google Drive. Instructions for running the script are found at the link, so when the time comes, @atlasharry should be able to just upload his data, run the script, and get some results. I will also probably do my own data collection, but that will probably be after the ground truth analysis for #9 as discussed last Friday.

Here's a quick rundown of why this script was made, and what it's meant to do:

We needed a way to be able to compare a website's treatment of a user depending on where Privacy Pioneer was collecting information from. However, natural fluctuations of web requests posed a challenge; crawling each location once would oftentimes result in major discrepancies between data collected. The vast majority of these discrepancies could be explained by the requests simply not existing for a particular run, but that wouldn't be a very convincing argument for the validity of our crawl procedure. How would we know that websites weren't treating users on a VM completely differently?
The simplest way to resolve this would be to visit the website multiple times. But then, how do we know that the requests aren't differing massively between runs? To solve this, I created some data analysis scripts.
The general idea of my scripts was to do a couple of things. I've also attached a little graphic that helped me visualize my process.
- Collect Privacy Pioneer data for each website from a test list multiple times. This data will be added to a "data pool", which will help us identify unique domains that are responsible for different categories of evidence.
- Create a "profile" of every website that we're testing on. Each website URL has its own dictionary, which then has sub-dictionaries for each category, which contain lists of unique domains that have been encountered. These profiles serve to answer the question, "what requests can I expect to see for a given category when visiting a site?"
- To finish the profiles, we take note of how many times each URL appeared on a site for a category.
- Once we have an idea of what a site looks like (i.e., we've collected 3 runs for twitter.com both locally and in the cloud), we can compare these profiles across treatment groups, while excluding outliers, which we've defined to be requests that only occurred once.
- I've also attached an example of what one of the resulting tables should look like. Proportions are also calculated, as well as statistics for the outliers.
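
As a rough illustration of the profile idea in the list above (this is a minimal sketch, not the actual Colab script), the steps could look like the following. The input layout (one dict per run, mapping site to category to a list of domains), the function names, and the choice to count a domain at most once per run are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_profile(runs):
    """Pool several runs into {site: {category: Counter(domain -> runs seen)}}.

    Assumed input: each run is {site: {category: [domain, ...]}}.
    """
    profile = defaultdict(lambda: defaultdict(Counter))
    for run in runs:
        for site, categories in run.items():
            for category, domains in categories.items():
                # Assumption: a domain counts at most once per run.
                profile[site][category].update(set(domains))
    return profile


def drop_outliers(profile, min_runs=2):
    """Keep only domains seen in at least `min_runs` runs (drops one-off requests)."""
    return {
        site: {
            category: {d for d, n in counts.items() if n >= min_runs}
            for category, counts in categories.items()
        }
        for site, categories in profile.items()
    }


def compare(profile_a, profile_b):
    """Compare two outlier-free profiles per (site, category)."""
    rows = {}
    for site in sorted(set(profile_a) | set(profile_b)):
        cats_a, cats_b = profile_a.get(site, {}), profile_b.get(site, {})
        for category in sorted(set(cats_a) | set(cats_b)):
            a, b = cats_a.get(category, set()), cats_b.get(category, set())
            rows[(site, category)] = {
                "shared": a & b,
                "only_a": a - b,
                "only_b": b - a,
            }
    return rows
```

With three local runs and three VM runs, `compare(drop_outliers(build_profile(local_runs)), drop_outliers(build_profile(vm_runs)))` would give, for each site and category, the domains both groups agree on and the ones that only one group saw more than once.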

SebastianZimmeck commented 4 months ago

Nice work, @dadak-dom!

The core measures we will be using are precision, recall, and F1 score.

SebastianZimmeck commented 4 months ago

As we discussed, we will proceed per @dadak-dom's two-step analysis approach:

1. Privacy Pioneer + VM + Crawler "internal" analysis

In other words, does Privacy Pioneer work properly when adding a VM and the Crawler? This analysis is similar to just a straight Privacy Pioneer analysis except that we add VM + Crawler.

2. Privacy Pioneer + VM + Crawler vs Privacy Pioneer analysis

This is the second step, in which we check how much, if any, of the analysis is incorrect if we add a VM and Crawler. To account for the natural fluctuation of site loads, @dadak-dom will run Privacy Pioneer + VM + Crawler and Privacy Pioneer each three times. If an analysis instance/result shows up in two (or three) of the three runs, we can count it. Otherwise, it might be natural fluctuation. We will also get a sense of the rate of fluctuation by comparing the differences among the three runs within each setup (i.e., within the set of three Privacy Pioneer + VM + Crawler runs and within the set of three Privacy Pioneer runs). @atlasharry will repeat this test once he is in Seoul.
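
A minimal sketch of the "count it if it shows up two or three times" rule, assuming each run is represented as a set of hashable results (for example `(site, category, domain)` tuples). The function names are hypothetical, and the fluctuation measure used here (mean pairwise Jaccard distance between runs of the same setup) is only one possible way to quantify the differences within a set of three runs; it is not something we have agreed on.

```python
from collections import Counter
from itertools import combinations


def stable_results(runs, min_runs=2):
    """Return results that appear in at least `min_runs` of the given runs."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # each result counts once per run
    return {result for result, n in counts.items() if n >= min_runs}


def fluctuation_rate(runs):
    """Mean pairwise Jaccard distance between runs (0 = identical runs).

    Assumption: this is an illustrative measure, not an agreed-upon one.
    """
    distances = []
    for a, b in combinations([set(r) for r in runs], 2):
        union = a | b
        distances.append(1 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances) if distances else 0.0
```

Running both functions once on the three Privacy Pioneer + VM + Crawler runs and once on the three Privacy Pioneer runs would give a stable result set and a fluctuation figure per setup.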

3. Misc

For both steps, we should calculate precision, recall, and F1. We should only calculate these scores from the positive instances and not the negative instances (i.e., not use weighted scores).
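
For reference, a small sketch of how the positive-class scores could be computed from two sets of results. It assumes `predicted` holds the results reported by the setup under test and `actual` holds the reference results (ground truth in step 1, or the plain Privacy Pioneer results in step 2); both names and the set representation are assumptions for illustration. True negatives are never used, so nothing is averaged with a negative-class score.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for the positive class only."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # reported and in the reference
    fp = len(predicted - actual)   # reported but not in the reference
    fn = len(actual - predicted)   # in the reference but not reported
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; we may prefer to report such cases as undefined instead.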

As @dadak-dom mentioned, this may take some time, which, however, is time well spent and will pay off later. @atlasharry will help @dadak-dom with the analysis.

We can also do some spotchecks after we have done the crawls.

Relates to #9.

SebastianZimmeck commented 1 month ago

I am closing this issue to continue all testing discussions in #9.

privacy-tech-lab / privacy-pioneer-web-crawler