Test crawler performance

SebastianZimmeck commented 6 months ago

Before we start the crawl, we need to test the crawler's performance. So, we need to compare the manually observed groundtruth with the analysis results. We probably need a 100-site test set.

How do we select the test set sites given the different locations and states (issue #7) so that we have good test coverage?
One issue is that different loads of a site may lead to different trackers etc. detected. So, we need to look for the groundtruth and analysis results at exactly the same site load. So, maybe, just load one site, get both groundtruth and analysis results and check?
We need to document all of that

(@JoeChampeau and @jjeancharles feel free to participate here as well.)

SebastianZimmeck commented 5 months ago

Where are we with the testing protocol, @danielgoldelman?

danielgoldelman commented 5 months ago

Preliminary testing protocol

Run the crawl, collect the data. 
Separate the pp data from the entries data.

For pp data:

Create spreadsheet for each root url 
Log every piece of data into the spreadsheet with everything pp gives us, separated by pp data type

For all http request data: 

Create spreadsheet for each root url 
Do the most generic string matching with the values we are looking for. Note: we will have lists of keywords per vpn, we can get the ipinfo location while using the vpn by going to their site, and we can find monetization labels within the http requests. EX: if the zip code should be 10001, instead of a regex of \D10001\D, we look for just the string 10001. For every single key we could be looking for, we run it on the http requests gathered. Coallate these possible data stealing requests 
Go through every http request and label, adding to the spreadsheet when necessary

Now to bring both together: 

We have the two spreadsheet documents now. Time to classify 
Potentially in a new spreadsheet, place all http requests that occur in both pp and all http requests first. Then all that only occurred in pp. then all that only occurred in the http requests.
Perform classification

SebastianZimmeck commented 5 months ago

@danielgoldelman, can you reformat and reorder your comment? The order is very hard to follow, there are multiple numbers 1 and 2 after each other, etc.

danielgoldelman commented 5 months ago

@SebastianZimmeck sorry, the original comment was written on GitHub mobile, so formatting was hard to check. Changes made above.

SebastianZimmeck commented 5 months ago

@danielgoldelman and @dadak-dom, please carefully read section 3.2.2 and 3.3 (in particular, 3.3.2) of our paper. We can re-use much of the approach there. I do not think that we need an annotation of the ground truth, but both of you should check the ground truth (for whatever our definition of ground truth is) and come to the same conclusion.

We have to create a testing protocol along the lines of the following:

Select the set of analysis functionality that we are testing and how
- By default all analysis functionalities
- But how are we going to test for keywords, for example? How for email addresses, how for phone numbers, ...?
Pick a set of websites to test
- How many? Probably, 100 to 200. We need some reasonable standard deviation. For example, it is meaningless to test a particular analysis functionality for just one site because a successful test would not allow us to extrapolate and claim that we are successful for, say, 1,000 sites in our crawl set with that functionality. So, we need, say, 10 sites successfully analyzed to make that claim. Can you solidify that? What is the statistical significance of 10 sites? @JoeChampeau can help with the statistics. We should have some statistical power along the lines of "with 95% confidence our analysis of latitude is within the bounds of a 10% error rate" (e.g., if we detect 1,000 sites having a latitude with 95% confidence the real result is between 900 and 1,100 sites).
- Which sites to select? Again, the selected set should allow us to make the claim that if an analysis functionality works properly on the test set, it also works for the large set of sites that we crawl. So, we would need to pick a diverse set of sites covering every analysis functionality for each region that we cover. There should be no bias. For example, there will be no problems for monetization categories because they occur so frequently, but how do we ensure, e.g., that there is a meaningful number of sites that collect latitudes? Maybe, pick map sites from somewhere? How do we pick sites for keywords (assuming we are analyzing keywords)?
- Are we using the same test set of sites for each country/state? Yes, no, is some overlap OK, is it harmful, is it good, ...?
- How are we selecting sites randomly? Use random.org.
- We can't select any sites that we used for preliminary testing, i.e., validation. So, which are the sites, if any, that need to be excluded? If we randomly select an excluded site, how do we pick a new one? Maybe, just the next one on a given list.
Running the test
- Are we testing one site a time or run the complete test set? If we do the former we need to record all site data (and be absolutely sure that there are no errors and nothing omitted in the recording). We need to get both the analysis results and the ground truth data at the same time. The reason is that when we load a site multiple times, there is a good chance that not all trackers and other network connections are identical for both loads. So, the analysis results could diverge from the ground truth if the latter is based on a different load. We need to check the ground truth for the exact site load from which we got the analysis results. The alternative to a complete test set crawl is to do the analysis for one site at a time, i.e., visit a site, record the PP analysis results, use browser developer tools (and other tools, as necessary) to check the ground truth, record the evidence, record the ground truth evidence and result, then analyze the next site and so on. So, we would be doing multiple crawls of one site.
- We will also need to change the VPN for every different location.
- Who is going to run the test? @JoeChampeau has the computer. Is it you, @dadak-dom or @danielgoldelman? Both the PP analysis results and the ground truth should be checked by two people independently. This seems easier if only one test set crawl is done as opposed to the site-by-site approach.
Ground truth analysis
- How do we analyze the ground truth? Per your comment above, @danielgoldelman, I'll take that we do string matching in HTTP messages. Is that a reliable indicator? Maybe, we would also need to look at, say, browser API usage for latitude, i.e., the browser prompting the user to allow location access. What are the criteria to reliably analyze the ground truth? This can be different for our different functionalities.

These question cannot be answered in the abstract. @danielgoldelman and @dadak-dom, please play around with some sites for each analysis functionality and come up with a protocol to analyze it. For each functionality you need to be convinced that you can reliably identify true positives (and exclude false positives and false negatives). In other words, please do some validation tests.

dadak-dom commented 5 months ago

Who is going to run the test?

Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me? I thought it would make sense since that's the computer that we will use to run the actual crawl. That way, we could avoid any potential issues arising when switching between windows and mac. Just a thought.

@SebastianZimmeck , the way I understand it, we will end up with three different site lists for each country (please correct me if I'm wrong)

Validation (what Daniel and I are doing now)
Test set (what we're preparing for and will soon be running)
The actual crawl list. We cannot have any overlap between the validation and the test set, but can the test set (and/or the validation) be derived from the actual crawl list? I would need to know this before I start making any lists for the test set.

SebastianZimmeck commented 5 months ago

Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me?

It certainly makes sense, but that would depend on if @JoeChampeau has time as the task was originally @danielgoldelman's. (Given our slow speed, the point may more or less resolve itself since we will be all back on campus soon anyways.)

(please correct me if I'm wrong)

All correct.

but can the test set (and/or the validation) be derived from the actual crawl list?

Yes, the validation and test set can be derived from the crawl list.

dadak-dom commented 5 months ago

I have added my proposed crawl testing lists to the branch connected with this issue (issue-9). Here was my procedure:

For each country that we will crawl, create a new .csv file.
Go to random.org and have it generate a list of random integers from 1-525
Take the first six integers and find the matching URL from the general list
Regenerate the random integers and find the six matching URLs from the country specific list
If there seems to be a bias for one functionality, throw the list out and try again. (Or, if there is any overlap with sites that were used for validation. Luckily, this was never the case for me)
Repeat the process for each location we will crawl, so ten times total.

With point 5 I tried my best to include a fair share of sites that take locations, as monetization was easy to come by. @SebastianZimmeck let me know if any changes need to be made.

SebastianZimmeck commented 5 months ago

OK, sounds good!

So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.

With point 5 I tried my best to include a fair share of sites that take locations

How did you make the guess that a site takes locations?

dadak-dom commented 5 months ago

So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.

Yes, 120 sites total.

How did you make the guess that a site takes locations?

A couple of ways, e.g. visiting the site and seeing if it requests the location from the browser, or if PP detects a location, or if I know from my own browsing that the site would take locations.

SebastianZimmeck commented 5 months ago

OK, sounds good!

Feel free to go ahead with that test set then. As we discussed yesterday, maybe the performance is good. Otherwise we call that set a validation set, pick a new test set, and repeat the test (after fixing any shortcomings with the crawler and/or extension).

One important point, the PP analysis needs to be set up exactly as it would be in the real crawl, i.e., with VPN, a crawl not just the extension. Though, it does not need to be on the crawl computer.

dadak-dom commented 5 months ago

One more thing: I noticed this morning that there are a lot of sites in the general list that redirect to sites that are already on the list. Can't believe I didn't catch that sooner, so I'll fix that ASAP. Just to be safe, I'll also redo the general list part of the test set.

SebastianZimmeck commented 5 months ago

Great!

dadak-dom commented 5 months ago

@SebastianZimmeck I'm compiling the first round of test data, but so far I'm not getting as many location requests found as I'd like. You mention in one of the comments above that it might be worthwhile to make a list of, say, map sites. If I were to make a test list of sites with the clear intention of finding location requests, how can I make it random? Would it be valid to find, for example, a list of 200 map sites (not necessarily from the lists that we have), and pick randomly from that? If not, what are some valid strategies?

SebastianZimmeck commented 5 months ago

what are some valid strategies?

Just map sites would probably be too narrow of a category. There may be techniques that are map site-specific. In that case our test set would only claim that we are good at identifying locations on map sites. So, we need more categories of sites, ideally, all categories of sites that typically get people's location.

Here is a starting point: Can you give some examples of websites that use geolocation to target local customers? So, the categories mentioned there, plus map sites, plus any other category of site that you found in your tests that collect location data. Maybe, there are generic lists (Tranco, BuiltWith, ...) that have categories of sites. Compile a list out of those and then randomly pick from them. That may be an option, but maybe you have a better idea.

So, maybe our test set is comprised of two parts:

Location test set
Monetization and Tracking test set

Maybe, it even has three parts if tracking pixel, browser fingerprinting, and/or IP address collection (the Tracking categories) are also rare. Then, we would also need to do a more intricate test set construction for the Tracking categories as well. I would expect no shortage of sites with Monetization.

There are no hard rules for testing. The overall question is:

What test would convince you that the crawl results are correct? (as to lat/lon, IP address, ... )

What arguments could someone make if they wanted to punch a hole in our claim that the analysis results demonstrate our crawl results are correct? Some I can think of: too small test set, not enough breadth in the test set, i.e., not covering all the techniques that we use or types of sites, sites not randomly selected, i.e., biased towards sites we know work ... (maybe there are more).

I would think we need at least 100 sites in the test set overall and generally not less than 10 sites for each practice we detect (lat/lon, tracking pixel, ...). Anything less has likely not enough statistical power and would not convince me.

dadak-dom commented 5 months ago

I've just added the lists and data that we can use for the first go at testing. A couple things to note:

I managed to get a set where PP detected at least 10 of nearly every analysis functionality we were looking for, except for Zip Code and Lat/Long. My theory is that using the VPN makes it harder for sites to take this information, and so there's no requests with this information for PP to find. Of course, this will only be verified after testing fully, but I wanted to raise the possibility that these two analysis functions may not be possible with the setup we are going with. Just from the amount of sites, it's strange that none of them took lat/long, and yet many took region and city. I also did a quick test where I found a site I knew would take lat/long or zip code, and visited it without a vpn to make sure PP found those things. I then connected to the same site with a VPN, and PP wouldn't find lat/long or zip, but it still found Region and City. The good news is that Region and City seem to pop up quite a bit, so I believe we should have no problem testing for them.
For documentation, here was my procedure for generating the lists:

I had two pools of sites: one was a mixture of the top sites of different categories, as well as sites that I had encountered in my personal browsing. The other set was the top 100 sites from the builtwith list.
For each list, generate a set of random integers. Use the corresponding row number for the site that you'll select
Take six sites from the mixture, and six sites from the pre-compiled list, for each country that we crawl.
In theory, you should have 12 sites. However, a few of them were bound to crash, and so as long as fewer than two crashed for a given crawl, I thought it was ok, since we'll still have over 100 sites. So you should have 10-12 sites per country/state.
If you generate a random integer that corresponds to a URL that was already taken, use the next available URL.

When crawling, I made sure that I was connected to the corresponding VPN for each list, i.e. when crawling using the South Africa list, I was connected to South Africa.

SebastianZimmeck commented 5 months ago

Good progress, @dadak-dom!

I then connected to the same site with a VPN, and PP wouldn't find lat/long or zip, but it still found Region and City.

Not having lat/long would be substantial. Can you try playing around with the Mullvad VPN settings?

Can you try allowing as much as possible? Our goal is to have the sites trigger as most as possible of their tracking functionality.

Also, while I assume that the issue is not related to Firefox settings since you get lat/long with presumably the same settings in Firefox with VPN and Firefox without VPN, we should also have the Firefox settings as allowing as much as possible.

Maybe, also try a different VPN. What happens with the Wesleyan VPN, for example?

The bottom line: Try to think of ways to get the lat/long to show up.

dadak-dom commented 5 months ago

I messed around with the settings for both Firefox Nightly and Mullvad, no luck there.

I've tried crawling and regularly browsing with both Mullvad and the Wesleyan VPN. I was able to get Wesleyan VPN to show coarse location when browsing, but not when crawling. Under Mullvad, coarse/fine location never shows up.

However, when trying to figure this out, I noticed something that may be of interest. Per the Privacy Pioneer readme, the location value that PP uses to look for lat/long in HTTP requests is taken from the Geolocation API. Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN. Maybe something strange is going on with my machine, so to check what I did, I encourage anyone to try the following:

Without a VPN connection, visit any website.
Paste in the following code into your developer console:

const options = { enableHighAccuracy: true, timeout: 5000, maximumAge: 0, };

function success(pos) { const crd = pos.coords;

console.log("Your current position is:"); console.log(Latitude : ${crd.latitude}); console.log(Longitude: ${crd.longitude}); console.log(More or less ${crd.accuracy} meters.); }

function error(err) { console.warn(ERROR(${err.code}): ${err.message}); }

navigator.geolocation.getCurrentPosition(success, error, options);

Compare this value to what ipinfo.io gives you by visiting ipinfo.io (without a VPN, they should be roughly the same)
Now do steps 2 and 3 while connected to a VPN in a different country

When I do these steps, I end up with a different value for ipinfo, but the value from geolocation API stays the same (the above code should be set to not use a cached position, i.e. maximiumAge: 0) I then looked at location evidence that PP collected for crawls I did when connected to other countries. Sure enough, PP would find the region and city, because that info is provided by ipinfo. However, PP would miss the lat/long that was in the same request, most likely because the geolocation API is feeding it a different value, and so PP is looking for something else.

However, this doesn't explain why PP doesn't generate entries for coarse and fine location when crawling without a VPN. From looking at the ground truth of some small test crawls, there clearly are latitudes and longitudes of the user being sent, but for some reason PP doesn't flag them. @danielgoldelman , maybe you have some idea as to what is going on? This doesn't seem to be a VPN issue as I initially thought.

danielgoldelman commented 5 months ago

Interesting. I was having different experiences, @dadak-dom ... lat and lng seemed to be accurately obtained when performing the crawls before. Have you modified the .ext file?

dadak-dom commented 5 months ago

No, I didn't make any changes to the .ext file, @danielgoldelman . Was I supposed to?

SebastianZimmeck commented 5 months ago

Good progress, @dadak-dom!

Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN.

When I do these steps, I end up with a different value for ipinfo, but the value from geolocation API stays the same

Hm, is this even a larger issue not related to the VPN? In other words, even in the non-VPN scenario do we have a bug that the location is not properly updated? This is the first point we should check. (Maybe, going to a cafe or other place with WiFi can be used to get a second location to test.)

What is not clear to me is that when we crawled with different VPN locations for constructing our training/validation/test set, we got instances of all location types. So, I am not sure what has changed since then.

@danielgoldelman, can you look into that?

dadak-dom commented 5 months ago

I forgot to use the hashtag in my most recent commit, but @danielgoldelman and I seem to have solved the lat/long issue. Apparently the browser that selenium created did not have the geo.provider.network.url preference set, and so the extension wasn't able to evaluate a lat or long when crawling. My most recent commit to issue-9 should fix this, but this should be applied to the main crawler as well. Hopefully, this means that we can get started with gathering test data and testing.

danielgoldelman commented 5 months ago

Additionally, we have run the extension as if we were the computer, and compared our results for lat/lng with what we would expect the crawl to reasonably find. This approach worked! We used the preliminary validation set we designated earlier on, so this claim should be supported via further testing when we perform the performance metric crawl, but on first approach the crawl is working as intended for lat/lng.

SebastianZimmeck commented 5 months ago

Great! Once you think the crawler and analysis works as expected, feel free to move to the test set.

danielgoldelman commented 5 months ago

Rebased the main branch of PP into the crawler branch of PP to ensure that we are working with the correct extension when we perform the crawls

SebastianZimmeck commented 5 months ago

@danielgoldelman will come up with the testing protocol and together with @dadak-dom and @natelevinson10 (and possibly @JoeChampeau) perform the test.

JoeChampeau commented 3 months ago

@danielgoldelman and I are resolving the final issues with the crawler right now. Once we're done, @danielgoldelman will run the test crawl using the test lists and then we'll have those results.

dadak-dom commented 2 months ago

Finally, some good news regarding testing. As I brought up in last week's discussion, there was an issue where zip codes were not being identified properly when crawling. I've done some work on that front, and that's been fixed. It looks like Privacy Pioneer is finally working on the crawler (🥳). On another note, I re-worked the test list, as well as the methodology to create it, so I'll document that process here.

To create the 100 site test list, I first gathered a list of 10 different APIs that seem to take location data, whether via my own browsing or through their descriptions on BuiltWith. Then, per Kate's suggestion, I used the lists of live sites that BuiltWith claims use said APIs to construct the test list. I did this by randomly selecting 10 sites from each list, using random.org. If a randomly selected site was a redirect / faulty in some way, I used the next randomly generated number. I chose 10 different APIs so that we weren't biasing too hard towards any one format of HTTP request. Additionally, I wouldn't be too concerned over bias because (a) not every site chosen from the lists uses the API in the same way, and (b) some of the sites chosen use more than one API for geolocation. Ultimately, in order to get enough examples of location data being taken, some bias needed to occur (past experience tells us that simply selecting random sites tends to not yield that much location evidence), though I believe that the steps I've taken have found a good compromise.

SebastianZimmeck commented 2 months ago

Excellent, @dadak-dom! Let's discuss the details tomorrow.

SebastianZimmeck commented 1 month ago

A few thoughts on how we test the performance of Privacy Pioneer's analysis when used in a web crawl and with a cloud VM.

The starting point is that we know Privacy Pioneer's analysis performance when used without a crawler and without a cloud VM. We measured that in our paper. Now, the two elements --- crawler and cloud VM --- can introduce errors. How can we measure that? To which extent should we measure that?

As I see it at the moment, there are two points:

We can analyze Privacy Pioneer's performance with crawler and cloud VM just in the normal way. In other words, we can connect to a cloud VM, run the crawl, capture the analysis results, capture all data that Privacy Pioneer had available, and compare the analysis results against the results of just Privacy Pioneer running on the data. There needs to be some manual checking, but that should be ultimately easy to do. This is what @danielgoldelman and @dadak-dom were describing.
The harder question, about which I am not sure, is to which extent we should also measure Privacy Pioneer with crawler and cloud VM against Privacy Pioneer without those, i.e., in both instances Privacy Pioneer will necessarily need to run on different datasets necessarily causing different results. At the moment, I am leaning towards doing such analysis because we know from @danielgoldelman's findings that there are differences in the data when using a crawler and cloud VM. So, if we claim that we capture a realistic situation a user would encounter in a particular country, we would need to say that with cloud VM and crawl is more or less equal to without cloud VM and crawl.

On the second point, I also find it plausible to say that with cloud VM and crawl the situation is a lower bound. That is at least what people are exposed to in real life. But this assumes that crawl and cloud VM leads to just less data as opposed to incorrect data. So, that speaks for doing the second analysis to identify incorrect results.

Given that, the issue for the second analysis becomes that every run of the test set is different. So, a site may load different ad networks or load them at different times, for example. Thus, comparing two different runs will often result in two different result sets even when no error occurs. I am not sure to which extent we can distinguish incorrect from just fewer results (e.g., cloud VM introduced wrong location is probably more incorrect than just omitted).

One suggestion could be:

Run Privacy Pioneer on the test set without cloud VM and crawler and capture the results
Run Privacy Pioneer on the test set without cloud VM and crawler and capture the results a second time. This second run will give a sense of the delta between two different runs (there could be even a third, fourth, etc. run; the more runs, the clearer we understand the variance between different runs)
Run Privacy Pioneer on the test set with cloud VM and crawler with cloud VM from our real location and capture the results and compare those against the previous runs without cloud VM and crawler (possibly, also do this run multiple times). In essence, does this run look similar in terms of variance of error rates compared to the previous runs without crawl and cloud VM? If so, we are good.

Having fewer analysis results as a matter of having less data to analyze is not an issue if we say we are aiming for a lower bound. However, having incorrect analysis results would be problematic.

Also, I think we would not need to test (extensively) for different cloud VMs (i.e., run Privacy Pioneer on the test set with cloud VM does not necessarily mean run on every country/state cloud VM). If we have reason to believe that it works for one cloud VM location, from a systematic viewpoint, there is no reason to mistrust the other cloud VM locations as they follow the same systematic. This would only leave open the possibility that there is a location-specific problem with one particular cloud VM. I think that would be proper to state as an assumption. We could also try running on different cloud VM locations to see how it goes, though.

Any thoughts? Is this thinking correct? Other ideas?

Edit: Changed "VPN" to "cloud VM"

dadak-dom commented 1 month ago

You raise some good points, @SebastianZimmeck . Nate, Daniel, and I discussed this a little, and we had some thoughts that we'd like your opinion on. From your comment, it seems like there's two approaches that make the most sense to me.

The first option is essentially what you were describing: We take the test set that I constructed and perform a crawl both on the Cloud VM as well as locally. We then shut down the VM, turn it on again, and run the test list again, performing the crawl 3 times for each machine (local and virtual). By repeating the crawl, we mitigate the chances of anomalies in our test data that occurred (site temporarily down, etc.). Performing these test crawls would give us insight into how the VM infrastructure affects the evidence that Privacy Pioneer finds.

The second option would also entail 3 test crawls, but instead of crawling locally, someone manually clicks through the test list (preferably a subset, since this would take much longer than option one). Both the actual crawl and the manual "crawl" would be performed on the VM. This would give us insight into the effect that the crawler infrastructure has on the evidence that Privacy Pioneer finds. Alternatively, we could perform the manual crawl locally and verify both points (effect of VM, effect of crawl) at once. Once we gather this data, we should have a better idea of whether or not crawling is a good idea. I could take the lead on the manual crawl if that's the route that we go with, asking for help as needed.

What are your thoughts on this, @SebastianZimmeck ? (or anyone?)

SebastianZimmeck commented 1 month ago

Essentially we need to compare:

Privacy Pioneer vs Privacy Pioneer + VM + Crawl infrastructure

So, there are two error sources:

VM
Crawl infrastructure

It certainly would be more systematic to evaluate each source on its own (e.g., to pinpoint error rates to the sources).

Your second option, @dadak-dom, is going in the systematic direction. Here is a variation:

Manual crawl without VM vs manual crawl on VM -> evaluates impact of crawl component (i.e., manual crawl == normal Privacy Pioneer usage)
Real crawl locally vs real crawl on VM -> evaluates impact of VM

Would that make sense? Maybe, do each three times to account for normal fluctuation of loaded site elements.

Generally, what you are saying makes a lot of sense to me, @dadak-dom.

dadak-dom commented 1 month ago

Here is a variation:

Your suggestion makes sense to me, @SebastianZimmeck . If you think it's alright, I'll start collecting the testing data as soon as possible 👍

SebastianZimmeck commented 1 month ago

Sounds good! Feel free to go ahead, @dadak-dom! And we can also discuss further in our meeting this week.

SebastianZimmeck commented 1 month ago

As discussed, @dadak-dom will begin to evaluate Privacy Pioneer vs Privacy Pioneer VM.

dadak-dom commented 1 week ago

I have essentially finished the comparisons between Privacy Pioneer and Privacy Pioneer with a VM. Along with my results, I'm going to detail how I went about this analysis, and my reasoning for setting it up in this way. I think this will help get @atlasharry up to speed, and also hopefully will clear up any confusion that anyone may have.

Goals

With this analysis, I set out to uncover whether or not introducing a VM into our web crawl has a significant impact on any data that we would collect, as brought up here. Ideally, we would want to see that, for every category, the number of requests flagged by Privacy Pioneer on a VM is either the same or lower than what would be discovered via the control group (no VM and no crawler).

Setup / Procedure

Based on the goals outlined above, I created a relatively simple, although time-consuming, procedure. From a bird's eye view, I essentially posed as a regular user, manually clicking through the 100-site test list I had created earlier. Here's the process in a little more detail: (Preliminary steps: Make sure that you have cloned into both the crawler and the extension repo. Also make sure that you set the boolean flags that put Privacy Pioneer into crawl mode to true. This way, we can overwrite both the incorrect data that IPinfo sends, as well as the lat/long coordinates on the cloud)

Identify a local machine to use (i.e., my laptop).
Boot up the VM from the location that we are testing against.
Both locally and on the cloud, launch two instances of the terminal
In the first instance of the terminal, launch Privacy Pioneer in development mode, as described here
In the second instance of the terminal, launch the rest-api.
Now that Privacy Pioneer has been opened with a new instance of Firefox, we can input the location data that we need to overwrite. For the local machine, just use your real location. For the VM, look up your coordinates and zip code based on the city that the VM is based in.
Now, we can start "crawling". Copy-paste the first URL from the test list (both locally and virtually).
Wait until the URL is printed by the terminal running the rest-api, which should take around a minute. This indicates that data has been recorded, and we're ready to move on to the next site.
Repeat steps 7-9 until the entire list has been "crawled".
In case of any crashes, delete any data gathered from the last site, restart Privacy Pioneer and the rest-api, and start crawling again from where you left off.

After running this process three times, we should have plenty of data for each of the 100 sites. However, since Privacy Pioneer records data in JSON format, I needed a way to meaningfully import the data I collected into a spreadsheet program. Thus, I created a short Python script that would take in all of the individual JSON entries and create a CSV file detailing how many pieces of evidence were acquired for each category for every site. Here's an example.

Results

Since this is essentially an experiment involving matched pairs, I thought it would make sense to analyze my findings using a paired t-test. I would first find the average number of requests per category per website. In other words, how many advertising/ipAddress/whatever requests would I expect from example.com when "crawling" locally? How about on a VM? Then, I would run a paired t-test, treating the local data as a control group and the VM data as the treatment group to find a p-value for every category. After our last discussion, I also decided to run the experiment in two additional locations (Columbus and Iowa). Here are the overall results, with a link to the entire Google Sheet if anyone is interested.

I have marked in green and red any stastically significant results. Based on this data, it seems like the only categories with any major differences to speak of are coarseLocation, fineLocation, region, and zipCode. For the first three elements, I think that these results are acceptable, since they indicate that there is statistical evidence for less data being generated on the cloud. It's not ideal, but we can work with it. The more interesting point, though, is zipCode, which seems to come up more in the cloud for 2/3 locations. This may seem strange, but I have some possible explanations for it.

Since nearly all instances of location gathering are done through the IP, it's not unlikely that this is a shortcoming of where I'm crawling from, rather than Privacy Pioneer underperforming. I've noticed for a while now that, when I'm connected to the campus wifi, the zip code can switch between two different values. I've looked at the data that passed through Privacy Pioneer, and it seems like there's no real difference in the amount of zip codes being taken; what does change, however, is the zip code that websites are taking. Also, notice the averages in the categories for coarseLocation and zipCode. Virtually, the average number of requests per site for these two categories doesn't seem to differ by more than +/- 0.05. Locally, however, they differ by 0.2. This (and also my own experience of inspecting these requests) indicates that requests containing coordinate data oftentimes also include zip codes. These results, however, support my hypothesis of zip code fluctuation, since this would mean that certain instances of zip codes would go unnoticed, while coordinate data would still get uncovered, leading to the current results.

To try and remedy this, I could perform the local crawl again, this time from a different location with a static IP. I could try my local library, but I guess that depends on what you think I should do next, @SebastianZimmeck.

Overall, does this make sense? Am I going in the right direction? Is this like what you had in mind, @SebastianZimmeck? Would this convince you (or anyone else) that collecting this data on a VM is valid?

SebastianZimmeck commented 1 week ago

Nice work, @dadak-dom!

As discussed, if you go a bit deeper, we will learn more about any inconsistent results.

dadak-dom commented 1 week ago

I have been looking into what exactly causes the inconsistencies between runs on a VM (without a crawler), and here's what I've found. Generally, there seem to be two main causes for why a certain run would have a different number of requests. The simplest explanation, which I've confirmed to be the case for at least a portion of the inconsistencies, is that different site loads result in different requests being sent. This is especially common with requests in the monetization category. Sometimes, I would be able to see that the request was supposed to come in, but something on the website's end would go wrong, resulting in the request erroring out. Additionally, certain requests would be counted in more than one category, so one request not showing up could potentially affect other categories as well. I think it's safe to say that for these types of inconsistencies, there's not much that we can do.

However, the location category seemed to have an additional inconsistency. I traced Privacy Pioneer's output in the developer console and I found that, for certain geolocation APIs, the ML model would return inconsistent results, e.g., on one site load a request wouldn't get flagged, while a second load would result in region and zip code getting flagged. This would happen even though the request snippets appeared identical. I'm not very familiar with the setup of the ML, so I'm not sure whether this behavior is expected or not. Could it be due to this known issue? I'd imagine it doesn't help.

So, I've identified two sources of inconsistency when comparing one VM site visit to another. However, I'm a little confused about where I'm supposed to go from here. I'm not sure if either of these issues can really be "fixed" in a realistic time frame. Should I try to calculate the precision and recall for VM vs. Non VM to gauge whether or not there's a performance difference? Maybe just for the VM? Maybe there's a different direction I should take? Any input would be appreciated.

Edit: For the first point, I meant that I am confident that this is a site issue.

dadak-dom commented 1 week ago

Update: It looks like my initial thoughts regarding monetization / tracking were incorrect... I've been able to find instances where Privacy Pioneer failed to flag a request. I'm currently looking into diagnosing the issue with the extension. My first thoughts are that this is related to how Privacy Pioneer takes in the requests to begin with, but I'll update more as I get more answers.

SebastianZimmeck commented 1 week ago

OK, good that you confirmed.

SebastianZimmeck commented 5 days ago

@dadak-dom found that the likely reason for the discrepancy in location and other analysis results is the exclusion of a resource type in the Privacy Pioneer analysis. Per @dadak-dom's PR:

Turns out that our HTTP request listener was filtering out requests that were initiated using the Beacon API (as opposed to XHR or Fetch). Certain requests would occasionally use Fetch on one run, but Beacon on the other, and so the extension would miss the latter requests completely.

@dadak-dom will do remaining testing to be sure and then add a small explanation in the Known Issues section of the Privacy Pioneer readme that can link to a more detailed explanation here.

privacy-tech-lab / privacy-pioneer-web-crawler