SebastianZimmeck opened 6 months ago
Where are we with the testing protocol, @danielgoldelman?
Preliminary testing protocol
For PP data:
For all HTTP request data:
Now to bring both together:
@danielgoldelman, can you reformat and reorder your comment? The order is very hard to follow, there are multiple numbers 1 and 2 after each other, etc.
@SebastianZimmeck sorry, the original comment was written on GitHub mobile, so formatting was hard to check. Changes made above.
@danielgoldelman and @dadak-dom, please carefully read section 3.2.2 and 3.3 (in particular, 3.3.2) of our paper. We can re-use much of the approach there. I do not think that we need an annotation of the ground truth, but both of you should check the ground truth (for whatever our definition of ground truth is) and come to the same conclusion.
We have to create a testing protocol along the lines of the following:
Select the set of analysis functionality that we are testing and how
Pick a set of websites to test
Running the test
Ground truth analysis
These questions cannot be answered in the abstract. @danielgoldelman and @dadak-dom, please play around with some sites for each analysis functionality and come up with a protocol to analyze it. For each functionality you need to be convinced that you can reliably identify true positives (and exclude false positives and false negatives). In other words, please do some validation tests.
Who is going to run the test?
Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me? I thought it would make sense since that's the computer that we will use to run the actual crawl. That way, we could avoid any potential issues arising when switching between Windows and Mac. Just a thought.
@SebastianZimmeck , the way I understand it, we will end up with three different site lists for each country (please correct me if I'm wrong)
Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me?
It certainly makes sense, but that would depend on if @JoeChampeau has time as the task was originally @danielgoldelman's. (Given our slow speed, the point may more or less resolve itself since we will be all back on campus soon anyways.)
(please correct me if I'm wrong)
All correct.
but can the test set (and/or the validation) be derived from the actual crawl list?
Yes, the validation and test set can be derived from the crawl list.
I have added my proposed crawl testing lists to the branch connected with this issue (issue-9). Here was my procedure:
With point 5 I tried my best to include a fair share of sites that take locations, as monetization was easy to come by. @SebastianZimmeck let me know if any changes need to be made.
OK, sounds good!
So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.
With point 5 I tried my best to include a fair share of sites that take locations
How did you make the guess that a site takes locations?
So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.
Yes, 120 sites total.
How did you make the guess that a site takes locations?
A couple of ways, e.g. visiting the site and seeing if it requests the location from the browser, or if PP detects a location, or if I know from my own browsing that the site would take locations.
OK, sounds good!
Feel free to go ahead with that test set then. As we discussed yesterday, maybe the performance is good. Otherwise we call that set a validation set, pick a new test set, and repeat the test (after fixing any shortcomings with the crawler and/or extension).
One important point: the PP analysis needs to be set up exactly as it would be in the real crawl, i.e., with a VPN and a crawl, not just the extension. Though, it does not need to be on the crawl computer.
One more thing: I noticed this morning that there are a lot of sites in the general list that redirect to sites that are already on the list. Can't believe I didn't catch that sooner, so I'll fix that ASAP. Just to be safe, I'll also redo the general list part of the test set.
Great!
@SebastianZimmeck I'm compiling the first round of test data, but so far I'm not getting as many location requests found as I'd like. You mention in one of the comments above that it might be worthwhile to make a list of, say, map sites. If I were to make a test list of sites with the clear intention of finding location requests, how can I make it random? Would it be valid to find, for example, a list of 200 map sites (not necessarily from the lists that we have), and pick randomly from that? If not, what are some valid strategies?
what are some valid strategies?
Just map sites would probably be too narrow of a category. There may be techniques that are map site-specific. In that case our test set would only claim that we are good at identifying locations on map sites. So, we need more categories of sites, ideally, all categories of sites that typically get people's location.
Here is a starting point: Can you give some examples of websites that use geolocation to target local customers? So, the categories mentioned there, plus map sites, plus any other category of site that you found in your tests that collect location data. Maybe, there are generic lists (Tranco, BuiltWith, ...) that have categories of sites. Compile a list out of those and then randomly pick from them. That may be an option, but maybe you have a better idea.
So, maybe our test set is comprised of two parts:
Maybe, it even has three parts if tracking pixel, browser fingerprinting, and/or IP address collection (the Tracking categories) are also rare. Then, we would also need to do a more intricate test set construction for the Tracking categories as well. I would expect no shortage of sites with Monetization.
There are no hard rules for testing. The overall question is:
What test would convince you that the crawl results are correct? (as to lat/lon, IP address, ... )
What arguments could someone make if they wanted to punch a hole in our claim that the analysis results demonstrate our crawl results are correct? Some I can think of: too small test set, not enough breadth in the test set, i.e., not covering all the techniques that we use or types of sites, sites not randomly selected, i.e., biased towards sites we know work ... (maybe there are more).
I would think we need at least 100 sites in the test set overall and generally not less than 10 sites for each practice we detect (lat/lon, tracking pixel, ...). Anything less has likely not enough statistical power and would not convince me.
I've just added the lists and data that we can use for the first go at testing. A couple things to note:
When crawling, I made sure that I was connected to the corresponding VPN for each list, i.e. when crawling using the South Africa list, I was connected to South Africa.
Good progress, @dadak-dom!
I then connected to the same site with a VPN, and PP wouldn't find lat/long or zip, but it still found Region and City.
Not having lat/long would be substantial. Can you try playing around with the Mullvad VPN settings?
Can you try allowing as much as possible? Our goal is to have the sites trigger as much of their tracking functionality as possible.
Also, while I assume that the issue is not related to Firefox settings since you get lat/long with presumably the same settings in Firefox with VPN and Firefox without VPN, we should also have the Firefox settings as allowing as much as possible.
Maybe, also try a different VPN. What happens with the Wesleyan VPN, for example?
The bottom line: Try to think of ways to get the lat/long to show up.
I messed around with the settings for both Firefox Nightly and Mullvad, no luck there.
I've tried crawling and regularly browsing with both Mullvad and the Wesleyan VPN. I was able to get Wesleyan VPN to show coarse location when browsing, but not when crawling. Under Mullvad, coarse/fine location never shows up.
However, when trying to figure this out, I noticed something that may be of interest. Per the Privacy Pioneer readme, the location value that PP uses to look for lat/long in HTTP requests is taken from the Geolocation API. Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN. Maybe something strange is going on with my machine, so to check what I did, I encourage anyone to try the following:
```js
const options = {
  enableHighAccuracy: true,
  timeout: 5000,
  maximumAge: 0,
};

function success(pos) {
  const crd = pos.coords;
  console.log("Your current position is:");
  console.log(`Latitude : ${crd.latitude}`);
  console.log(`Longitude: ${crd.longitude}`);
  console.log(`More or less ${crd.accuracy} meters.`);
}

function error(err) {
  console.warn(`ERROR(${err.code}): ${err.message}`);
}

navigator.geolocation.getCurrentPosition(success, error, options);
```
When I do these steps, I end up with a different value for ipinfo, but the value from the Geolocation API stays the same (the above code is set to not use a cached position, i.e., maximumAge: 0). I then looked at location evidence that PP collected for crawls I did when connected to other countries. Sure enough, PP would find the region and city, because that info is provided by ipinfo. However, PP would miss the lat/long that was in the same request, most likely because the Geolocation API is feeding it a different value, and so PP is looking for something else.
However, this doesn't explain why PP doesn't generate entries for coarse and fine location when crawling without a VPN. From looking at the ground truth of some small test crawls, there clearly are latitudes and longitudes of the user being sent, but for some reason PP doesn't flag them. @danielgoldelman , maybe you have some idea as to what is going on? This doesn't seem to be a VPN issue as I initially thought.
Interesting. I was having different experiences, @dadak-dom ... lat and lng seemed to be accurately obtained when performing the crawls before. Have you modified the .ext file?
No, I didn't make any changes to the .ext file, @danielgoldelman . Was I supposed to?
Good progress, @dadak-dom!
Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN.
When I do these steps, I end up with a different value for ipinfo, but the value from geolocation API stays the same
Hm, is this even a larger issue not related to the VPN? In other words, even in the non-VPN scenario do we have a bug that the location is not properly updated? This is the first point we should check. (Maybe, going to a cafe or other place with WiFi can be used to get a second location to test.)
What is not clear to me is that when we crawled with different VPN locations for constructing our training/validation/test set, we got instances of all location types. So, I am not sure what has changed since then.
@danielgoldelman, can you look into that?
I forgot to use the hashtag in my most recent commit, but @danielgoldelman and I seem to have solved the lat/long issue. Apparently the browser that Selenium created did not have the `geo.provider.network.url` preference set, and so the extension wasn't able to evaluate a lat or long when crawling. My most recent commit to issue-9 should fix this, but the fix should be applied to the main crawler as well. Hopefully, this means that we can get started with gathering test data and testing.
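As a sketch of what the fix amounts to (the coordinates and the exact JSON shape below are illustrative assumptions, not the values from the commit): Firefox's `geo.provider.network.url` preference can be pointed at a `data:` URL so that the Geolocation API inside the crawled browser returns a fixed, known location instead of querying a network provider.

```python
# Sketch: pin Firefox's Geolocation API to fixed coordinates via the
# geo.provider.network.url preference. The coordinates are hypothetical
# placeholders; the crawler would substitute the location it wants the
# browser to report (e.g., matching the VM's region).
MOCK_LAT, MOCK_LNG = 41.5566, -72.6652

geo_pref = {
    "geo.provider.network.url": (
        'data:application/json,{"location": '
        '{"lat": %s, "lng": %s}, "accuracy": 100.0}' % (MOCK_LAT, MOCK_LNG)
    )
}

# With Selenium, each key/value pair would be applied via
# FirefoxOptions.set_preference(name, value) before the driver starts.
```

Without this preference, the browser falls back to its default location provider, which is consistent with the extension never seeing the expected lat/long during crawls.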
Additionally, we have run the extension as if we were the crawl computer and compared our results for lat/lng with what we would expect the crawl to reasonably find. This approach worked! We used the preliminary validation set we designated earlier on, so this claim should be supported via further testing when we perform the performance metric crawl, but on first approach the crawl is working as intended for lat/lng.
Great! Once you think the crawler and analysis works as expected, feel free to move to the test set.
Rebased the main branch of PP into the crawler branch of PP to ensure that we are working with the correct extension when we perform the crawls
@danielgoldelman will come up with the testing protocol and together with @dadak-dom and @natelevinson10 (and possibly @JoeChampeau) perform the test.
@danielgoldelman and I are resolving the final issues with the crawler right now. Once we're done, @danielgoldelman will run the test crawl using the test lists and then we'll have those results.
Finally, some good news regarding testing. As I brought up in last week's discussion, there was an issue where zip codes were not being identified properly when crawling. I've done some work on that front, and that's been fixed. It looks like Privacy Pioneer is finally working on the crawler (🥳). On another note, I re-worked the test list, as well as the methodology to create it, so I'll document that process here.
To create the 100-site test list, I first gathered a list of 10 different APIs that seem to take location data, whether via my own browsing or through their descriptions on BuiltWith. Then, per Kate's suggestion, I used the lists of live sites that BuiltWith claims use said APIs to construct the test list. I did this by randomly selecting 10 sites from each list, using random.org. If a randomly selected site was a redirect / faulty in some way, I used the next randomly generated number.

I chose 10 different APIs so that we weren't biasing too hard towards any one format of HTTP request. Additionally, I wouldn't be too concerned over bias because (a) not every site chosen from the lists uses the API in the same way, and (b) some of the sites chosen use more than one API for geolocation. Ultimately, in order to get enough examples of location data being taken, some bias needed to occur (past experience tells us that simply selecting random sites tends to not yield that much location evidence), though I believe that the steps I've taken have found a good compromise.
Here are the lists in question, each one accessed on April 24, 2024:
https://trends.builtwith.com/websitelist/AB-Tasty
https://trends.builtwith.com/websitelist/Intellimize
https://trends.builtwith.com/websitelist/Dynamic-Yield
https://trends.builtwith.com/websitelist/Permutive
https://trends.builtwith.com/websitelist/securiti
https://trends.builtwith.com/websitelist/IPinfo
https://trends.builtwith.com/websitelist/ipdata
https://trends.builtwith.com/websitelist/IP-API
https://trends.builtwith.com/websitelist/Ipregistry
https://trends.builtwith.com/websitelist/Rebuy
(I will also put up the downloads of the lists onto the Google Drive.)
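The sampling step described above might be sketched like this (the list contents, the `is_faulty` check, and the seed are placeholders; the actual picks were made with random.org rather than Python's RNG):

```python
import random

def pick_sites(site_list, n=10, is_faulty=lambda s: False, seed=None):
    """Randomly pick n non-faulty sites from one BuiltWith list.

    Mirrors the manual procedure: draw sites in random order and, if a
    drawn site is a redirect or otherwise faulty, move on to the next draw.
    """
    rng = random.Random(seed)
    candidates = list(site_list)
    rng.shuffle(candidates)  # equivalent to drawing fresh random indices
    chosen = []
    for site in candidates:
        if is_faulty(site):
            continue
        chosen.append(site)
        if len(chosen) == n:
            break
    return chosen

# Hypothetical usage: one list per API, 10 picks each -> 100-site test list.
api_lists = {"IPinfo": [f"site{i}.example" for i in range(50)]}
test_list = [s for name in api_lists for s in pick_sites(api_lists[name], seed=1)]
```

Drawing without replacement per list keeps the per-API quota at exactly 10 while skipping dead or redirecting entries.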
Excellent, @dadak-dom! Let's discuss the details tomorrow.
A few thoughts on how we test the performance of Privacy Pioneer's analysis when used in a web crawl and with a cloud VM.
The starting point is that we know Privacy Pioneer's analysis performance when used without a crawler and without a cloud VM. We measured that in our paper. Now, the two elements --- crawler and cloud VM --- can introduce errors. How can we measure that? To which extent should we measure that?
As I see it at the moment, there are two points:
On the second point, I also find it plausible to say that with cloud VM and crawl the situation is a lower bound. That is at least what people are exposed to in real life. But this assumes that crawl and cloud VM leads to just less data as opposed to incorrect data. So, that speaks for doing the second analysis to identify incorrect results.
Given that, the issue for the second analysis becomes that every run of the test set is different. So, a site may load different ad networks or load them at different times, for example. Thus, comparing two different runs will often result in two different result sets even when no error occurs. I am not sure to which extent we can distinguish incorrect from just fewer results (e.g., cloud VM introduced wrong location is probably more incorrect than just omitted).
One suggestion could be:
Having fewer analysis results as a matter of having less data to analyze is not an issue if we say we are aiming for a lower bound. However, having incorrect analysis results would be problematic.
Also, I think we would not need to test (extensively) for different cloud VMs (i.e., running Privacy Pioneer on the test set with a cloud VM does not necessarily mean running on every country/state cloud VM). If we have reason to believe that it works for one cloud VM location, from a systematic viewpoint there is no reason to mistrust the other cloud VM locations, as they follow the same setup. This would only leave open the possibility of a location-specific problem with one particular cloud VM. I think it would be proper to state that as an assumption. We could also try running on different cloud VM locations to see how it goes, though.
Any thoughts? Is this thinking correct? Other ideas?
Edit: Changed "VPN" to "cloud VM"
You raise some good points, @SebastianZimmeck . Nate, Daniel, and I discussed this a little, and we had some thoughts that we'd like your opinion on. From your comment, it seems like there are two approaches that make the most sense to me.
The first option is essentially what you were describing: We take the test set that I constructed and perform a crawl both on the cloud VM as well as locally. We then shut down the VM, turn it on again, and run the test list again, performing the crawl three times for each machine (local and virtual). By repeating the crawl, we mitigate the chance of anomalies in our test data (a site temporarily down, etc.). Performing these test crawls would give us insight into how the VM infrastructure affects the evidence that Privacy Pioneer finds.
The second option would also entail 3 test crawls, but instead of crawling locally, someone manually clicks through the test list (preferably a subset, since this would take much longer than option one). Both the actual crawl and the manual "crawl" would be performed on the VM. This would give us insight into the effect that the crawler infrastructure has on the evidence that Privacy Pioneer finds. Alternatively, we could perform the manual crawl locally and verify both points (effect of VM, effect of crawl) at once. Once we gather this data, we should have a better idea of whether or not crawling is a good idea. I could take the lead on the manual crawl if that's the route that we go with, asking for help as needed.
What are your thoughts on this, @SebastianZimmeck ? (or anyone?)
Essentially we need to compare:
Privacy Pioneer
vs Privacy Pioneer + VM + Crawl infrastructure
So, there are two error sources:
It certainly would be more systematic to evaluate each source on its own (e.g., to pinpoint error rates to the sources).
Your second option, @dadak-dom, is going in the systematic direction. Here is a variation:
Manual crawl without VM
vs manual crawl on VM
-> evaluates impact of crawl component (i.e., manual crawl == normal Privacy Pioneer usage)

Real crawl locally
vs real crawl on VM
-> evaluates impact of VM

Would that make sense? Maybe, do each three times to account for normal fluctuation of loaded site elements.
Generally, what you are saying makes a lot of sense to me, @dadak-dom.
Here is a variation:
Your suggestion makes sense to me, @SebastianZimmeck . If you think it's alright, I'll start collecting the testing data as soon as possible 👍
Sounds good! Feel free to go ahead, @dadak-dom! And we can also discuss further in our meeting this week.
As discussed, @dadak-dom will begin to evaluate Privacy Pioneer vs Privacy Pioneer VM.
I have essentially finished the comparisons between Privacy Pioneer and Privacy Pioneer with a VM. Along with my results, I'm going to detail how I went about this analysis, and my reasoning for setting it up in this way. I think this will help get @atlasharry up to speed, and also hopefully will clear up any confusion that anyone may have.
With this analysis, I set out to uncover whether or not introducing a VM into our web crawl has a significant impact on any data that we would collect, as brought up here. Ideally, we would want to see that, for every category, the number of requests flagged by Privacy Pioneer on a VM is either the same or lower than what would be discovered via the control group (no VM and no crawler).
Based on the goals outlined above, I created a relatively simple, although time-consuming, procedure. From a bird's eye view, I essentially posed as a regular user, manually clicking through the 100-site test list I had created earlier. Here's the process in a little more detail. (Preliminary steps: Make sure that you have cloned both the crawler and the extension repos. Also make sure that you set the boolean flags that put Privacy Pioneer into crawl mode to true. This way, we can overwrite both the incorrect data that IPinfo sends, as well as the lat/long coordinates on the cloud.)
After running this process three times, we should have plenty of data for each of the 100 sites. However, since Privacy Pioneer records data in JSON format, I needed a way to meaningfully import the data I collected into a spreadsheet program. Thus, I created a short Python script that would take in all of the individual JSON entries and create a CSV file detailing how many pieces of evidence were acquired for each category for every site. Here's an example.
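The script itself is not included in the thread; a minimal sketch of the JSON-to-CSV tally might look like the following, assuming each evidence entry carries (hypothetical) "site" and "category" fields. The real Privacy Pioneer export format may differ:

```python
import csv
import json
from collections import Counter

def evidence_counts(json_path, csv_path):
    """Tally evidence entries per (site, category) and write a CSV.

    Assumes the JSON file holds a list of dicts, each with a "site" and
    a "category" field (hypothetical names, not verified against the
    actual Privacy Pioneer export).
    """
    with open(json_path) as f:
        entries = json.load(f)
    counts = Counter((e["site"], e["category"]) for e in entries)
    categories = sorted({cat for _, cat in counts})
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["site"] + categories)
        for site in sorted({s for s, _ in counts}):
            writer.writerow([site] + [counts[(site, c)] for c in categories])
```

Each output row then gives, for one site, the number of pieces of evidence found per category, which is the shape a spreadsheet program needs for the per-category comparisons.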
Since this is essentially an experiment involving matched pairs, I thought it would make sense to analyze my findings using a paired t-test. I would first find the average number of requests per category per website. In other words, how many advertising/ipAddress/whatever requests would I expect from example.com when "crawling" locally? How about on a VM? Then, I would run a paired t-test, treating the local data as a control group and the VM data as the treatment group to find a p-value for every category.
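For reference, the t statistic behind a paired t-test can be computed with the standard library alone; the per-site averages below are made-up numbers, not the actual measurements:

```python
import math

def paired_t_statistic(local, vm):
    """t statistic for a paired t-test: mean of the per-site
    differences divided by its standard error (df = n - 1)."""
    assert len(local) == len(vm)
    diffs = [a - b for a, b in zip(local, vm)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-site averages for one category (e.g., fineLocation),
# local serving as the control group and VM as the treatment group:
local_avgs = [1.0, 0.67, 2.33, 0.0, 1.67]
vm_avgs = [0.67, 0.33, 2.0, 0.0, 1.33]
t = paired_t_statistic(local_avgs, vm_avgs)
```

The p-value then comes from the t distribution with n - 1 degrees of freedom (e.g., via scipy.stats.ttest_rel, which computes both in one call).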
After our last discussion, I also decided to run the experiment in two additional locations (Columbus and Iowa). Here are the overall results, with a link to the entire Google Sheet if anyone is interested.
I have marked in green and red any statistically significant results. Based on this data, it seems like the only categories with any major differences to speak of are coarseLocation, fineLocation, region, and zipCode. For the first three, I think that these results are acceptable, since they indicate that there is statistical evidence for less data being generated on the cloud. It's not ideal, but we can work with it. The more interesting point, though, is zipCode, which seems to come up more in the cloud for 2/3 locations. This may seem strange, but I have some possible explanations for it.
Since nearly all instances of location gathering are done through the IP, it's not unlikely that this is a shortcoming of where I'm crawling from, rather than Privacy Pioneer underperforming. I've noticed for a while now that, when I'm connected to the campus wifi, the zip code can switch between two different values. I've looked at the data that passed through Privacy Pioneer, and it seems like there's no real difference in the amount of zip codes being taken; what does change, however, is the zip code that websites are taking. Also, notice the averages in the categories for coarseLocation and zipCode. Virtually, the average number of requests per site for these two categories doesn't seem to differ by more than +/- 0.05. Locally, however, they differ by 0.2. This (and also my own experience of inspecting these requests) indicates that requests containing coordinate data oftentimes also include zip codes. These results, however, support my hypothesis of zip code fluctuation, since this would mean that certain instances of zip codes would go unnoticed, while coordinate data would still get uncovered, leading to the current results.
To try and remedy this, I could perform the local crawl again, this time from a different location with a static IP. I could try my local library, but I guess that depends on what you think I should do next, @SebastianZimmeck.
Overall, does this make sense? Am I going in the right direction? Is this like what you had in mind, @SebastianZimmeck? Would this convince you (or anyone else) that collecting this data on a VM is valid?
Nice work, @dadak-dom!
As discussed, if you go a bit deeper, we will learn more about any inconsistent results.
I have been looking into what exactly causes the inconsistencies between runs on a VM (without a crawler), and here's what I've found. Generally, there seem to be two main causes for why a certain run would have a different number of requests. The simplest explanation, which I've confirmed to be the case for at least a portion of the inconsistencies, is that different site loads result in different requests being sent. This is especially common with requests in the monetization category. Sometimes, I would be able to see that the request was supposed to come in, but something on the website's end would go wrong, resulting in the request erroring out. Additionally, certain requests would be counted in more than one category, so one request not showing up could potentially affect other categories as well. I think it's safe to say that for these types of inconsistencies, there's not much that we can do.
However, the location category seemed to have an additional inconsistency. I traced Privacy Pioneer's output in the developer console and I found that, for certain geolocation APIs, the ML model would return inconsistent results, e.g., on one site load a request wouldn't get flagged, while a second load would result in region and zip code getting flagged. This would happen even though the request snippets appeared identical. I'm not very familiar with the setup of the ML, so I'm not sure whether this behavior is expected or not. Could it be due to this known issue? I'd imagine it doesn't help.
So, I've identified two sources of inconsistency when comparing one VM site visit to another. However, I'm a little confused about where I'm supposed to go from here. I'm not sure if either of these issues can really be "fixed" in a realistic time frame. Should I try to calculate the precision and recall for VM vs. Non VM to gauge whether or not there's a performance difference? Maybe just for the VM? Maybe there's a different direction I should take? Any input would be appreciated.
Edit: For the first point, I meant that I am confident that this is a site issue.
Update: It looks like my initial thoughts regarding monetization / tracking were incorrect... I've been able to find instances where Privacy Pioneer failed to flag a request. I'm currently looking into diagnosing the issue with the extension. My first thoughts are that this is related to how Privacy Pioneer takes in the requests to begin with, but I'll update more as I get more answers.
OK, good that you confirmed.
@dadak-dom found that the likely reason for the discrepancy in location and other analysis results is the exclusion of a resource type in the Privacy Pioneer analysis. Per @dadak-dom's PR:
Turns out that our HTTP request listener was filtering out requests that were initiated using the Beacon API (as opposed to XHR or Fetch). Certain requests would occasionally use Fetch on one run, but Beacon on the other, and so the extension would miss the latter requests completely.
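To make the failure mode concrete, here is a toy sketch (not Privacy Pioneer's actual listener code; the resource-type names follow the WebExtensions webRequest convention) of how a request-type filter that omits "beacon" silently drops those requests:

```python
# Toy model of an HTTP request listener filtered by resource type.
# Before the fix, "beacon" was not in the monitored set, so a payload
# sent via navigator.sendBeacon() was never analyzed, while the same
# payload sent via fetch()/XHR (type "xmlhttprequest") was flagged.
MONITORED_BEFORE = {"xmlhttprequest", "script", "image"}
MONITORED_AFTER = MONITORED_BEFORE | {"beacon"}

# Two hypothetical requests carrying the same tracking payload:
requests = [
    {"url": "https://tracker.example/geo", "type": "xmlhttprequest"},
    {"url": "https://tracker.example/geo", "type": "beacon"},
]

def analyzed(reqs, monitored):
    """Return only the requests the listener would actually see."""
    return [r for r in reqs if r["type"] in monitored]
```

With MONITORED_BEFORE, only the first request reaches the analysis; with MONITORED_AFTER, both do. Since a site may use fetch on one load and a beacon on the next, the pre-fix filter produces exactly the run-to-run discrepancies described above.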
@dadak-dom will do remaining testing to be sure and then add a small explanation in the Known Issues section of the Privacy Pioneer readme that can link to a more detailed explanation here.
Before we start the crawl, we need to test the crawler's performance. So, we need to compare the manually observed ground truth with the analysis results. We probably need a 100-site test set.
(@JoeChampeau and @jjeancharles feel free to participate here as well.)