privacy-tech-lab / privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
https://privacytechlab.org/
MIT License

Identify problem with Zip Code #17

Closed (dadak-dom closed this 3 months ago)

dadak-dom commented 4 months ago

As discussed in today's meeting, @natelevinson10 and I will look into identifying the problem (and a possible solution) for zip codes not showing up when crawling in other countries. The first step is probably to look into IPinfo, per @danielgoldelman.

SebastianZimmeck commented 4 months ago

Great! And as @danielgoldelman said, the issue may be that IPinfo is imprecise and provides a ZIP code that does not match the one collected/shared by the visited websites.

dadak-dom commented 3 months ago

Update: Will keep this issue open for now, as this was the motivation for switching to the Google Cloud VMs, with hard-coded latitude, longitude, and zip codes. Once we get up and running with testing, I'll document what values I used for each location and how I arrived at said values.

dadak-dom commented 3 months ago

Here are the zip code values that I've been using in the locations that I've set up. @danielgoldelman, if you could list the rest of them when you have the chance, that'd be great, so that we can refer back to them here. These values were obtained by looking at the ground truth of sites used during validation and identifying a common zip code. In other words, if multiple sites believe that we are at a certain zip, we assume that's our location and go from there.

India: 110001
Canada: M5A 1N7
Oregon: 97058
Germany: 10115
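
For reference, the "common zip code" approach described above could be sketched roughly as follows. This is only an illustration of the idea, not code from the crawler: the function name `most_common_zip` and the input format (a plain list of the zip strings that validation sites appeared to use) are assumptions.

```python
from collections import Counter


def most_common_zip(observed_zips):
    """Return the zip code reported by the most sites, or None if none was seen.

    `observed_zips` is a hypothetical list of zip strings, one per validation
    site, taken from the ground truth of what each site collected or shared.
    """
    # Normalize whitespace and case so "m5a 1n7" and "M5A 1N7" count together.
    cleaned = [z.strip().upper() for z in observed_zips if z and z.strip()]
    if not cleaned:
        return None
    # most_common(1) returns a one-element list of (value, count) pairs.
    zip_code, _count = Counter(cleaned).most_common(1)[0]
    return zip_code
```

If several sites agree on one value and a single site reports something else, the agreed value wins. Ties fall back to whichever value was seen first, so a tie is probably worth checking by hand.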

SebastianZimmeck commented 3 months ago

@danielgoldelman will write up how this works. The ZIP codes are changing.

danielgoldelman commented 3 months ago

Zip codes are not static because the physical server location is not fixed: we do not always get the same server each time the VMs load. Thus, we need to reassign the zip code (and lat/lng, city, region) each time we load the VMs to ensure that we are working with the correct location details.
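
One way to keep track of this is to treat the location details as a single record that gets rebuilt on every VM load and compared against the previous session. The sketch below is only an illustration under that assumption; `LocationDetails` and `changed_fields` are hypothetical names, not part of the crawler or the extension.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LocationDetails:
    """Hypothetical record of the values that must be reassigned per VM load."""
    zip_code: str
    latitude: float
    longitude: float
    city: str
    region: str


def changed_fields(previous, current):
    """Return the names of the fields that differ between two sessions."""
    return [
        f.name
        for f in fields(LocationDetails)
        if getattr(previous, f.name) != getattr(current, f.name)
    ]
```

An empty list means the previous session's values can be reused; anything else names the fields that need to be updated before crawling.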