Open di opened 4 years ago
@di I can help with this. A couple questions:
I downloaded the Geolite2 city DB and the file size is 60M. I don't think it's a good idea to check this into the git repo. Does Warehouse use S3 or something similar where we could store this?
Related to the above, how do you recommend making the DB available to the application? We could pull it from S3 during the Docker build, but I wonder if there's a better option like mounting it as a volume when the container starts. I'm not familiar with how Warehouse is deployed (though I did read https://warehouse.readthedocs.io/application/), so let me know how you typically handle this.
I downloaded the Geolite2 city DB and the file size is 60M. I don't think it's a good idea to check this into the git repo.
I'm surprised it's that large -- we might want to see if there are more lightweight options, or whether we can slim it down at all (IIUC it probably contains a lot of data we don't need). That said, we check in the development database which clocks in at more than 60MB, so this might be OK for a one-time thing.
Does Warehouse use S3 or something similar where we could store this?
We use a datastore to store PyPI's files, but it wouldn't really be appropriate to put this there. In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.
Related to the above, how do you recommend making the DB available to the application?
I think the easiest thing to do would be to add it into the repo and pull it in from there. Given the size though, I'm a little hesitant to say that's the best option.
I'm looking into lighter-weight options, taking inspiration from other libraries. There are also some recent license changes to the GeoLite2 DB that we'll need to review, but I'll look at other options first.
I looked at db-ip's City DB. It has a more permissive license, but it's even bigger: 85M.
Both providers offer a CSV format but in both cases, CSV is bigger than the corresponding MMDB file.
I don't know how we would slim down the MMDB file, and curating the CSV file seems like a lot of work, especially since they release regular updates and we may not want to be locked in to the version that we curate.
So it sounds like we can check one of the MMDB files into the git repo, or make an external API call - what was the reason that you didn't like the API call?
what was the reason that you didn't like the API call?
Potential added expense / external dependency, probably not worth it for this very small feature. Unless we could do this entirely on the frontend, in JS, for free... is that an option?
Agreed we probably don't want to curate CSV files.
Hmm. We could do it in JS if the user grants access to their location, but then we'd need to store that in a DB to look back at it for future logins. To get the location just from IP, I think we'd still need an API call from the JS code.
There are also country DBs that are much smaller (the Geolite country DB is less than 4M). But I don't think it helps us much to display the country of the user?
Ah, I meant call some API from JS, not correlate the user's location from their browser w/ their IP.
Another consideration for not using an API is maintaining privacy, i.e. keeping all the IPs w/in Warehouse.
I think just displaying the country is probably too vague to be useful.
Ah, I meant call some API from JS
If you are talking about a REST API (and not a library), wouldn't that also route all the IPs to an external location?
@di Is there some reason (perhaps legal) that we can't have the GeoLite2 or db-ip's actual databases in a Python package that we make a dependency of Warehouse, rather than adding them into this repository directly?
If we can do that, I feel like we should since we could have it be updated at some appropriate cadence and, more importantly, avoid making the git repository for this project bigger.
From https://github.com/pypa/warehouse/issues/8158#issuecomment-649235511:
In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.
And yes, I'm assuming we wouldn't be allowed to redistribute it.
Assuming that Warehouse can't store and redistribute the DB, there is a public BigQuery table under fh-bigquery.geocode.201806_geolite2_city_ipv4_locs which contains the data from GeoLite2. I'm not sure if this counts, but technically it's not an API call from JS, and it ensures that lookups happen from the backend during user events and the addresses stay within Warehouse. https://cloud.google.com/blog/products/data-analytics/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds#jump-content:~:text=Geolocating%20one%20IP%20address%20out%20of%20millions
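If the BigQuery route were pursued, the backend could bind the IP as a query parameter so raw addresses never appear in the SQL text or query logs. A rough sketch of building such a query; the column names (network_bin, mask, city_name, country_name) are guesses at the public table's schema, not verified, and the helper name is hypothetical:

```python
# Hypothetical sketch: build a parameterized BigQuery statement that
# resolves one IP to an approximate city. Column names are assumptions
# about the fh-bigquery table's schema, not verified.
GEO_TABLE = "fh-bigquery.geocode.201806_geolite2_city_ipv4_locs"


def build_geolocation_query(table: str = GEO_TABLE) -> str:
    """Return SQL that matches @ip against CIDR ranges in the table.

    The IP is bound as the @ip parameter rather than interpolated into
    the string, so raw addresses stay out of the query text itself.
    """
    return (
        "SELECT city_name, country_name "
        f"FROM `{table}` "
        "WHERE network_bin = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(@ip), mask) "
        "ORDER BY mask DESC LIMIT 1"
    )
```

Executing it would then go through the google-cloud-bigquery client, supplying @ip via a ScalarQueryParameter.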
And yes, I'm assuming we wouldn't be allowed to redistribute it.
https://db-ip.com/db/download/ip-to-country-lite is under https://creativecommons.org/licenses/by/4.0/, which does allow redistribution.
That's not the case for GeoLite2's DB though -- they changed licensing last year for California Consumer Privacy Act (CCPA) compliance: https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/
Some ideas here on implementing this with more privacy-protecting features around IP addresses as well: hash the X-Fastly-IP header to populate an X-PyPI-Hashed-IP header.

GeoIP and salting at the edge are done in https://github.com/pypi/infra/pull/123
Logging salted IPs is done in https://github.com/pypi/warehouse/pull/13389
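The salted-hashing idea above can be illustrated with a minimal Python sketch. This is not Warehouse's actual code (the real salting happens at the CDN edge, per the infra PR); the hash_ip helper and HMAC-SHA256 choice here are assumptions for illustration:

```python
# Minimal sketch of salted IP hashing (hypothetical helper, not
# Warehouse's actual implementation, which salts at the CDN edge).
import hashlib
import hmac


def hash_ip(ip: str, salt: bytes) -> str:
    """Return a salted, hex-encoded digest of an IP address.

    HMAC keyed with the salt means the raw IP never has to be stored,
    while identical IPs still map to identical digests, so events can
    be correlated without revealing the address itself.
    """
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

Rotating the salt breaks linkability across periods: the same IP produces unrelated digests under different salts.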
We now display GeoIP information if available: #13745
Ope, missed that this was a meta issue.
Begin storing hashed IPs everywhere for all events: #13716, #13744
Replace IP addresses in the user-facing UI (user/project events) with corresponding geolocation data: #13745
submitted_from column dropped from journals table in #13751 and #13752
What's the problem this feature will solve?
Currently in the PyPI logged-in UI, we show the user the IP address that performed certain actions:
I don't know my own IP offhand. Especially if there are multiple different IPs listed here, I would need to manually look up the approximate location where these came from to get an idea of whether they were actually me or not.
Describe the solution you'd like
It would be nice if PyPI also showed an (approximate) location for any given IP address, so I could easily filter out ones that seem incorrect, e.g.:
Additional context
This shouldn't require external API calls. Using something like https://pypi.org/project/geoip2/ with an embedded database like https://dev.maxmind.com/geoip/geoip2/geolite2/ would probably work.
Ideally this would be determined on the fly and not stored anywhere (e.g. along with the IP address), so if we someday replaced the mechanism with something more precise (or just updated the embedded DB) the updates would be immediately reflected.
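The on-the-fly lookup described above can be sketched against the geoip2 reader interface (reader.city(ip) returning an object with city/country attributes). The helper name and fallback behavior are illustrative assumptions, not Warehouse's actual implementation:

```python
# Hypothetical sketch: resolve an approximate location at render time,
# without persisting anything alongside the IP address.
def approximate_location(reader, ip):
    """Return 'City, Country' for an IP, or None if the lookup fails.

    Nothing is stored: the lookup runs against the embedded MMDB on the
    fly, so swapping in an updated database (or a more precise
    mechanism) is reflected immediately.
    """
    try:
        response = reader.city(ip)
    except Exception:
        # geoip2 raises an error for addresses not in the database;
        # fall back to showing nothing rather than failing the page.
        return None
    parts = [response.city.name, response.country.name]
    return ", ".join(p for p in parts if p) or None


# With the real library, the reader would be constructed once:
#   import geoip2.database
#   reader = geoip2.database.Reader("GeoLite2-City.mmdb")
```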
Todo list
Replace IP addresses in journals with corresponding hashed IP