pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Geolocate user IP addresses when presenting them in UI #8158

Open di opened 4 years ago

di commented 4 years ago

What's the problem this feature will solve?

Currently in the PyPI logged-in UI, we show the IP address that performed certain actions to the user:

[Screenshot: Screen Shot 2020-06-24 at 6 37 55 PM]

I don't know my own IP offhand. Especially if there are multiple different IPs listed here, I would need to manually look up the approximate location where these came from to get an idea of whether they were actually me or not.

Describe the solution you'd like

It would be nice if PyPI also showed me an (approximate) location for any given IP address as well, so I could easily visually filter ones that seem incorrect, e.g.:

Event      Date / time               IP address
Logged in  less than 10 seconds ago  11.22.11.22 (Austin TX USA)
Logged in  June 22, 2020             22.33.22.33 (Austin TX USA)
Logged in  June 19, 2020             44.55.44.55 (Timbuktu, Mali)
Logged in  June 19, 2020             66.77.66.77 (Austin TX, USA)

Additional context

This shouldn't require external API calls. Using something like https://pypi.org/project/geoip2/ with an embedded database like https://dev.maxmind.com/geoip/geoip2/geolite2/ would probably work.

Ideally this would be determined on the fly and not stored anywhere (e.g. along with the IP address), so if we someday replaced the mechanism with something more precise (or just updated the embedded DB) the updates would be immediately reflected.
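The on-the-fly lookup described above could be sketched roughly as follows. This is a minimal illustration, not Warehouse's implementation: `format_location`, `describe_ip`, and `make_geoip2_lookup` are hypothetical names, the MMDB path is whatever file you supply, and the `geoip2` import is deferred so the display helpers run without the library installed. Nothing is stored next to the IP address; the location is resolved each time the row is rendered.

```python
from typing import Callable, Optional, Tuple

# (city, region, country); any part may be None when the database lacks it.
Location = Tuple[Optional[str], Optional[str], Optional[str]]


def format_location(location: Optional[Location]) -> str:
    """Join the known parts of a location; empty string if nothing is known."""
    if location is None:
        return ""
    return ", ".join(part for part in location if part)


def describe_ip(ip: str, lookup: Callable[[str], Optional[Location]]) -> str:
    """Render an IP for display, appending an approximate location if found."""
    place = format_location(lookup(ip))
    return f"{ip} ({place})" if place else ip


def make_geoip2_lookup(mmdb_path: str) -> Callable[[str], Optional[Location]]:
    """Build a lookup backed by a local MMDB file (assumes geoip2 is installed)."""
    import geoip2.database  # imported lazily so the helpers above work without it
    import geoip2.errors

    reader = geoip2.database.Reader(mmdb_path)

    def lookup(ip: str) -> Optional[Location]:
        try:
            response = reader.city(ip)
        except geoip2.errors.AddressNotFoundError:
            return None  # address not present in the embedded database
        return (
            response.city.name,
            response.subdivisions.most_specific.iso_code,
            response.country.name,
        )

    return lookup
```

Because `describe_ip` only depends on a callable, swapping in an updated database or a different provider would change the displayed locations immediately, without touching stored events.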

Todo list

sanjaysiddhanti commented 4 years ago

@di I can help with this. A couple questions:

di commented 4 years ago

> I downloaded the Geolite2 city DB and the file size is 60M. I don't think it's a good idea to check this into the git repo.

I'm surprised it's that large -- we might want to see if there are more lightweight options, or whether we can slim it down at all (IIUC it probably contains a lot of data we don't need). That said, we check in the development database which clocks in at more than 60MB, so this might be OK for a one-time thing.

> Does Warehouse use S3 or something similar where we could store this?

We use a datastore to store PyPI's files, but it wouldn't really be appropriate to put this there. In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.

> Related to the above, how do you recommend making the DB available to the application?

I think the easiest thing to do would be to add it into the repo and pull it in from there. Given the size though, I'm a little hesitant to say that's the best option.

sanjaysiddhanti commented 4 years ago

I'm looking into lighter-weight options, taking inspiration from other libraries. There are also some recent license changes to the Geolite2 DB that we'll need to review, but I'll first look at other options.

sanjaysiddhanti commented 4 years ago

I looked at db-ip's City DB. It has a more permissive license, but it's even bigger - 85M.

Both providers offer a CSV format but in both cases, CSV is bigger than the corresponding MMDB file.

I don't know how we would slim down the MMDB file, and curating the CSV file seems like a lot of work, especially since they release regular updates and we may not want to be locked in to the version that we curate.

So it sounds like we can check one of the MMDB files into the git repo, or make an external API call - what was the reason that you didn't like the API call?

di commented 4 years ago

> what was the reason that you didn't like the API call?

Potential added expense / external dependency, probably not worth it for this very small feature. Unless we could do this entirely on the frontend, in JS, for free... is that an option?

Agreed we probably don't want to curate CSV files.

sanjaysiddhanti commented 4 years ago

Hmm. We could do it in JS if the user grants access to their location, but then we'd need to store that in a DB to look back at it for future logins. To get the location just from IP, I think we'd still need an API call from the JS code.

There are also country DBs that are much smaller (the Geolite country DB is less than 4M). But I don't think it helps us much to display the country of the user?

di commented 4 years ago

Ah, I meant call some API from JS, not correlate the user's location from their browser w/ their IP.

Another consideration for not using an API is maintaining privacy, i.e. keeping all the IPs w/in Warehouse.

I think just displaying the country is probably too vague to be useful.

SanketDG commented 4 years ago

> Ah, I meant call some API from JS

If you are talking about a REST API (and not a library), wouldn't that also route all the IPs to an external location?

pradyunsg commented 4 years ago

@di Is there some reason (perhaps legal) that we can't have the Geolite2 or db-ip's actual databases in a Python package that we make a dependency of warehouse, and not add them into this repository directly?

If we can do that, I feel like we should since we could have it be updated at some appropriate cadence and, more importantly, avoid making the git repository for this project bigger.

di commented 4 years ago

From https://github.com/pypa/warehouse/issues/8158#issuecomment-649235511:

> In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.

di commented 4 years ago

And yes, I'm assuming we wouldn't be allowed to redistribute it.

patelneel55 commented 4 years ago

Assuming that Warehouse can't store and redistribute the DB, there is a public BigQuery table under fh-bigquery.geocode.201806_geolite2_city_ipv4_locs which contains the data from Geolite2. I'm not sure if this counts, but technically it's not an API call from JS, and it ensures that the calls are made from the backend during user events and the addresses stay within Warehouse. https://cloud.google.com/blog/products/data-analytics/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds#jump-content:~:text=Geolocating%20one%20IP%20address%20out%20of%20millions

pradyunsg commented 4 years ago

> And yes, I'm assuming we wouldn't be allowed to redistribute it.

https://db-ip.com/db/download/ip-to-country-lite is under https://creativecommons.org/licenses/by/4.0/, which does allow redistribution.

That's not the case for Geolite2's DB though -- they changed licensing last year for California Consumer Privacy Act (CCPA) compliance: https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/

di commented 1 year ago

Some ideas here on implementing this with more privacy-protecting features around IP addresses as well:

di commented 1 year ago

GeoIP and salting at edge are done in https://github.com/pypi/infra/pull/123

di commented 1 year ago

Logging salted IPs is done in https://github.com/pypi/warehouse/pull/13389
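For readers unfamiliar with the idea, salting an IP before logging it can be sketched as below. This is only an illustration under stated assumptions: the thread does not say which scheme the linked PRs use, so the keyed hash here (HMAC-SHA256), the `hash_ip` name, and the example salt are all hypothetical stand-ins.

```python
import hashlib
import hmac


def hash_ip(ip: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of the IP address as a hex string.

    Assumption: HMAC-SHA256 with a secret salt; the real scheme may differ.
    """
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()


# The same IP and salt always give the same digest, so events can still be
# grouped by origin without the raw address being written to the log.
example = hash_ip("44.55.44.55", b"hypothetical-secret-salt")
```

Keeping the salt secret matters here: the IPv4 space is small enough that unsalted hashes could be reversed by enumerating every address.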

ewdurbin commented 1 year ago

We now display GeoIP information if available: #13745

ewdurbin commented 1 year ago

Ope, missed that this was a meta issue.

ewdurbin commented 1 year ago

Begin storing hashed IPs everywhere for all events: #13716, #13744

Replace IP addresses in the user-facing UI (user/project events) with corresponding geolocation data #13745

ewdurbin commented 1 year ago

submitted_from column dropped from journals table in #13751 and #13752