Open di opened 4 years ago
@di I can help with this. A couple questions:
I downloaded the Geolite2 city DB and the file size is 60M. I don't think it's a good idea to check this into the git repo. Does Warehouse use S3 or something similar where we could store this?
Related to the above, how do you recommend making the DB available to the application? We could pull it from S3 during the Docker build, but I wonder if there's a better option like mounting it as a volume when the container starts. I'm not familiar with how Warehouse is deployed (though I did read https://warehouse.readthedocs.io/application/), so let me know how you typically handle this.
I downloaded the Geolite2 city DB and the file size is 60M. I don't think it's a good idea to check this into the git repo.
I'm surprised it's that large -- we might want to see if there are more lightweight options, or whether we can slim it down at all (IIUC it probably contains a lot of data we don't need). That said, we check in the development database which clocks in at more than 60MB, so this might be OK for a one-time thing.
Does Warehouse use S3 or something similar where we could store this?
We use a datastore to store PyPI's files, but it wouldn't really be appropriate to put this there. In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.
Related to the above, how do you recommend making the DB available to the application?
I think the easiest thing to do would be to add it into the repo and pull it in from there. Given the size though, I'm a little hesitant to say that's the best option.
I'm looking into lighter-weight options, taking inspiration from other libraries. There are also some recent license changes to the GeoLite2 DB that we'll need to review, but I'll look at other options first.
I looked at db-ip's City DB. It has a more permissive license, but it's even bigger: 85M.
Both providers offer a CSV format but in both cases, CSV is bigger than the corresponding MMDB file.
I don't know how we would slim down the MMDB file, and curating the CSV file seems like a lot of work, especially since they release regular updates and we may not want to be locked in to the version that we curate.
So it sounds like we can check one of the MMDB files into the git repo, or make an external API call - what was the reason that you didn't like the API call?
what was the reason that you didn't like the API call?
Potential added expense / external dependency, probably not worth it for this very small feature. Unless we could do this entirely on the frontend, in JS, for free... is that an option?
Agreed we probably don't want to curate CSV files.
Hmm. We could do it in JS if the user grants access to their location, but then we'd need to store that in a DB to look back at it for future logins. To get the location just from IP, I think we'd still need an API call from the JS code.
There are also country DBs that are much smaller (the Geolite country DB is less than 4M). But I don't think it helps us much to display the country of the user?
Ah, I meant call some API from JS, not correlate the user's location from their browser w/ their IP.
Another consideration for not using an API is maintaining privacy, i.e. keeping all the IPs w/in Warehouse.
I think just displaying the country is probably too vague to be useful.
Ah, I meant call some API from JS
If you are talking about a REST API (and not a library), wouldn't that also route all the IPs to an external location?
@di Is there some reason (perhaps legal) that we can't have the GeoLite2 or db-ip's actual databases in a Python package that we make a dependency of Warehouse, rather than adding them into this repository directly?
If we can do that, I feel like we should since we could have it be updated at some appropriate cadence and, more importantly, avoid making the git repository for this project bigger.
From https://github.com/pypa/warehouse/issues/8158#issuecomment-649235511:
In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.
And yes, I'm assuming we wouldn't be allowed to redistribute it.
Assuming that Warehouse can't store and redistribute the DB, there is a public BigQuery table under fh-bigquery.geocode.201806_geolite2_city_ipv4_locs which contains the data from GeoLite2. I'm not sure if this counts, but technically it's not an API call from JS, and it ensures that lookups happen from the backend during user events and the addresses stay within Warehouse. https://cloud.google.com/blog/products/data-analytics/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds#jump-content:~:text=Geolocating%20one%20IP%20address%20out%20of%20millions
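If the BigQuery route were pursued, the backend could bind the IP as a query parameter so raw addresses never appear in the SQL text or query logs. A rough sketch of building such a query; the column names (network_bin, mask, city_name, country_name) are guesses at the public table's schema, not verified, and the helper name is hypothetical:

```python
# Hypothetical sketch: build a parameterized BigQuery statement that
# resolves one IP to an approximate city. Column names are assumptions
# about the fh-bigquery table's schema, not verified.
GEO_TABLE = "fh-bigquery.geocode.201806_geolite2_city_ipv4_locs"


def build_geolocation_query(table: str = GEO_TABLE) -> str:
    """Return SQL that matches @ip against CIDR ranges in the table.

    The IP is bound as the @ip parameter rather than interpolated into
    the string, so raw addresses stay out of the query text itself.
    """
    return (
        "SELECT city_name, country_name "
        f"FROM `{table}` "
        "WHERE network_bin = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(@ip), mask) "
        "ORDER BY mask DESC LIMIT 1"
    )
```

Executing it would then go through the google-cloud-bigquery client, supplying @ip via a ScalarQueryParameter.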
And yes, I'm assuming we wouldn't be allowed to redistribute it.
https://db-ip.com/db/download/ip-to-country-lite is under https://creativecommons.org/licenses/by/4.0/, which does allow redistribution.
That's not the case for GeoLite2's DB though -- they changed licensing last year for California Consumer Privacy Act (CCPA) compliance: https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/
Some ideas here on implementing this with more privacy-protecting features around IP addresses as well: hash the X-Fastly-IP header to populate an X-PyPI-Hashed-IP header.

GeoIP and salting at the edge are done in https://github.com/pypi/infra/pull/123
Logging salted IPs is done in https://github.com/pypi/warehouse/pull/13389
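The salted-hashing idea above can be illustrated with a minimal Python sketch. This is not Warehouse's actual code (the real salting happens at the CDN edge, per the infra PR); the hash_ip helper and HMAC-SHA256 choice here are assumptions for illustration:

```python
# Minimal sketch of salted IP hashing (hypothetical helper, not
# Warehouse's actual implementation, which salts at the CDN edge).
import hashlib
import hmac


def hash_ip(ip: str, salt: bytes) -> str:
    """Return a salted, hex-encoded digest of an IP address.

    HMAC keyed with the salt means the raw IP never has to be stored,
    while identical IPs still map to identical digests, so events can
    be correlated without revealing the address itself.
    """
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

Rotating the salt breaks linkability across periods: the same IP produces unrelated digests under different salts.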
We now display GeoIP information if available: #13745
Ope, missed that this was a meta issue.
Begin storing hashed IPs everywhere for all events: #13716, #13744
Replace IP addresses in the user-facing UI (user/project events) with corresponding geolocation data: #13745
submitted_from column dropped from journals table in #13751 and #13752
What's the problem this feature will solve?
Currently in the PyPI logged-in UI, we show the user the IP address that performed certain actions:
I don't know my own IP offhand. Especially if there are multiple different IPs listed here, I would need to manually look up the approximate location where these came from to get an idea of whether they were actually me or not.
Describe the solution you'd like
It would be nice if PyPI also showed an (approximate) location for any given IP address, so I could easily filter out ones that seem incorrect, e.g.:
Additional context
This shouldn't require external API calls. Using something like https://pypi.org/project/geoip2/ with an embedded database like https://dev.maxmind.com/geoip/geoip2/geolite2/ would probably work.
Ideally this would be determined on the fly and not stored anywhere (e.g. along with the IP address), so if we someday replaced the mechanism with something more precise (or just updated the embedded DB) the updates would be immediately reflected.
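The on-the-fly lookup described above can be sketched against the geoip2 reader interface (reader.city(ip) returning an object with city/country attributes). The helper name and fallback behavior are illustrative assumptions, not Warehouse's actual implementation:

```python
# Hypothetical sketch: resolve an approximate location at render time,
# without persisting anything alongside the IP address.
def approximate_location(reader, ip):
    """Return 'City, Country' for an IP, or None if the lookup fails.

    Nothing is stored: the lookup runs against the embedded MMDB on the
    fly, so swapping in an updated database (or a more precise
    mechanism) is reflected immediately.
    """
    try:
        response = reader.city(ip)
    except Exception:
        # geoip2 raises an error for addresses not in the database;
        # fall back to showing nothing rather than failing the page.
        return None
    parts = [response.city.name, response.country.name]
    return ", ".join(p for p in parts if p) or None


# With the real library, the reader would be constructed once:
#   import geoip2.database
#   reader = geoip2.database.Reader("GeoLite2-City.mmdb")
```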
Todo list
Replace IP addresses in journals with corresponding hashed IP