@reichert621 ^^ As discussed, some thoughts on geocoding implementations 👍
Consolidated discussion from the Slack thread (https://papercups-io.slack.com/archives/C0189MJHKMJ/p1599057225003600?thread_ts=1599005083.000900&cid=C0189MJHKMJ):
Closing this issue. Using browser information is good enough.
Following up on https://github.com/papercups-io/papercups/issues/57 which fixes capturing the end-customer's IP address correctly:
Now that the `customer`'s IP address is being captured correctly, it's possible to do some geocoding to find out the country & city of the customer and store that data on the `customer` record.

A few ideas I wanted to kick around regarding implementation:
Option 1 (self-host free MaxMind data + ETS or Redis Cache):
Use `geolix` and `geolix_adapter_mmdb2`. When someone is self-hosting, they would then be required to register for their own MaxMind account and obtain an API key to download a copy of the database: https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/
The database is updated on Tuesday each week, which means that the version the developer downloads today and starts using will be outdated the following Tuesday when a new version of the database is released. Whether data "freshness" matters much for the purposes here, I cannot say for certain. I would assume that the database doesn't change that frequently -- perhaps especially so for IPv4 addresses, maybe less so for IPv6 addresses, which are still coming online.
It should be fairly trivial to set up an Oban job to grab the latest version of the database each week, but this first implementation pass should probably be more narrowly scoped/naive and assume that the MaxMind database downloaded by the developer on day 1 is already present on app boot-up.
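For what it's worth, a rough sketch of what that weekly refresh job could eventually look like -- the module name, env vars, and download/extract details below are all hypothetical and would need to be checked against MaxMind's actual download endpoint:

```elixir
defmodule ChatApi.Workers.RefreshGeolocationDb do
  @moduledoc """
  Hypothetical weekly job that re-downloads the MaxMind GeoLite2 database.
  Sketch only: assumes HTTPoison is available and a MAXMIND_LICENSE_KEY env var
  is set; the real download/untar logic should follow MaxMind's docs.
  """
  use Oban.Worker, queue: :default, max_attempts: 3

  @impl Oban.Worker
  def perform(_job) do
    license_key = System.fetch_env!("MAXMIND_LICENSE_KEY")
    db_path = System.get_env("MAXMIND_DB_PATH", "priv/GeoLite2-City.mmdb")

    url =
      "https://download.maxmind.com/app/geoip_download" <>
        "?edition_id=GeoLite2-City&license_key=#{license_key}&suffix=tar.gz"

    with {:ok, %{status_code: 200, body: tarball}} <-
           HTTPoison.get(url, [], follow_redirect: true),
         :ok <- extract_mmdb(tarball, db_path) do
      # The geocoding library (e.g. geolix) would then need to reload/re-read
      # the database file -- left out of this sketch.
      :ok
    end
  end

  # Placeholder: untar the archive and copy the .mmdb file into place.
  defp extract_mmdb(_tarball, _db_path), do: :ok
end
```

Scheduling could then be handled via Oban's cron support (the exact config differs between Oban versions), but again, the naive first pass can skip all of this.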
`geolix` appears to take the MaxMind DB file and load it into an `ets` table, so lookups will be fast once the app is booted up. The potential issue with this: the DB is 125MB+ zipped before loading that data into an `ets` table (in-memory cache). I'm not certain how much memory will be taken up once the full dataset is loaded into memory locally for each pod/node of Elixir running, but I'd imagine that the RAM usage per pod may increase by 50-150MB+. It's possible that the `.mmdb` file format is less efficient than `ets` storage, so the on-disk size may not reflect the in-memory size, but I'd have to dig in further and discuss this here before spending more time on it. Startup time for the pods/nodes will also increase if every node has to build a local cached `ets` table of the data -- by how much, I couldn't say for certain until this is prototyped & tested.

Another option would be to introduce a centralized cache, i.e. Redis. This way, the Elixir pods don't have to store the data locally and there's a central 'source of truth' for the lookup data. The downsides are: (a) another moving piece of infrastructure to set up and maintain; (b) cost: Redis is expensive to run; (c) lower performance than ETS (the Redis instance is typically running on a separate server/managed service, necessitating a network call).
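For reference, wiring up `geolix` with the MMDB2 adapter looks roughly like this (based on my reading of the geolix README, so treat the details as assumptions; the database id and path are placeholders):

```elixir
# config/config.exs (or runtime config) -- point geolix at a local GeoLite2 City file
config :geolix,
  databases: [
    %{
      id: :city,
      adapter: Geolix.Adapter.MMDB2,
      source: System.get_env("MAXMIND_DB_PATH", "priv/GeoLite2-City.mmdb")
    }
  ]
```

A lookup would then be something like `Geolix.lookup({8, 8, 8, 8}, where: :city)` (a string IP would need to be parsed first, e.g. with `:inet.parse_address/1`).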
Option 2 (call an external service):
Look into implementing something like https://github.com/navinpeiris/geoip where an external service is called (providing an API key) and then layer in some caching. The caching wouldn't necessarily be beneficial (if at all) in a multi-node setup unless it's distributed via `libcluster`, i.e. a customer could hit node A on the first request and the result is cached locally on node A, but their second request is served by node B, which doesn't have that data locally cached.

`ets` lookups in Option 1 are, in all likelihood, going to be at least 10x faster than making an external HTTP request (~10ms vs ~100ms+). External services also charge "per lookup", with limits on cache duration in their terms of service (if they permit caching at all -- this seems to vary by service). Option 2 could end up being a lot more costly and significantly less performant (unless the determination and settlement of the country/city lookup data is made eventually consistent by kicking off an async process with a `Task` or simple `GenServer` that deals with the business logic of doing the lookup and saving the customer's country/city off of the request path, so failure/latency concerns are less of an issue).
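To make that last point concrete, a minimal sketch of pushing the lookup off of the request path -- the module names, `lookup_ip/1`, and the customer fields are hypothetical and would need to match the real schema/supervision tree:

```elixir
defmodule ChatApi.Customers.Geocode do
  @moduledoc """
  Sketch of kicking the country/city lookup off of the request path.
  `lookup_ip/1`, `ChatApi.TaskSupervisor`, and the customer fields are
  hypothetical; the lookup itself could be geolix (Option 1) or an external
  service (Option 2) behind the same function.
  """
  alias ChatApi.Customers

  def enqueue(customer) do
    # Fire-and-forget: failures/latency here never block the HTTP request.
    Task.Supervisor.start_child(ChatApi.TaskSupervisor, fn ->
      with {:ok, %{country: country, city: city}} <- lookup_ip(customer.ip_address) do
        Customers.update_customer(customer, %{country: country, city: city})
      end
    end)
  end

  # Placeholder for whichever lookup strategy (Option 1 or 2) wins out.
  defp lookup_ip(_ip), do: {:error, :not_implemented}
end
```

A `GenServer` with a small queue would work just as well if we want retries/backoff; the key point is that the HTTP request never waits on the lookup.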
Other considerations:

Whichever approach is chosen, should we make geocoding optional through a runtime environment flag? i.e. optional for those who want geocoding, but not mandatory for the app to run?
In the event that a customer changes locations/IPs: currently, the IP address appears to be saved when the record is initially created and not updated when/if the customer changes IP address (correct me if I'm wrong). How important is it to maintain or change this behavior? A customer who is traveling/moving between timezones should probably have their current IP address stored in the DB at any point in time?
Does the `ip_inferred_country` (or whatever the property is called) need to be saved in the database, or can it be an Ecto virtual field that only the frontend cares about and that is generated dynamically on-the-fly? One benefit of storing the field in the DB and making a call to `ets` or an external service is that, if Option 2 is viable, a call would only have to be made when (1) the customer is initially created and (2) on subsequent requests, only if the `ip_address` stored on the customer record doesn't match that request's current IP address (rough sketch of this at the bottom).

Keen to hear your thoughts!
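For that last consideration, the "only look up when needed" check could be as simple as something like the following (a sketch; `lookup_and_store/1` and the field names are hypothetical):

```elixir
defmodule ChatApi.Customers.MaybeGeocode do
  @moduledoc "Sketch: only geocode on create, or when the request IP differs from the stored one."

  # `lookup_and_store/1` is a hypothetical helper that does the ets/external
  # lookup and persists country/city (see the async sketch under Option 2).
  def call(customer, request_ip) do
    if is_nil(customer.ip_address) or customer.ip_address != request_ip do
      lookup_and_store(%{customer | ip_address: request_ip})
    else
      {:ok, customer}
    end
  end

  defp lookup_and_store(customer), do: {:ok, customer}
end
```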