ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Add support for keeping the Geo-IP database updated for Domain Crawls #123

Open anjackson opened 7 months ago

anjackson commented 7 months ago

For Domain Crawls, we rely on [GeoLite2 Free Geolocation Data](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data) to find URLs that are in the UK but not on UK domain names.

MaxMind no longer allows unauthenticated downloads of that DB file, so we need to find a different way to keep it up to date. This likely means using the GEOLITE2_CITY_MMDB_LOCATION configuration option to map the DB file in from the host rather than using the version embedded in the ukwa/ukwa-heritrix container, and then documenting how to update it as part of the DC setup process.
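
As a rough illustration of what a pre-crawl check in the DC setup process might look like, here is a minimal Python sketch that reports whether a host-side copy of the DB file is missing or stale. The default path, the 30-day threshold, and the idea of reading GEOLITE2_CITY_MMDB_LOCATION from the host environment are all assumptions made for the example; none of this is existing ukwa-services tooling, and the sketch does not download anything from MaxMind.

```python
import os
import time
from pathlib import Path

# Hypothetical default location of the host-side DB file (an assumption).
DEFAULT_MMDB_PATH = "/data/geoip/GeoLite2-City.mmdb"
SECONDS_PER_DAY = 86400.0


def mmdb_age_days(path, now=None):
    """Return the age of the file in days, based on its modification time."""
    now = time.time() if now is None else now
    return (now - Path(path).stat().st_mtime) / SECONDS_PER_DAY


def needs_refresh(path, max_age_days=30, now=None):
    """Return True if the DB file is missing or older than max_age_days."""
    p = Path(path)
    if not p.is_file():
        return True
    return mmdb_age_days(p, now) > max_age_days


if __name__ == "__main__":
    # Assumption: the same variable name is set on the host for convenience.
    location = os.environ.get("GEOLITE2_CITY_MMDB_LOCATION", DEFAULT_MMDB_PATH)
    if needs_refresh(location):
        print(f"{location} is missing or stale; update it before the crawl.")
    else:
        print(f"{location} looks recent enough.")
```

A check like this could run before the crawl containers are started, so that an out-of-date DB file is noticed during setup rather than partway through a Domain Crawl.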