mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Fix TEMPORARY HACK in indexer/app.py for resolving stats and logging host names #232

Closed by philbudne 9 months ago

philbudne commented 10 months ago

To keep processes from ever being blocked by a frozen syslog sink or stats receiver, both use UDP. The underlying libraries pass the hostname through to the socket.sendto method, which performs a DNS lookup on EVERY CALL. This was throttling the Queuer class to under 200 stories/second (a problem when ingesting archives).

The TEMPORARY HACK was to resolve the hostnames on startup.
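As a rough illustration of that approach (not the actual code in indexer/app.py), the sketch below resolves a hostname once and then passes the numeric address to `sendto`, so no lookup happens per packet. The function names, the port, and the payload are hypothetical.

```python
import socket


def resolve_at_startup(hostname: str) -> str:
    """Resolve hostname to an IPv4 address string, once, at process start."""
    return socket.gethostbyname(hostname)


def send_datagram(sock: socket.socket, ip: str, port: int, payload: bytes) -> None:
    """Send one UDP datagram to a numeric address (no DNS lookup needed)."""
    sock.sendto(payload, (ip, port))
```

The weakness is visible in the sketch: the address returned by `resolve_at_startup` is never refreshed for the life of the process.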

This is relatively safe for stats: the destination hostname is "tarbell", and a proxy process on that host directs the packets to the statsd process in the grafana-graphite-statsd container. It is unsafe for logging, where the syslog-sink host is a container in the stack and its IP address could change when the container is restarted.

The fix is to subclass logging.handlers.SysLogHandler (and StatsdClient) to keep a one-entry cache holding the last hostname passed, the resolved IP address, and the timestamp of the resolution. If the hostname is the same and the timestamp is no more than a minute old, use the cached address; otherwise re-resolve the IP address and update the cache.
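A minimal sketch of the proposed cache, under stated assumptions: `CachedResolver` and `CachingSysLogHandler` are hypothetical names, and the handler relies on the CPython implementation detail that `SysLogHandler.emit` sends UDP datagrams to `self.address`. The injectable `resolve` and `clock` parameters exist only to make the cache testable without DNS. The StatsdClient side is not shown.

```python
import logging
import logging.handlers
import socket
import time
from typing import Callable, Optional, Tuple


class CachedResolver:
    """One-entry cache: last hostname, its resolved IP, and the resolution time."""

    def __init__(
        self,
        ttl: float = 60.0,
        resolve: Callable[[str], str] = socket.gethostbyname,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl
        self._resolve = resolve
        self._clock = clock
        self._entry: Optional[Tuple[str, str, float]] = None

    def lookup(self, hostname: str) -> str:
        now = self._clock()
        if self._entry is not None:
            host, ip, when = self._entry
            # Cache hit: same hostname and the entry is still fresh.
            if host == hostname and now - when <= self.ttl:
                return ip
        # Miss (first call, different hostname, or stale entry): re-resolve.
        ip = self._resolve(hostname)
        self._entry = (hostname, ip, now)
        return ip


class CachingSysLogHandler(logging.handlers.SysLogHandler):
    """Hypothetical UDP syslog handler that sends to a cached IP address."""

    def __init__(self, address: Tuple[str, int], **kwargs) -> None:
        self._host, self._port = address
        self._resolver = CachedResolver()
        super().__init__(address=address, **kwargs)

    def emit(self, record: logging.LogRecord) -> None:
        try:
            ip = self._resolver.lookup(self._host)
        except OSError:
            # Resolution failed; report it the way logging handlers normally do.
            self.handleError(record)
            return
        # Assumption: the parent emit() calls sendto(msg, self.address).
        self.address = (ip, self._port)
        super().emit(record)
```

With this shape, a restarted syslog-sink container would be picked up within one TTL period, while steady-state traffic pays for at most one lookup per minute. The handler would be attached in the usual way, for example `logger.addHandler(CachingSysLogHandler(("syslog-sink", 514)))`, where the hostname and port are placeholders.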