Open camlee opened 2 years ago
Good catch! Having a DNS lookup for every message can be a huge slowdown!

Though, I'm hesitant to have this override the default behavior of `GELFUDPHandler`, as it's supposed to have connection properties similar to the `DatagramHandler`. This session-ed `GELFUDPHandler` should likely be defined in its own subclass.
One thing to consider is that with session-style UDP socket connections, things like NATs or stateful firewalls may over time decide to drop the connection, leaving the sender and receiver unaware that the socket is now dead. This is likely one of the reasons `DatagramHandler` uses `sendto` and re-establishes a connection for every message.
If possible, are you able to set up/verify DNS caching on the host running your graypy logger? That would likely reduce the number of DNS lookups hitting your DNS server while graypy logs multiple messages.
I'm on a team that runs multiple deployed apps in a variety of languages, both on virtual machines and within Kubernetes. We have indeed set up DNS caching on the host machine (in some cases on the Linux box, in other cases as a sidecar container within Kubernetes). This does work, but it introduces deployment complexity. For our Python microservices, we've been using a graypy fork with the code in this PR so that the problem is solved within the app: nice and simple.
As for things like NAT: yes, there is indeed a risk that the connection dies and data is lost. For us in particular, the IP address behind the DNS name for our Graylog server rotates every 2 weeks or so. This is why I included a mechanism to refresh the socket, and made it configurable: `sock_max_age`. We could lower the default from 5 minutes if you think that would be safer. For what it's worth, we also use pystatsd in our microservices, which also sends data out over UDP (metrics instead of logs, but the same basic idea), and discovered that it actually resolves the IP address on startup and uses it forever.
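The refresh idea described above can be sketched roughly like this. This is a minimal standalone sketch, not the PR's actual code; the class name `RefreshingUDPSender` and helper `_make_socket` are illustrative, and only `sock_max_age` comes from the PR:

```python
import socket
import time


class RefreshingUDPSender:
    """Illustrative sketch: a connected UDP sender that rebuilds its
    socket once it exceeds sock_max_age, so DNS changes are picked up."""

    def __init__(self, host, port, sock_max_age=300):
        self.host = host
        self.port = port
        self.sock_max_age = sock_max_age  # seconds; 300 = 5 minutes
        self.sock = None
        self.sock_created = 0.0

    def _make_socket(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # connect() performs the DNS lookup once; later sends reuse it
        sock.connect((self.host, self.port))
        return sock

    def send(self, payload):
        now = time.monotonic()
        if self.sock is None or now - self.sock_created > self.sock_max_age:
            if self.sock is not None:
                self.sock.close()
            # Rebuilding the socket re-resolves the hostname, so a
            # rotated DNS record is picked up within sock_max_age.
            self.sock = self._make_socket()
            self.sock_created = now
        self.sock.sendall(payload)
```

Lowering `sock_max_age` trades a few extra DNS lookups for faster pickup of a rotated address, which is why it makes sense as a tunable rather than a constant.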
I'm submitting this PR to help others who use graypy, particularly those whose systems are growing and who aren't going to find out about this DNS issue until it bites them in production. I think the out-of-the-box behavior can and should be efficient. Let me know if I can tweak or modify this PR in a way that makes it more adoptable. A new subclass for this "session-like" handler sounds reasonable to me, and I'll make that change if you want.
I ran into performance issues in prod because I'm using `GELFUDPHandler` with a host that requires a DNS lookup (our Graylog server is available at something like `monitoring.prod.example.com`). This resulted in a DNS lookup for every single log message, because the CPython `DatagramHandler` uses `socket.sendto()` like this: `self.sock.sendto(s, self.address)` (https://github.com/python/cpython/blob/main/Lib/logging/handlers.py#L732).

The production-impacting issue we had was overwhelming our internal DNS server with too much traffic, which slowed down logging and brought our web services to a crawl. Even for a healthy DNS server, the time to do a DNS lookup per log message is non-zero and worth saving, particularly for apps that do a lot of logging.
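The per-message lookup comes from that send path. Here's a minimal sketch of the same pattern (simplified, not the real handler, which pickles the log record first): when the address tuple contains a hostname, `sendto()` resolves it on every call.

```python
import socket


def send_per_message(payloads, host, port):
    """Mimic DatagramHandler.send(): pass (host, port) to sendto() for
    every message. When host is a DNS name rather than an IP address,
    the resolver runs on each call, one DNS lookup per log message."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for payload in payloads:
        # The address tuple is re-resolved here, every time
        sock.sendto(payload, (host, port))
    sock.close()
```

With one `sendto()` per log line, a chatty app translates directly into a matching stream of DNS queries.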
This Pull Request updates `GELFUDPHandler` to connect the socket to the address ahead of time using `socket.connect()` and then send the data using `socket.sendall()`. Using Wireshark, I've verified that the DNS lookup occurs during `connect()` and not `sendall()`. I've also been running this approach in production for months and seen the expected drop in traffic to the DNS server.

To handle DNS changes, the code does make a new socket periodically, and there's a `sock_max_age` parameter to control how often (I've set this to 1 hour in prod, but figured 5 minutes was a more reasonable default in the library here).
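The connect-then-send pattern at the heart of this boils down to the following. This is a standalone sketch under my own function names, not the PR's code (which also carries the `sock_max_age` refresh logic):

```python
import socket


def make_connected_sender(host, port):
    """connect() resolves host once and pins the socket to that
    address; subsequent sends skip DNS entirely."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))  # the one and only DNS lookup
    return sock


def send_messages(sock, payloads):
    for payload in payloads:
        # No address argument here, so no per-message resolution
        sock.sendall(payload)
```

One caveat of connected UDP sockets worth noting: an ICMP "port unreachable" from the peer can surface as an error on a later send, which an unconnected `sendto()` never reports; the handler just needs to tolerate that.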