Open alexanderjarvis opened 11 years ago
What OS is this on?
It was Debian 6.0.7 and java version "1.7.0_10"
I'm seeing this as well on Fedora 18.
To clarify, you're reporting that statsd is working, but occasionally you get these exceptions? What is your firewall configuration on the Play server? It seems the only thing that can cause this is a firewall on the local system blocking the send. Do you have Puppet or Chef periodically reconfiguring the firewall? Scripts that configure a firewall often start by clearing all rules and setting a default of blocking everything, and only moments later add the rules that permit traffic. Perhaps something like that is happening at the same moment that a stat is being sent?
@jroper thanks for the reply. Yes, statsd is working, but I'm getting this exception every once in a while. The firewall shouldn't be blocking anything outgoing, and there's nothing like Chef or Puppet running to keep configurations in place (I use Ansible and run it manually).
The only thing I can think of is that I'm running two instances of the Play app at the same time for downtime-free deployments. Could it be that when both instances try to hit statsd at the same moment, one of them loses and produces that error?
I've done a bit of research. It seems there is a known bug in netfilter (the Linux firewall) where packets that should simply be dropped because the network buffers are full instead cause the send to fail with EPERM (operation not permitted). So you could just ignore this, and possibly we could modify the statsd filter so that these messages can be filtered out of the logs.
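To make "just ignore this" concrete, here is a minimal sketch of a sender that treats EPERM on a UDP send as one more lost packet. It is written in Python for brevity, not in the Scala of the Play plugin, and `send_stat` and its parameters are hypothetical names, not part of any existing API.

```python
import errno
import socket


def send_stat(sock, payload, address):
    """Send one statsd datagram; return False if the kernel refused it with EPERM."""
    try:
        sock.sendto(payload, address)
        return True
    except OSError as exc:
        if exc.errno == errno.EPERM:
            # UDP stats are best effort, so count this as a dropped packet.
            return False
        # Anything else is unexpected and should still surface.
        raise


if __name__ == "__main__":
    # Assumes a statsd daemon on the default port 8125 of the local machine.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send_stat(sock, b"requests:1|c", ("127.0.0.1", 8125))
```

Only EPERM is swallowed here; other socket errors are re-raised so that real misconfiguration is not hidden.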
But that raises another issue: it sounds like there are times when the statsd reporting is overloading the network buffers. statsd has a nice feature for dealing with this, rate limiting through sampling. The Play StatsdFilter doesn't yet use it, but we could add a configuration option that lets you tune the rate, for example to 20%. With that setting, the filter only reports a stat 20% of the time, but it tells statsd that it is reporting 20% of the time, so statsd scales the metrics up and the numbers still come out as what happened 100% of the time.
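For anyone picking this up, a rough sketch of how a sampled counter looks on the wire. It is in Python rather than the plugin's Scala, and `sampled_counter` and the `rng` parameter are hypothetical names used only for illustration. The `|@0.2` suffix is the statsd sample-rate annotation that lets the server scale the count back up.

```python
import random


def sampled_counter(name, sample_rate, rng=random.random):
    """Return a statsd counter line, or None when this event is skipped."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    if sample_rate == 1:
        # No sampling: every event is reported, no annotation needed.
        return f"{name}:1|c"
    if rng() >= sample_rate:
        # Skipped this time; statsd compensates using the advertised rate.
        return None
    return f"{name}:1|c|@{sample_rate}"
```

With `sample_rate=0.2`, roughly one call in five returns `name:1|c|@0.2`, and statsd counts each of those as five events.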
I don't really have the time to do this myself now, but a pull request that does it will certainly be accepted.
Statsd seems to be working for me, but upon inspection of the application logs I found the following exception and could not reproduce it.
Either something is being sent to DatagramSocket that it cannot process (incomplete data, perhaps), or the underlying system is blocking that connection (but only for that one send, as stats are otherwise coming through fine).