sivy / pystatsd

Python implementation of the Statsd client/server
http://pypi.python.org/pypi/pystatsd/
BSD 2-Clause "Simplified" License
358 stars 87 forks source link

Race condition between __record_timer() and flush() #86

Open oseemann opened 10 years ago

oseemann commented 10 years ago

Occasionally our pystatsd instance freezes, maybe in a similar way as in #40. Strace showed a KeyError in the following line:

See server.py:118

if key not in self.timers:
            self.timers[key] = [ [], ts ]
self.timers[key][0].append(float(value or 0))
self.timers[key][1] = ts  # <--- KeyError

It seems that the check ensures that the key is always present in self.timers. However, there's a second thread running, flushing the contents of self.timers periodically, and also deleting the contents of self.timers.

See server.py:228

del(self.timers[k])

When the flush happens right between the check in line 115 and the access in lines 117 and 118 then a KeyError will be raised. We think this is exactly what happens.

We haven't seen it happen in self.__record_counter, but it looks vulnerable in the same way.

We haven't had time yet to come up with a patch, but maybe something like this would work:

Use a threading.Lock() to synchronize access to self.timers and self.counters. The self.flush() involves network communication and can therefore take a long time. We don't want to block the receiving code during that time. Therefore maybe just use the lock to (deep-)copy self.timers and reset it. Then release the lock and flush from the copy. (Or not copy at all, but just switch references, i.e. assign self.timers to a second variable pointing to the same dictionary and then assign self.timers an empty dict and continue flushing from the new variable.)

oseemann commented 10 years ago

I just checked the master code and it doesn't involve the key check anymore (we're using 0.1.10). So maybe that bug is fixed. What's the ETA for the next release?

oseemann commented 10 years ago

I looked at the changeset that fixes the problem.

It now looks like this:

timer = self.timers.setdefault(key, [ [], ts ])
timer[0].append(float(value or 0))
timer[1] = ts

When during flush the key is now deleted directly after setdefault returns, then the new value will be stored only into the local timer value. And, I believe, it will simply be lost and never flushed.

I think that's OK, depending on the guarantees that pystatsd wants to make. If messages should not be dropped, then a different solution must be chosen.

sivy commented 10 years ago

@vvuksan Can you take a look at this when you get some time?