tumblr / colossus

I/O and Microservice library for Scala
Apache License 2.0
1.14k stars 96 forks source link

Rate.hit race condition #485

Closed dre-hh closed 7 years ago

dre-hh commented 7 years ago

The following code crates 2 futures, each hitting the same rate 3 times. The flush interval is set to 1 second. After sleeping 3.5 second the procedure is repeated.

  def hitRate : Any = {
    0 to 1 foreach { _ =>
      Future {
        val c = Rate("rate")
        c.hit()
        c.hit()
        c.hit()
      }
    }
  }

   hitRate
   Thread.sleep(3500)
   hitRate

On the first run, the rate counts 5 hits, where on the second it correctly counts 6, even if pruning empty rates is set to true.

Is it something, which can be dealt with, or was it considerably not handled?

DanSimon commented 7 years ago

Are you able to reproduce this consistently? I tried this a bunch of times but haven't been able to reproduce.

I can see two possible ways that the results you're seeing could be happening, but they are extremely unlikely and I was only able to reproduce the first when I did 100,000 hits to the rate instead of 3.

  1. The value collection could happen in between the 5th and 6th hit, so the last hit is counted in the next collection interval, but overall no hits are lost.
  2. When pruneEmpty is enabled, it is possible a hit is lost if it occurs at the exact same moment the value is pruned, so it could happen if the collection and pruning happens just before the 1st hit when the rate had previously been 0 during the last collection interval.

It's possible there's another race condition somewhere, but at least right now I don't see any.

dre-hh commented 7 years ago

i am getting this consistently with this example https://github.com/tumblr/colossus/pull/486 regardless of prune-empty setting

DanSimon commented 7 years ago

Ah ha! I think I found the race condition. It's very odd that I could not reproduce this with my own test which looks similar to your example, but was able to consistently repeat it using your branch.

The race condition occurs if two threads try to increment the same rate at the same time, where the rate had previously never been used. This is why only the first collection interval shows a problem and afterwards all the hits are registered.

It's just a one line fix so I'll have a PR prepared shortly.

dre-hh commented 7 years ago

btw ive overseen how the examples are setup in the main class. ill close the small pull request with the metrics example. I think a better example would be something, where you hit counters over http, and then they are logged into a file