haskell-github-trust / ekg-core

Library for tracking system metrics
BSD 3-Clause "New" or "Revised" License
40 stars 39 forks source link

Don't combine distribution if count == 0 #25

Closed mitchellwrosen closed 5 years ago

mitchellwrosen commented 6 years ago

This is one possible solution to #24.

23Skidoo commented 6 years ago

I'm not very familiar with this part of ekg, would be nice if @tibbe could take a look.

mitchellwrosen commented 5 years ago

Bump, thoughts @tibbe? :)

23Skidoo commented 5 years ago

@tibbe thinks it's good to go, merging.

23Skidoo commented 5 years ago

Merged, thanks!

pepeiborra commented 5 years ago

I don't know how, but this change led to a 100% reproducible lock-up in one of our EKG enabled services at work (Strats SCB). This is using 8.4.4. We forked ekg-core while preparing for an upgrade to 8.6.2. Unfortunately I cannot provide a repro for obvious reasons.

mitchellwrosen commented 5 years ago

Ouch. I'm sorry about that @pepeiborra. Can you provide any more information?

23Skidoo commented 5 years ago

Reverted, released a new bugfix version.

pepeiborra commented 5 years ago

I would love to @mitchellwrosen but I can't share any concrete artefacts from work. The only details I have:

adamse commented 5 years ago

Possible diagnosis:

The quick exit path in hs_distrib_add_n added here doesn't release the lock so any subsequent calls to hs_distrib_add_n with the same b param will wait forever in hs_lock(&b->lock).

nh2 commented 5 years ago
  hs_lock(&b->lock);

+  if (!b->count) {
+    return;
+  }

Returning after taking a lock, this cannot work.

mitchellwrosen commented 5 years ago

@nh2 Right, that's the obvious bug. I apologize :(

nh2 commented 5 years ago

@mitchellwrosen Can happen! It seems ekg needs better tests if a simple change like this can take down Standard Chartered's upgrade at runtime.

EDIT: I stand corrected. ekg has no tests at all.