mitchellwrosen closed 5 years ago
I'm not very familiar with this part of ekg, would be nice if @tibbe could take a look.
Bump, thoughts @tibbe? :)
@tibbe thinks it's good to go, merging.
Merged, thanks!
I don't know how, but this change led to a 100% reproducible lock-up in one of our EKG-enabled services at work (Strats SCB). This is using GHC 8.4.4. We forked ekg-core while preparing for an upgrade to 8.6.2. Unfortunately I cannot provide a repro for obvious reasons.
Ouch. I'm sorry about that @pepeiborra. Can you provide any more information?
Reverted, released a new bugfix version.
I would love to @mitchellwrosen but I can't share any concrete artefacts from work. The only details I have:
+RTS -De -Ds
iirc. This suggested it was an unsafe foreign call stealing the thread away from the Haskell scheduler.
Possible diagnosis:
The quick exit path in hs_distrib_add_n added here doesn't release the lock, so any subsequent calls to hs_distrib_add_n with the same b param will wait forever in hs_lock(&b->lock).
  hs_lock(&b->lock);
+ if (!b->count) {
+   return;
+ }
Returning after taking a lock, this cannot work.
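For illustration, here is a minimal sketch of how an early exit can be written so the lock is always released. This is not ekg's actual code: the struct layout, the `demo_` names, and the `n == 0` exit condition are all hypothetical stand-ins, and the spinlock is built on C11 `atomic_flag` rather than on whatever `hs_lock` uses internally.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a lock-protected distribution; not ekg's layout. */
struct demo_distrib {
    atomic_flag lock;   /* spinlock guarding the fields below */
    int64_t count;
    double sum;
};

/* Spin until the flag is acquired. */
static void demo_lock(atomic_flag *lock) {
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        /* busy-wait until the current holder clears the flag */
    }
}

/* Release the flag so the next caller can acquire it. */
static void demo_unlock(atomic_flag *lock) {
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Record n samples of value val. The early exit unlocks before returning;
 * omitting that unlock makes the next call on the same struct spin forever. */
static void demo_add_n(struct demo_distrib *d, double val, int64_t n) {
    demo_lock(&d->lock);
    if (n == 0) {
        demo_unlock(&d->lock);  /* every exit path must release the lock */
        return;
    }
    d->count += n;
    d->sum += val * (double)n;
    demo_unlock(&d->lock);
}
```

If the exit condition does not depend on state guarded by the lock, as with `n == 0` in this sketch, checking it before taking the lock avoids the problem entirely. A condition like the `!b->count` check in the diff reads a guarded field, so it would need the unlock-before-return form.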
@nh2 Right, that's the obvious bug. I apologize :(
@mitchellwrosen Can happen! It seems ekg needs better tests if a simple change like this can take down Standard Chartered's upgrade at runtime.
EDIT: I stand corrected. ekg has no tests at all.
This is one possible solution to #24.