Closed johnsusi closed 4 years ago
I'm able to remove the deadlock with the following change:
internal bool Remove(TKey key, out TValue value, int timeout)
{
    bool rv = false;
    bool wasAtCapacity = false;
    value = default(TValue);

    lock (dLock)
    {
        if (!finished)
        {
            // Check and wait while empty.
            while (d.Count == 0)
            {
                if (timeout < 0)
                {
                    Monitor.Wait(dLock);
                }
                else if (timeout > 0)
                {
                    if (!Monitor.Wait(dLock, timeout))
                    {
                        throw new Exception("timeout");
                    }
                }
            }

            if (!finished)
            {
                rv = d.TryGetValue(key, out value);
            }
        }

        if (rv)
        {
            wasAtCapacity = d.Count >= maxSize;
            d.Remove(key);
        }
    }

    // wasAtCapacity is updated inside the lock,
    // so the pulse on addLock is moved outside of dLock.
    if (wasAtCapacity)
    {
        lock (addLock)
        {
            Monitor.Pulse(addLock);
        }
    }

    return rv;
} // Remove
I'm evaluating whether other changes may be more appropriate, such as not holding StanConnection.mu when calls to BlockingDictionary.Remove are made.
@watfordgnf , really appreciate you taking a look at this. Thanks!
Wow that was quick, thanks :)
We solved it temporarily by replacing the wait for space with a simple sleep/retry loop:
while (!pubAckMap.TryAdd(guidValue, a))
{
    var bd = pubAckMap;
    Monitor.Exit(mu);
    // Wait for space outside of the lock so
    // acks can be removed.
    // bd.waitForSpace();
    Thread.Sleep(10);
    Monitor.Enter(mu);
    if (nc == null)
    {
        throw new StanConnectionClosedException();
    }
}
It should not affect performance, since this only happens when we are publishing too fast anyway.
Looking forward to trying out your changes.
@johnsusi I applied a very similar fix as well, such that it will not wait for space in the pubAckMap any longer than the ping interval, which fixes a second deadlock seen with unexpected connection failures. I removed the deadlock you saw by ensuring addLock and dLock are never held along the same path.
Another interesting bit: pubAckMap.Remove no longer needs protection under mu, removing a possible point of contention under load.
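The fix described here might look roughly like the following sketch. This is not the actual stan.net code: the class name BoundedMap, its fields, and the timeout plumbing are all assumptions. It only illustrates the two properties mentioned above, a bounded wait for space and never holding addLock and dLock along the same path:

```csharp
using System.Collections.Generic;
using System.Threading;

// Minimal sketch (names are illustrative, not the real stan.net types):
// a capacity-bounded map whose TryAdd waits for space with a timeout, and
// whose Remove signals waiters only after releasing the map lock.
class BoundedMap<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> d = new Dictionary<TKey, TValue>();
    private readonly object dLock = new object();   // guards d
    private readonly object addLock = new object(); // waited on when full
    private readonly int maxSize;

    public BoundedMap(int maxSize) { this.maxSize = maxSize; }

    // Returns false if no space became available within timeoutMillis
    // (e.g. the ping interval), instead of blocking forever.
    public bool TryAdd(TKey key, TValue value, int timeoutMillis)
    {
        while (true)
        {
            lock (dLock)
            {
                if (d.Count < maxSize)
                {
                    d[key] = value;
                    return true;
                }
            }
            // Wait for space WITHOUT holding dLock, so Remove can make progress.
            lock (addLock)
            {
                // The full/not-full re-check happens at the top of the loop;
                // the timeout bounds the wait.
                if (!Monitor.Wait(addLock, timeoutMillis))
                    return false;
            }
        }
    }

    public bool Remove(TKey key)
    {
        bool removed, wasAtCapacity;
        lock (dLock)
        {
            wasAtCapacity = d.Count >= maxSize;
            removed = d.Remove(key);
        }
        // Pulse outside dLock: addLock and dLock are never held together.
        if (removed && wasAtCapacity)
        {
            lock (addLock) { Monitor.Pulse(addLock); }
        }
        return removed;
    }
}
```

Note that releasing dLock before waiting on addLock opens a benign race: a pulse can fire between the two locks and be missed. The timeout on Monitor.Wait also covers that case, so a missed pulse costs at most one timeout interval rather than a hang.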
Very nice @watfordgnf, it seems my issue is resolved, at least in my simple example. I will try it out with our real system, see if any problems arise, and report back here afterwards.
@johnsusi, would love the feedback - much appreciated!
Everything seems to be working now, so a big thumbs up here.
Thanks for checking @johnsusi!
First a big thanks for all your hard work. It is very appreciated!
We are having some issues in a system that publishes a lot of data (lots of small messages, < 1k each).
I have narrowed it down to multithreading and BlockingCollection. My guess is that the issue arises from Remove and waitForSpace taking the locks in different orders.
Here is a small test that, on my computer, hits the deadlock every time, although timing is probably a factor here.
The server was run using
nats-streaming-server -p 4222 -st file -dir $PWD/data
on a MacBook Pro and on a Windows 10 PC.
I tried looking into fixing the problem myself, but have made no progress so far.
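The lock-order inversion suspected above can be sketched as follows. This is illustrative only: the lock names follow the snippets in this thread, and Monitor.TryEnter with a timeout is used in place of a plain lock() so the demo reports the problem instead of hanging:

```csharp
using System;
using System.Threading;

// Two threads each take one lock and then request the other, in opposite
// orders. With plain lock() statements both would block forever; TryEnter
// with a timeout lets the demo detect the inversion and return.
class LockOrderDemo
{
    static bool DetectInversion()
    {
        object dLock = new object(), addLock = new object();
        var holdsD = new ManualResetEventSlim();
        var holdsAdd = new ManualResetEventSlim();
        bool inverted = false;

        var t1 = new Thread(() =>
        {
            lock (dLock)                // first lock: dLock
            {
                holdsD.Set();
                holdsAdd.Wait();        // ensure the other thread holds addLock
                if (!Monitor.TryEnter(addLock, 200)) // plain lock() would hang
                    inverted = true;
                else
                    Monitor.Exit(addLock);
            }
        });

        var t2 = new Thread(() =>
        {
            lock (addLock)              // first lock: addLock (opposite order!)
            {
                holdsAdd.Set();
                holdsD.Wait();          // ensure the other thread holds dLock
                if (!Monitor.TryEnter(dLock, 200))
                    inverted = true;
                else
                    Monitor.Exit(dLock);
            }
        });

        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        return inverted;
    }

    static void Main()
    {
        Console.WriteLine(DetectInversion()
            ? "lock-order inversion detected"
            : "no inversion");
    }
}
```

Because the events guarantee each thread holds its first lock before either requests the second, neither TryEnter can succeed and both time out, which is exactly the deadlock shape, just with a bounded wait instead of a hang.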