nats-io / stan.net

The official NATS .NET C# Streaming Client
Apache License 2.0
137 stars 41 forks source link

Deadlock in BlockingDictionary #175

Closed johnsusi closed 4 years ago

johnsusi commented 4 years ago

First a big thanks for all your hard work. It is very appreciated!

We are having some issues in a system which publishes a lot of data (lots of small messages < 1k).

I have narrowed it down to multithreading and BlockingCollection. My guess is that the issue arises from Remove and waitForSpace takes the locks in different order.

Here is a small test that, on my computer, hits the deadlock every time. But timing is probably a factor here.

        public static void DeadlockTest()
        {
          var options = StanOptions.GetDefaultOptions();
          options.MaxPubAcksInFlight = 100;
          using var connection = new StanConnectionFactory().CreateConnection("test-cluster", "test-client", options);

          var subject = Guid.NewGuid().ToString();
          var payload = new byte[1024];

          long count = 0;

          Action action = () =>
          {
            while (true)
            {
              connection.Publish(subject, payload, (source, args) =>
              {
              });
              Interlocked.Increment(ref count);
            }
          };

          Func<Task> report = async () =>
          {
            var sw = new Stopwatch();
            sw.Start();
            while (true)
            {
              await Task.Delay(1000);
              var _count = Interlocked.Read(ref count);
              Console.WriteLine($"Publish {_count} in {sw.ElapsedMilliseconds} ms. {1000*_count/sw.ElapsedMilliseconds} msg/sec.");
            }
          };

          Task.WaitAll(
            Task.Run(action),
            Task.Run(action),
            report()
          );

        }

Server was run using

nats-streaming-server -p 4222 -st file -dir $PWD/data

on a macbook pro and on a windows 10 pc.

I tried looking into fixing the problem myself but still no progress.

watfordgnf commented 4 years ago

I'm able to remove the deadlock with the following change:

        internal bool Remove(TKey key, out TValue value, int timeout)
        {
            bool rv = false;
            bool wasAtCapacity = false;

            value = default(TValue);

            lock (dLock)
            {
                if (!finished)
                {
                    // check and wait if empty
                    while (d.Count == 0)
                    {
                        if (timeout < 0)
                        {
                            Monitor.Wait(dLock);
                        }
                        else
                        {
                            if (timeout > 0)
                            {
                                if (Monitor.Wait(dLock, timeout) == false)
                                {
                                    throw new Exception("timeout");
                                }
                            }
                        }
                    }

                    if (!finished)
                    {
                        rv = d.TryGetValue(key, out value);
                    }
                }

                if (rv)
                {
                    wasAtCapacity = d.Count >= maxSize;
                    d.Remove(key);
                }
            }

            // wasAtCapacity is updated inside a lock
            // so move waiting on addLock outside of dLock
            if (wasAtCapacity)
            {
                lock (addLock)
                {
                    Monitor.Pulse(addLock);
                }
            }

            return rv;

        } // get
watfordgnf commented 4 years ago

I'm evaluating whether other changes may be more appropriate, such as not holding StanConnection.mu when calls to BlockingDictionary.Remove are made.

ColinSullivan1 commented 4 years ago

@watfordgnf , really appreciate you taking a look at this. Thanks!

johnsusi commented 4 years ago

Wow that was quick, thanks :)

We solved it temporarily by replacing the wait for space with a simple sleep/retry loop

               while (!pubAckMap.TryAdd(guidValue, a))
                {
                    var bd = pubAckMap;

                    Monitor.Exit(mu);
                    // Wait for space outside of the lock so 
                    // acks can be removed.
                    // bd.waitForSpace();
                    Thread.Sleep(10);
                    Monitor.Enter(mu);

                    if (nc == null)
                    {
                        throw new StanConnectionClosedException();
                    }
                }

It will not affect performance since this only happens when we are writing to fast anyway.

Looking forward to trying out your changes.

watfordgnf commented 4 years ago

@johnsusi I applied a very similar fix as well, such that it will not wait for space in the pubAckMap any longer than the ping interval, which fixes a second deadlock seen with unexpected connection failures. I removed the deadlock you saw by ensuring addLock and dLock are never held under the same path.

Another interesting bit is pubAckMap.Remove no longer needs protection under mu, removing a possible point of contention under load.

johnsusi commented 4 years ago

Very nice @watfordgnf, seems my issue is resolved at least on my simple example. Will try out with our real system and see if any problems arise and report back here after.

ColinSullivan1 commented 4 years ago

@johnsusi, would love the feedback - much appreciated!

johnsusi commented 4 years ago

Everything seems to be working now so big thumbs up here.

ColinSullivan1 commented 4 years ago

Thanks for checking @johnsusi!