Fix Lost Spans on Close

phxnsharp commented 4 years ago

Which problem is this PR solving?

Fixes #181

Short description of the changes

Changes InMemorySender to reproduce the lost spans problem. Then fixed the lost spans problem. A number of minor mistakes in the threading was causing both thread pool exhaustion through long lived calls on the thread pool, and thread leaks from previous tests.

phxnsharp commented 4 years ago

I would prefer keeping the ProcessQueueLoop async if possible and checking if the executed command can be made non-blocking and also really async. This should also improve the SDK as a whole.

Using SemaphoreSlim to make this completely asynchronous should solve the problem.

Thanks for your review! This is great stuff! Your points are totally valid. Let me clarify some of what I discovered while writing this:

SemaphoreSlim - I actually tried this first, before hiding ManualResetEventSlim.Wait behind a long lived Task. Unfortunately SemaphoreSlim doesn't work the same as a ManualResetEventSlim. The former is for contention handling (like a mutex), the latter is for event signalling. With some effort we could probably convince SemaphoreSlim to work like a simple event signal, but it is awkward and bug prone. You can't arbitrarily release another thread without taking ownership on this thread. The advantages of WaitAsync didn't seem worth it to me for the disadvantages of misusing the wrong threading primitive. Note that there are third party solutions out there for doing WaitAsync with an event like object. As this is only used in tests, never in production, those felt like overkill for this problem.

One "middle of the road" approach would be to encapsulate the concept of WaitAsync into a new ManualResetEventSlim extension method. For now it can use my long running Task solution. If an appropriate third party or Microsoft provided solution becomes available it would be easy to swap in.

ProcessQueueLoop async - The problem is not the ManualResetEventSlim, but the BlockingCollection. Making ProcessQueueLoop asynchronous cannot be safely done without writing a BlockingCollection that is asynchronous. As this is production code, not test code, that would be under much higher scrutiny and have a much higher risk of introducing other issues. Without writing or finding an asynchronous BlockingCollection, this is the only safe way to ensure that the blocking calls (implicit in the foreach statement) are on a long-lived thread and don't accidentally end up on the thread pool. It probably would be worthwhile to explain this in comments in the code to prevent someone from attempting to fix it without finding an asynchronous BlockingCollection.

phxnsharp commented 4 years ago

I'm more than happy to get on a call at some point if it would help our understanding of each other! Just let me know.

phxnsharp commented 4 years ago

If you want to use it, here is an AsyncManualResetEvent. I have no idea if it is "good". https://github.com/StephenCleary/AsyncEx/wiki/AsyncManualResetEvent

phxnsharp commented 4 years ago

The same package has a potential candidate for an async BlockingCollection: https://github.com/StephenCleary/AsyncEx/wiki/AsyncCollection

That package is MIT license. You are Apache-2. I don't know what your rules are for incorporation.

Falco20019 commented 4 years ago

Thanks for your input. I just recently used ConcurrentQueue<T> as asynchronous blocking collection. According to https://apisof.net/catalog/System.Collections.Concurrent.ConcurrentQueue%3CT%3E it also is available for our targets. So this might also help. So this might solve some of our problems too! I need some more time the next days to look into it. If you have time, feel free to check if it would help us. As it's part of the framework, it would also not create any licensing topics to analyze.

phxnsharp commented 4 years ago

I just recently used ConcurrentQueue<T>

BlockingCollection is actually built on top of ConcurrentQueue: https://referencesource.microsoft.com/#system/sys/system/collections/concurrent/BlockingCollection.cs. It uses SemaphoreSlim to implement the blocking aspects, so it potentially could be turned into an Async version with relatively little pain. Except the SemaphoreSlim objects are private so the only way to implement it would be to copy-pasta the whole thing and modify :(

phxnsharp commented 4 years ago

@Falco20019 This is serious issue for us as it occurs very frequently and looks really bad to the user. My patch, while not the ultimate architecturally correct solution, does not make anything worse than it already was. In fact, the solution given prevents blocking calls on thread pool threads in a way that is easy to understand and won't be hard to change later when an asynchronous blocking collection is created or found. Our next release is planned for later this year. Is there any chance this could be merged in time for us to use it for then? Thanks and sorry to be a pain!

Falco20019 commented 4 years ago

@phxnsharp Sorry for the delays. I‘m currently busy with some other work that I need to get done by the end of this week. I try to give it a look as soon as possible. There is another ticket at #185 that requires a 0.4.1, so I will move this to that version too.

Falco20019 commented 4 years ago

Thanks for your patience. I decided to merge this work-around to resolve your issue for now. I want to give it another look before 1.0 though.

You can reference the new version released as 0.4.1

phxnsharp commented 4 years ago

Thank you!

phxnsharp commented 4 years ago

@Falco20019 I have not tried these, but ran across a couple of potential nice solutions to fix this the right way:

The class BufferBlock from https://www.nuget.org/packages/System.Threading.Tasks.Dataflow/

The new System.Threading.Channels package: https://devblogs.microsoft.com/dotnet/an-introduction-to-system-threading-channels/

jaegertracing / jaeger-client-csharp

Fix Lost Spans on Close #182

Which problem is this PR solving?

Short description of the changes