serilog / serilog-sinks-periodicbatching

Infrastructure for Serilog sinks that process events in batches.

Dispose can cause application to hang #21

Closed · jsearles closed this issue 5 years ago

jsearles commented 6 years ago

I've been testing failure scenarios with a sink that relies on this base class. What I've found is that if the queue is still failing to send events during disposal, the application hangs for an extended period while the sink continually backs off and retries sending the messages. The retry and backoff settings are not exposed, so there is no way to control this. In normal use the backoff/retry settings in place are fine, but during disposal we would want to fail more quickly to prevent the application from hanging. Perhaps exposing a timeout duration for disposal would be a simple approach that could resolve the issue at hand. What do you think?

nblumhardt commented 6 years ago

Hi, thanks for the note!

The Dispose() method calls CloseAndFlush() directly, which itself directly calls OnTick(); it's only the timer reset at the end of OnTick() that's impacted by exponential backoff/retry as far as I recall. Could the pause be elsewhere?

Any help tracking this down to a precise call graph/path would be appreciated; the sink is definitely designed with the intention of avoiding this kind of behavior :+1:

Thanks! Nick

jrsearles commented 6 years ago

Thanks for the reply - I'll see if I can get a reproduction to narrow it down.

jsearles commented 6 years ago

OK - I've been unable to replicate this in isolation, and looking at how disposal is handled here, I believe the problem is more likely in our code. Thanks for your time.

jsearles commented 6 years ago

I've figured out the cause of the hang: if queued messages are still being processed during shutdown and those messages are failing, the retry logic continues, causing the application to hang until the retries are exhausted.

skomis-mm commented 6 years ago

@jsearles can you provide a repro example? I could not reproduce this. Disposing the sink turns off the retry logic, so when an exception is thrown the queue stops processing. Is there some custom retry logic inside the overridden EmitBatchAsync(..)?
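
For context, the kind of custom retry logic being asked about would look something like the hypothetical sketch below: a subclass of PeriodicBatchingSink whose overridden EmitBatchAsync retries on its own. None of this comes from any real sink - SendToServerAsync and the retry limits are invented - but a loop like this keeps retrying even while the sink is being disposed, which would produce exactly the hang described above.

    // Hypothetical example only - not a real sink. A custom retry loop inside
    // EmitBatchAsync keeps running during Dispose(), because the base class has
    // no way to cancel it, so disposal waits for every attempt to finish.
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Serilog.Events;
    using Serilog.Sinks.PeriodicBatching;

    class MyHttpSink : PeriodicBatchingSink
    {
        public MyHttpSink() : base(100, TimeSpan.FromSeconds(2)) { }

        protected override async Task EmitBatchAsync(IEnumerable<LogEvent> events)
        {
            for (var attempt = 0; ; attempt++)
            {
                try
                {
                    await SendToServerAsync(events);   // stand-in for a real network call
                    return;
                }
                catch when (attempt < 10)
                {
                    // Custom exponential backoff that ignores disposal entirely.
                    await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
                }
            }
        }

        static Task SendToServerAsync(IEnumerable<LogEvent> events) => Task.CompletedTask;
    }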

jsearles commented 6 years ago

OK - I'm able to reproduce it reliably in a sample app. It looks like a deadlock is occurring here. I'll put together a repo I can share.

nblumhardt commented 6 years ago

@jsearles great! Sounds like we're closing in on it.

Is this a publicly-available sink, or a custom/internal one?

gregpakes commented 5 years ago

Is there any news on this? I am getting periodic thread exhaustion in my application, and it points to the same issue... I have 514 threads all blocked with the following stack trace:

[screenshot: stack trace of the blocked threads]

skomis-mm commented 5 years ago

Hi, @gregpakes. Can you give some more detail on this?

  1. Which sink are you using?
  2. If it's a custom sink, what does its implementation look like (the overridden EmitBatchAsync method)?
  3. What does the logging setup look like?
  4. Is it possible to get a minimal repro example?

Thanks!

gregpakes commented 5 years ago

Hi,

  1. This is being used by the Seq sink.
  2. It is not custom (https://github.com/serilog/serilog-sinks-seq)
  3. Logging setup looks like this in our IoC container:

     init.For<ILogger>()
         .Use(ctx =>
             {
                 var seqServer = WebConfigurationManager.AppSettings["serilog:write-to:Seq.serverUrl"];
                 var seqApiKey = WebConfigurationManager.AppSettings["serilog:write-to:Seq.apiKey"];
                 var appService = ctx.GetInstance<IApplicationService>();

                 var logger = new LoggerConfiguration()
                     .WriteTo.Seq(seqServer, apiKey: seqApiKey)
                     .Enrich.WithProperty("Version", appService.GetVersion())
                     .Enrich.WithProperty("Client", appService.GetClientName())
                     .CreateLogger();

                 return logger;
             });

  4. I think this will be tricky. We are running this on hundreds of sites and it crops up very rarely.

skomis-mm commented 5 years ago

@gregpakes, thanks for the info; it seems you are using StructureMap. Can you check whether this registration is actually a singleton? As a rule, the Logger should be created once. If the registration is transient, each resolve operation will create a new instance of the logger, with its own sinks (and timers), that won't be correctly disposed.
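
For instance, assuming StructureMap's Singleton() lifecycle method, the registration above could be made a singleton along these lines (a sketch only, reusing the factory from the previous comment):

    // Sketch: same factory as above, but registered as a singleton so the logger
    // (and its sinks and timers) is created once and disposed once.
    init.For<ILogger>()
        .Singleton()
        .Use(ctx =>
            {
                var seqServer = WebConfigurationManager.AppSettings["serilog:write-to:Seq.serverUrl"];
                var seqApiKey = WebConfigurationManager.AppSettings["serilog:write-to:Seq.apiKey"];
                var appService = ctx.GetInstance<IApplicationService>();

                return new LoggerConfiguration()
                    .WriteTo.Seq(seqServer, apiKey: seqApiKey)
                    .Enrich.WithProperty("Version", appService.GetVersion())
                    .Enrich.WithProperty("Client", appService.GetClientName())
                    .CreateLogger();
            });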

gregpakes commented 5 years ago

Thanks - I will take a look. It doesn't look like a singleton at the moment.

nblumhardt commented 5 years ago

Closing this, assuming that the problem was container configuration; but let us know if not, folks :-) Cheers!

kosmakoff commented 4 years ago

Hi. I'd like to report that this is still an issue, at least for my configuration.

I am using the AWS CloudWatch sink, which is based on this project, and sometimes during automated tests the same deadlocking behavior occurs.

The Serilog logger is registered with the .NET Core DI container as scoped. It is important that it is scoped, as it is being used in AWS Lambdas, where the container is reused across requests, and a new ILogger instance is resolved per request so that I have control over its disposal - I need to be sure that all buffered log messages get written to AWS.

skomis-mm commented 4 years ago

Hi, @kosmakoff. I would not recommend scoping ILogger instances. If you only need to ensure that every message gets written to AWS, you can flush events at the end of the application lifetime (or the test run, in your case) via Log.CloseAndFlush or ILogger.Dispose.
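
A minimal sketch of that approach, assuming the standard static Log class (the CloudWatch sink configuration is elided here and a console sink used as a stand-in):

    using Serilog;

    // Configure one logger for the whole process or test run...
    Log.Logger = new LoggerConfiguration()
        .WriteTo.Console()   // stand-in; the CloudWatch sink would be configured here
        .CreateLogger();

    try
    {
        Log.Information("Handling requests...");
        // ... handle requests / run the tests ...
    }
    finally
    {
        // ...and flush any buffered batches and dispose the sinks exactly once at the end.
        Log.CloseAndFlush();
    }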

There are several problems with scoping ILogger instances per request:

  1. The I/O sinks that utilize PBS are designed to be singletons per endpoint. If you look at the implementation of the AWS CloudWatch sink, you will see it has its own buffering, limiting and retry logic. Scoping ILoggers per request results in parallel logging pipelines, each with its own state and logic around it.
  2. When you dispose the logger, the actual disposal is synchronous. But since the sink's flushing is an I/O call, disposal can be slow. And when it is slow, and it happens for every request, and every request is handled by a thread-pool thread, it is very easy to get thread starvation that leads to hangs/timeouts (not deadlocks). This side effect is global and affects the entire application, which is very bad.

If it really is a deadlock, it would be great if you could provide more info such as stack traces of the hung threads, the configuration, etc., or a minimal project that can help to reproduce this. Thanks!

nblumhardt commented 4 years ago

> or ILogger.Dispose

Just a tiny clarification: @skomis-mm means Logger.Dispose here.