googleapis / google-cloud-dotnet

Google Cloud Client Libraries for .NET
https://cloud.google.com/dotnet/docs/reference/
Apache License 2.0

Thread starvation on ASP.NET Core from logging #7177

Closed: ghost closed this issue 2 years ago

ghost commented 3 years ago

Environment details

Steps to reproduce

  1. use the official way to hook up an ASP.NET 5 app to Google diagnostics
  2. log intensively

Hi fine folks. Using the official ASP.NET Core diagnostics package, I'm getting thread starvation due to the way logging uses synchronous semaphores to protect the buffer.

The simple proof is the screenshot below, taken in Visual Studio from a dump I collected in production.

[screenshot]

Diagnostics was set up like this:

  webBuilder.UseGoogleDiagnostics(
            projectId: "",
            serviceName: "",
            serviceVersion: "",
            loggerOptions: LoggerOptions.Create(
                logLevel: LogLevel.Information,
                loggerDiagnosticsOutput: Console.Out,
                logName: logName,                         
                monitoredResource: new Google.Api.MonitoredResource
                {
                    Type = "k8s_container",
                    Labels =
                    {
                            { "project_id", ""},
                            { "location", "us-east1-b"},
                            { "cluster_name", ""},
                            { "namespace_name", "default"},
                            { "pod_name", Environment.MachineName }
                    }
                }));

I think the bug is related to the SemaphoreSlim used to protect the List that acts as a buffer. It is used in the Receive method, which is invoked from the common GoogleLogger.

I know it's protecting a small operation over an in-memory data structure, but it does mess up my process. Why aren't the .NET concurrency-friendly collections used instead of manual synchronization over non-concurrent collections?
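
For illustration, here is a rough paraphrase of the pattern I mean (not the library's actual code; the type and member names are made up):

using System.Collections.Generic;
using System.Threading;

// Sketch only: every log call waits synchronously on a SemaphoreSlim before
// appending to a plain List<T> buffer, so under heavy logging many request
// threads can end up blocked here.
public class SemaphoreGuardedBuffer<T>
{
    private readonly SemaphoreSlim _mutex = new SemaphoreSlim(1, 1);
    private readonly List<T> _entries = new List<T>();

    // Invoked on the logging hot path for every entry.
    public void Receive(T entry)
    {
        _mutex.Wait(); // Synchronous wait: blocks the calling thread.
        try
        {
            _entries.Add(entry);
        }
        finally
        {
            _mutex.Release();
        }
    }
}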

Let me know if you need more info or if you believe there's something wrong with how I use the GCP libraries.

Thanks for looking into this!

amanda-tarafa commented 3 years ago

I will look at this on Monday.

amanda-tarafa commented 3 years ago

Also, if you can, please take a look at this comment on another issue. It's the only known cause of thread starvation, but it's related to how the Google credentials are obtained: https://github.com/googleapis/google-api-dotnet-client/issues/1807#issuecomment-818861719

ghost commented 3 years ago

Thanks for looking into this @amanda-tarafa

I only use 2 GCP services from my app: the diagnostics suite and object storage. For object storage I have async initialization (StorageClient.CreateAsync()) in Main, between Build and Run.

For now I've removed GCP diagnostics to keep the service alive, but I would like to add it back. I like it a lot.

Before switching it out, I analyzed a few other dumps, and all were stuck in logging, waiting for that semaphore.

I'm also setting the minimum worker threads via the ThreadPool API SetMinThreads (not touching IOCP threads, since I'm on Linux).
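
For reference, a minimal sketch of that thread pool tweak (the numbers are illustrative, not my actual values):

using System.Threading;

// Read the current minimums, then raise only the worker thread minimum;
// IOCP (completion port) threads are left as they are.
ThreadPool.GetMinThreads(out int workerMin, out int iocpMin);
ThreadPool.SetMinThreads(200, iocpMin); // 200 is an illustrative value.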

amanda-tarafa commented 3 years ago

Can I ask whether what you are seeing is the same thread(s) never getting a chance to log, or just many different threads competing for a single resource? Can you see in the dumps how long a thread has been stuck there? Also, when you say "log intensively", can you give numbers? How many threads are trying to log at the same time, and at what frequency? I want to try and reproduce.

I'm pretty sure there's no deadlocking bug or anything similar, just that this is a "scarce" resource. I'll look at changing the implementation to use one of the concurrent collections, and that might improve things some, but still, this will be a bottleneck and I'm not sure how we can get around that (without major library rewriting). Notice that making the operations async is not an option because the Microsoft.Extensions.Logging.ILogger interface does not define an async overload.

ghost commented 3 years ago

Hi @amanda-tarafa

I only have mini dumps, as full dumps killed the pod; I'll try to analyze them later when I get some time. What I have observed is:

  1. pods with high thread count, up to tens of thousands (I know it's crazy, but I've seen it); to be honest, for me even a few hundred is a signal that something is wrong. I've seen the service become seriously degraded after 1k.
  2. I think it's multiple threads that cannot log, not a few well-known threads. All my code is async, not manipulating threads explicitly.

Regarding numbers, I do expect to have a few thousand log writes per second.

The execution environment is GKE with Ubuntu OS; I am not exactly sure if this makes any difference.

amanda-tarafa commented 3 years ago

Thanks, I'll try to reproduce and work on using the concurrent collections to see if that makes any difference. I'll report here when I know more. Anything particular about your GKE configuration?

ghost commented 3 years ago

Nothing I can think of; latest GKE, Ubuntu, some daemon sets, sometimes extensive high CPU usage on nodes, max 32/64 pods per node, though I rarely see more than a few tens of pods. I'm on GCP support; I can open a case and share private info there if that helps.

amanda-tarafa commented 3 years ago

Thanks, and good to know we can get more detailed info if we need to. Let me try with what you have shared first and we can go from there.

amanda-tarafa commented 3 years ago

OK, I've managed to reproduce this. I've now set up some benchmarking so it'll be easier to see if using concurrent collections results in any improvements. If it doesn't, we'd have to think of some more drastic changes, but let's get there first.

ghost commented 3 years ago

How did you repro this? Just by logging at high scale?


amanda-tarafa commented 3 years ago

Yes, I can share the full code later, but basically:

amanda-tarafa commented 3 years ago

OK, after many trials, here are my findings:

So, again, this pretty much confirms that there's no issue with Google.Cloud.Diagnostics.AspNetCore*. I will create an issue for .NET 5 later today or tomorrow so they can look into it.

Can you please confirm on your end?

ghost commented 3 years ago

Thanks for this @amanda-tarafa !

I'll run an experiment and come back to you. It would be good to get an RCA for this; it is curious.

amanda-tarafa commented 3 years ago

Yes, I'll try to get to the bottom of it. I think it's either one of the default loggers or the agent running on the pod that collects stdout logs. I think I can easily answer that with the setup I already have.

amanda-tarafa commented 3 years ago

OK, I've gotten as far as I can go. The culprit is the Cloud Logging agent that runs by default on GKE, which captures stdout and stderr and sends them to Google Cloud Logging. Disabling the ASP.NET Core default loggers works because then there's nothing writing to stdout and stderr, so there's nothing for the agent to capture.

The trial that easily confirms that the performance issues lie with the agent is as follows.

ASP.NET Core app started:

public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureWebHostDefaults(webBuilder =>
        {
            webBuilder
            .UseStartup<Startup>()
            .ConfigureLogging(builder => builder.ClearProviders()); // Let's make sure there's no logging provider at all, so that we know they don't interfere.
        });

And then a controller action:

[HttpGet]
public long Get()
{
    var stopWatch = Stopwatch.StartNew();
    Console.WriteLine("Logging to test the agent");
    stopWatch.Stop();
    var loggingTime = stopWatch.ElapsedMilliseconds;

    return loggingTime;
}

With this app deployed in a default GKE cluster, and with a client that generates about 7k requests per second, we get the following stats.

And doing some basic structured logging, these are the stats:

In turn, if the agent itself is disabled (only for Workloads; it can remain enabled for System), then any combination of writing directly to stdout, ASP.NET Core logger providers, and Google.Cloud.Diagnostics.AspNetCore* works perfectly (0 requests fail) for a volume of 7k requests per second.

I'll create an issue for the Logging agent team and report back here when I know more.

For now, you'll either have to disable the agent (just for Workloads) or disable the ASP.NET Core Console logger.

ghost commented 3 years ago

Hi @amanda-tarafa I just pushed the code to prod yesterday and haven't observed any issues. I'll keep monitoring and let you know.

amanda-tarafa commented 3 years ago

That's good to know @CatalinAdlerDF. Do let me know if you see any issues, and I'll report back here when I hear from the Logging Agent team.

amanda-tarafa commented 3 years ago

@CatalinAdlerDF any news on your experiments?

If everything is fine, I'd like to close this issue as there's not much we can do on the repo. Of course you can leave a comment if the issues reappear or create a new issue.

I will come back here and update when I have more information from the Logging Agent team.

ghost commented 3 years ago

Hi Amanda,

I'll close this for now, though I still have some concerns about thread starvation. I need to repro and take a snapshot dump, and in case it's related to GCP, I'll reopen this.

I do suggest you document this somewhere.

Thank you for your assistance!

amanda-tarafa commented 3 years ago

Thanks!

Yes, I'll add some notes on this to the Google.Cloud.Diagnostics docs once I know more from the agent team. This might very well be due to a bug that they fix, in which case there's no need to document it.

Again, regardless of the outcome, once I know more I'll come back here.

ghost commented 2 years ago

@amanda-tarafa I still get significant blockage on logging: 90 threads waiting for that semaphore.

[screenshot]

amanda-tarafa commented 2 years ago

Can you share, just so I can try and reproduce:

ghost commented 2 years ago

Which of the workarounds are you using: a) did you disable the logging agent on the cluster, or b) did you clear the .NET logging providers?

Cleared logging providers.

What's the volume of logging requests, still around 7k per second?

Way small, maybe 100.

Are you aware of any upgrades on your cluster

I did upgrade to latest GKE.

Nothing major happened, just maybe a little bit more load. The current mitigation has been to separate load by endpoint: 1 endpoint -> 1 k8s deployment, i.e. bulkhead fault isolation. I have a few endpoints which appear to be causing this, but they are also quite simple and are some of the most active endpoints, by a significant margin.

The experiment I am running now is to switch to Serilog for only the deployment which is most affected (it had something like 30 recycles in 2h).

amanda-tarafa commented 2 years ago

Way small, maybe 100. Nothing major happened, just maybe a little bit more load.

I'm a little confused by these two statements. Before we were working on the assumption of several 1000s logging requests per second (I ran my experiments with ~7k logging requests per second). Now you say it's maybe 100 logging requests per second but that this is more load than before?

The experiment I am running now is to switch to Serilog for only the deployment

I'm not entirely certain what you mean by this? Do you have a separate app that deploys the main app, and now the deployment app is using serilog and the main app is using Diagnostics?

BTW, that the deployment phase is the most affected wouldn't surprise me, given my conclusion that this is the logging agent, especially if your workaround has been to clear the .NET providers. Many underlying systems may be writing to the console at deployment time (by removing .NET providers it's just the app that's not writing to the console, but everything else is). Since the agent remains active, it's still consuming resources, which ends up starving everything else. Maybe run an experiment by leaving in the providers and deactivating the agent?

Next week, when you have clarified these remaining details, I'll run some more tests. And I'll also follow up with the logging agent team.

amanda-tarafa commented 2 years ago

One last question on your comments:

I did upgrade to latest GKE.

So, you saw the blockage again after you upgraded to latest GKE? Before that and after implementing the workaround, everything was fine?

ghost commented 2 years ago

I'm a little confused by these two statements.

Apologies. Inbound HTTP requests are maybe in the hundreds; log requests per second are probably a lot, but I did not measure recently.

I'm not entirely certain what you mean by this? Do you have a separate app that deploys the main app, and now the deployment app is using serilog and the main app is using Diagnostics?

For the k8s deployment where I have observed most of the problems, I am switching from logging to Stackdriver using your libs to using Serilog with a Stackdriver sink. If I notice a significant difference in behavior, then the culprit is the GCP lib.

So, you saw the blockage again after you upgraded to latest GKE? Before that and after implementing the workaround, everything was fine?

I cannot confirm or deny this. I am just putting in more effort these days to fix this.

amanda-tarafa commented 2 years ago

Acked on all, thanks!

If you find Serilog works significantly better, can you please share your Serilog configuration so I can run some tests on that?

Note that a couple weeks ago I ran a simple test (re-sharing):

[HttpGet]
public long Get()
{
    var stopWatch = Stopwatch.StartNew();
    Console.WriteLine("Logging to test the agent");
    stopWatch.Stop();
    var loggingTime = stopWatch.ElapsedMilliseconds;

    return loggingTime;
}

This still yielded pretty poor results for ~7k logging requests per second:

As soon as I deactivated the logging agent, all requests were successful and 99% of requests were done in less than 2ms. Since the Diagnostics library (or any other logging library) was not involved here, this test makes me almost 100% certain that there's a performance issue with the logging agent. With the logging agent disabled, I could directly use Console.WriteLine, the MS Console logger and the Diagnostics library all together with no issue at all.

One thing you can check as well: are any of the more affected services using Console.WriteLine at all anywhere? Maybe some leftover debug code or similar? Or maybe some dependency that logs directly to the console? Again, one very quick way to test for this is to log using all default .NET providers and Diagnostics, but with the logging agent disabled. I'm pretty certain you won't see any issues there.

ghost commented 2 years ago

First overnight observation: yesterday evening (around 5pm) I switched to using Serilog with a Stackdriver sink for the top offender (the deployment which suffered the most).

So far, no issues when switching to Serilog, while I am still observing issues with the deployments left using the GCP libs directly.

So the situation is: lots of other deployments use the GCP lib directly and 1 uses Serilog with a Stackdriver sink. The direct GCP lib code is (apologies for the formatting):

hostBuilder.ConfigureLogging((context, logging) =>
    {
        logging.ClearProviders();
    })
.ConfigureWebHostDefaults(webBuilder =>
{
    var logName = "....";
    // Also set the no buffer option so that tracing is attempted immediately.
    var bufferOptions = BufferOptions.TimedBuffer(TimeSpan.FromSeconds(5));

    webBuilder.UseGoogleDiagnostics(
            projectId: ".....",
            serviceName: "....",
            serviceVersion: "vNext",
            loggerOptions: LoggerOptions.Create(
                logLevel: LogLevel.Information,
                loggerDiagnosticsOutput: Console.Out,
                logName: logName,
                bufferOptions: bufferOptions,
                monitoredResource: new Google.Api.MonitoredResource
                {
                    Type = "k8s_container",
                    Labels =
                    {
                        { "project_id", "...." },
                        { "location", "us-east1-b" },
                        { "cluster_name", "...." },
                        { "namespace_name", "default" },
                        { "pod_name", Environment.MachineName }
                    }
                }));

    webBuilder.UseStartup<Startup>();
});

The relevant Serilog code is:

hostBuilder
.UseSerilog((hostingContext, loggerConfiguration) =>
    {
        loggerConfiguration.ReadFrom.Configuration(hostingContext.Configuration);
    })
.ConfigureWebHostDefaults(webBuilder =>
{
    webBuilder.UseStartup<Startup>();
});

with the following config file

 "Serilog": {
    "Using": [ "Serilog.Sinks.GoogleCloudLogging" ],
    "MinimumLevel": "Information",
    "Enrich": [ "FromLogContext", "WithMachineName", "WithExceptionDetails" ],
    "WriteTo": [      
      {
        "Name": "GoogleCloudLogging",
        "Args": {
          "projectID": ".....",
          "useJsonOutput": "true",
          "useSourceContextAsLogName": false,
          "logName": ".....",
          "serviceName": ".....",
          "serviceVersion": "vNext",
          "useLogCorrelation": true
        }
      }
    ]
  }

I plan to switch the next top offender to using Serilog directly and figure out a way to automatically get minidumps when the thread count goes above a certain threshold.

I've reviewed the FlushableConsumerBase code and I think the way to repro the issue is high concurrency on writing log entries, in a memory-bound environment with a fragmented heap. IIRC the List doubles in size if the behind-the-scenes array cannot fit the item count. So if the runtime cannot quickly find a place to allocate a new huge array, it might trigger GC to make some room for it, which leads to pauses and maybe even failures (what if there is no room left?). And I think this is why you haven't exactly reproduced my scenario: in your sample there is no memory pressure.

Try to limit .NET to a small heap, create a use case of high concurrency of logging, make sure you log a lot at a high rate (multiple tasks, each in a tight loop maybe), and allocate randomly and absurdly.
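
Something along these lines (a rough sketch, not an exact repro; the task and iteration counts are arbitrary, and the console logger is just a stand-in for the Google Diagnostics logger):

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Sketch: many tasks log in tight loops while also allocating random-sized
// arrays to create heap pressure. Run it with a constrained heap to
// exaggerate the effect.
class LoggingPressureRepro
{
    static async Task Main()
    {
        using var factory = LoggerFactory.Create(b => b.AddConsole());
        var logger = factory.CreateLogger("repro");

        var tasks = new Task[32];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                var random = new Random();
                byte[][] garbage = null;
                for (int j = 0; j < 100_000; j++)
                {
                    // Tight-loop logging, comparable to "a few thousand log writes / sec".
                    logger.LogInformation("Entry {Number} from a tight loop", j);

                    // Random, oversized allocations to fragment and pressure the heap.
                    garbage = new byte[random.Next(1, 64)][];
                    garbage[0] = new byte[random.Next(1, 128 * 1024)];
                }
                GC.KeepAlive(garbage);
            });
        }
        await Task.WhenAll(tasks);
    }
}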

I think an easy fix is a concurrency-aware, write-optimized data structure, maybe ConcurrentQueue, or maybe not even a concurrency-aware structure. I can try to compile the code on my machine and, if successful, I can experiment.
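
As a sketch of the kind of change I mean (not a patch against the library; the names are illustrative):

using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch: writers enqueue without ever taking a lock; the timed/sized flush
// drains whatever is queued at that moment and sends it off.
public class LockFreeLogBuffer<T>
{
    private readonly ConcurrentQueue<T> _entries = new ConcurrentQueue<T>();

    // Called from the logging hot path; never blocks the calling thread.
    public void Receive(T entry) => _entries.Enqueue(entry);

    // Called by the flush path; drains the current contents of the queue.
    public IReadOnlyList<T> Drain()
    {
        var drained = new List<T>();
        while (_entries.TryDequeue(out var entry))
        {
            drained.Add(entry);
        }
        return drained;
    }
}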

amanda-tarafa commented 2 years ago

Thanks for the extensive report, I'll try with Serilog on Monday.

amanda-tarafa commented 2 years ago

I still haven't gotten around to running more tests; I will do so this week, including with Serilog, and report back (I'll share my tests and full results as well so you can try them on your own if you want). Still, I've been thinking about this over the weekend and wanted to ask for a little more information. Here's why:

Taking these things into account:

Basically, even if you don't agree with my reasoning, it would be very useful if you could share the information about service degradation rather than, or in addition to, the memory dump. Failed service requests and service request times would be great, for the four combinations: Diagnostics + .NET Logging Providers, Diagnostics - .NET Logging Providers, Serilog + .NET Logging Providers and Serilog - .NET Logging Providers.

Also, if you want to test the concurrent collection implementation, take a look at #7258. This works well, but does not improve things (as measured in failed service requests and logging times, in the presence of the logging agent), so we decided not to merge it.

ghost commented 2 years ago

Just a short answer for now. I use a liveness probe configured using AspNetCore.HealthChecks.* for MySQL and Redis. In k8s it's configured to hit the health endpoint every 60s with a 10s timeout.

livenessProbe:
  httpGet:
    path: /api/health
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10

K8s recycles the container when the probe fails (I think 3 times in a row). When a recycle occurs I get 502s on the client side, and unhappy customers.

I am observing the count of recycles.

I am also using Ben Adams' BlockingDetector for ASP.NET Core to log blocking operations.

This evidence suggests that threads are blocked and unable to handle incoming requests. Going down the rabbit hole, I ended up taking mini dumps when the thread count goes unusually high, and that is where I've observed the GCP libs being the culprit. I have empirical evidence that when the thread count goes up too much, I get liveness probe failures and recycles. I do not have load high enough to justify 100 threads (and I've actually seen 3-400 easily); the top service had under 100 inbound HTTP req/s, and all the code that fans out processing to multiple concurrent tasks in the HTTP handlers has been scoped down on degree of parallelism, to something like a DOP of 1-5, to control load on the system (MySQL fails before .NET does).

Till now, I have 2 deployments moved from the GCP libs to the Serilog setup, and since the change, 0 recycles, meaning 0 liveness probe failures. The services which still use the GCP lib directly are still suffering recycles as I type this.

Another thing I've noticed which is weird is a high degree of heap fragmentation, even 60%. I haven't seen a high count of GC events though. I am not closely monitoring these metrics, to be honest, but I think they might fit the List growth algorithm theory well.

For the Serilog tests, did you also clear the other .NET logging providers?

You have the Serilog code above; it does nothing but read from the config file which I posted above.

The issue is not how many threads are waiting, the issue is how long they wait for, and what that does to the service response times overall.

I partially agree with you; the trigger is that the wait time is too long, but the RCA is why we are even blocking threads. It's just too easy to get into problems when you have blocking on hot paths (too many stories similar to this one to count), and I think the better solution is not to tune the blocking work but to remove it altogether; it just avoids risk.

I don't think I have the time to play around with various solutions for this, but my thoughts are around why the lock is even needed (because in the end it protects the item list itself); maybe a volatile ref or a lock-free solution would be better.

I'll make one more change that might help prove this: move to a size-based buffer, on the idea that it will lead to less pressure on the memory subsystem, so maybe fewer waits.

Update 1: I realize this is not a short answer :D

amanda-tarafa commented 2 years ago

I was writing the following reply when it occurred to me to check something. Serilog ignores all other providers by default. You can see their documentation on enabling Microsoft.Extensions.Logging.ILoggerProviders. It would be super useful if you could run Serilog with writeToProviders set to true and see whether that makes the services degrade.

I'm still finishing my reply below because it still holds:

Thanks for the not-so-short answer :). All info helps, and is needed, as what I can reproduce is different from what you are seeing.

First thing is, this library is old, and definitely, if we were starting from scratch, some things would be done differently (which probably includes not using the lock). But we now have to maintain the existing code without breaking existing customers, while adding new features, etc., so code that "works" stays as is. Please rest assured that if there's an issue with the Diagnostics library we will do our best to fix it. I'm more than happy to merge #7258, but I honestly think there's an underlying issue here that's bringing out the worst part of locking.

Basically I want to find out why the wait time for obtaining the lock is so long, as it shouldn't be, because the operations inside the lock are very simple; unless there's no CPU available, or similar. In my tests, the thing causing the long wait times was the logging agent. But that doesn't seem to be your case, and I still want to know why before making changes to the library.

As for what you are seeing in your probes, that may still be consistent with Diagnostics blocking not being the issue, but just a symptom. Let's say there's something consuming lots of CPU (can you check CPU usage?).

So again, ideally we could see request times for the 4 combinations:

When using Diagnostics you also have tracing enabled, through the library, so you might have a sampling of requests already, with how long they took, or even whether they timed out. And since you are running on k8s, I think you will have some tracing information regardless of whether you are using Diagnostics or not, so for Serilog there might be something you could look at already as well.

I'll still run the tests on my side and report back. And we'll get there.

ghost commented 2 years ago

Hi @amanda-tarafa thanks for your similar-in-size answer. Before I act on it, some more observations: after I changed the buffer type to size-based, I've barely noticed any recycles (so few that I ignore them), so it does appear to have had an impact.

CPU is not used at a high rate, as far as I know.

To be honest, I got confused by your comments on Diagnostics / .NET logging providers. I need some time to digest them.

amanda-tarafa commented 2 years ago

after I changed the buffer type to size-based, I've barely noticed any recycles (so few that I ignore them), so it does appear to have had an impact

The simple explanation for this is that the time-based buffer is of course triggered by a timer; the timer triggers every 5 seconds by default. This grabs the lock to swap the list reference, releases the lock, and after releasing the lock writes the entries to the server. 5 seconds should be more than enough to swap a reference, but on a degraded system, threads might be getting stuck here (as well as in logging). This would also account for the number of stuck threads you see not matching the load you have; some of those stuck threads are the timer ones. This doesn't happen with the size-based buffer because the list reference is swapped on a request thread. We've seen this happen on environments that throttle CPU, like Cloud Run, and we've added a note (second note top down) to our docs about that. I wonder, have you set CPU limits on your pods? I didn't think of this earlier, but if you set low CPU limits on your pods, then that might very well be part of the problem. There's a Medium article describing some of this; in particular look for "Checking the throttling rate of your pods". If throttling is high enough, then that would explain (almost) everything.
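
In rough pseudo-code, the flow I'm describing looks like this (a simplified sketch, not the library's exact implementation):

using System;
using System.Collections.Generic;
using System.Threading;

// Sketch: a timer fires periodically, holds the lock only long enough to swap
// the buffer reference for a fresh list, and then sends the swapped-out
// entries to the server outside the lock.
public class TimedBufferSketch
{
    private readonly object _lock = new object();
    private List<string> _buffer = new List<string>();
    private readonly Timer _timer;

    public TimedBufferSketch(TimeSpan period) =>
        _timer = new Timer(_ => Flush(), null, period, period);

    // Logging hot path: append under the lock.
    public void Receive(string entry)
    {
        lock (_lock)
        {
            _buffer.Add(entry);
        }
    }

    private void Flush()
    {
        List<string> toSend;
        lock (_lock) // Held only for the reference swap.
        {
            toSend = _buffer;
            _buffer = new List<string>();
        }
        Send(toSend); // The network write happens outside the lock.
    }

    private void Send(List<string> entries)
    {
        // Placeholder for writing the batch to the logging backend.
    }
}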

It could even explain why removing the .NET logging providers works for me but not for you. My application is very light and I removed the "only other" thing that might be consuming CPU and working towards those limits, whereas your application is heavier and you continue to use MySQL, Storage, etc.

(let me know if you want me to go into more detail about the .NET Logging providers, etc.)

ghost commented 2 years ago

I don't think it's like that. I think it's more like: with a time-based buffer, if you have tons of logs in between timer callbacks, then you get into the List growth algorithm, which is double-in-size, an operation that depends on the state of the heap; I used the 5s buffer. I have 16 CPU as a limit. But AFAIK this is not how k8s uses the limits; the limits are not hard limits, they just affect the priority of the kill algorithm: pods which use more than their defined limits will be killed before pods which are under their limits.

amanda-tarafa commented 2 years ago

Hmmm, yes, that is happening, but I didn't see any of that being a problem in my tests, with 7k log requests/second, which means 35k log requests in between timers, which means that the list grows several times for me as well.

For CPU limits I didn't mean total CPUs, but allocation per node: [screenshot]

I'm not an expert in k8s, to be honest, but I will check throttling when running my tests and see if there's something there on my side.

ghost commented 2 years ago

which means that the list grows several times for me as well.

It needs to grow while the heap is under pressure; the point is that allocating a doubled-in-size array takes a long time in that state.

In your example on CPU limits, it reads like this: "213 mCPU have been reserved and 940 are available for reservation". With how containers work, that has nothing to do with how much CPU a pod can actually use; you can reserve 1 CPU and use 4 if they are available.

I use e2 machines with a few CPUs today (2 or 4, I think).

amanda-tarafa commented 2 years ago

From the Kubernetes docs: How Pods with resource limits are run

  • The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time in microseconds that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.

... A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
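
(As a worked example of that conversion, with an illustrative value rather than one from this thread: a limit of 500m becomes 500 × 100 = 50,000 µs, i.e. 50 ms of CPU time per 100 ms window; a container that needs more CPU within a window is paused until the next window.)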

From Google docs on Kubernetes best practices (In the section Container settings -> CPU):

It’s when it comes to CPU limits that things get interesting. CPU is considered a “compressible” resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container. This means the CPU will be artificially restricted, giving your app potentially worse performance! However, it won’t be terminated or evicted. You can use a liveness health check to make sure performance has not been impacted.

So, I'm pretty sure your services' threads are being throttled. And that actually matches my theory and your observation:

after I changed the buffer type to size-based, I've barely noticed any recycles (so few that I ignore them), so it does appear to have had an impact.

If it was the list resizing on an under-pressure heap (the outstanding question being why the heap is under pressure in the first place) that makes the lock operations take long and the threads get stuck on the locks, then changing to size-based would have entirely fixed the issues, i.e. no recycles at all. Your liveness health check is still failing occasionally because there's something else going on, and that is threads being throttled, because you are probably at times consuming more CPU than what is specified in the limit. And precisely the liveness health check is the one that's recommended for checking whether you are being throttled and your performance impacted.

In my case, disabling the .NET log providers or the logging agent gave me back enough CPU so that I wasn't throttled again. In your case, since you continue to do some other heavy operations, it's not enough to clear the providers; you may need to set the limits higher.

Mind you, I can very much see how, given that threads are being throttled, Diagnostics would make things worse, especially the timed buffer, because of all the timer threads that would get throttled themselves and end up starving the thread pool. But there needs to be something else going on. This won't happen on its own. The list sizing, at least with the volumes we are talking about, can hardly be the main culprit here, and probably won't even lead on its own to the heap fragmentation you are seeing.

Sorry to keep insisting, I just want things to really add up, and right now, they don't fully for me.

A few more questions:

hostBuilder
.UseSerilog((hostingContext, loggerConfiguration) =>
    {
        loggerConfiguration.ReadFrom.Configuration(hostingContext.Configuration);
    }, writeToProviders: true)
.ConfigureWebHostDefaults(webBuilder =>
{
    webBuilder.UseStartup<Startup>();
});
ghost commented 2 years ago

A quick note: k8s has moved away from using Docker as a container engine. The second observation is maybe valid, though I am somewhat doubtful. I am pretty sure that the container is not killed when going over the limit; it depends on context.

Though AFAIK the nodes where my code is executing are not seeing lots of CPU usage; I was not able to observe high CPU usage, just high heap fragmentation.

ghost commented 2 years ago

why the heap is under pressure in the first place

Regular allocations during request processing and whatever other things are happening there. Keep in mind .NET has heuristics to size the heap based on container requests and limits.

would have entirely fixed the issues

I don't think 1-2 recycles in a period where I used to have tens count for much. There is other code there as well; it can hit another problem as well.

Is the Serilog pod being recycled at all?

No recycles on anything moved to Serilog, and only 2 recycles when using the GCP lib; previously I had tens of recycles per day.

Can you configure serilog as follows and report back on recycles and/or health check failures?

I'll make the change today and we'll have some observations by tomorrow.

amanda-tarafa commented 2 years ago

A quick note: k8s has moved away from using Docker as a container engine. The second observation is maybe valid, though I am somewhat doubtful. I am pretty sure that the container is not killed when going over the limit; it depends on context.

Yep, I don't think the container is being killed; I think threads are being throttled, which leads to health check failures, which leads to recycles.

only 2 recycles when using the GCP lib

Is this with the timed buffer or the sized buffer? Are the .NET logging providers cleared or not? I'm very interested in knowing whether the Diagnostics configuration that you pasted in this comment https://github.com/googleapis/google-cloud-dotnet/issues/7177#issuecomment-955169695 made any significant difference with respect to not having cleared the default providers. For instance, after that change you mentioned that you had some top offenders; does this mean that before that change, more services were affected?

Though AFAIK the nodes where my code is executing are not seeing lots of CPU usage; I was not able to observe high CPU usage, just high heap fragmentation.

And you wouldn't notice high CPU usage in the "normal" sense; the limit is time-based and relative to the allocatable CPU. Can you take a look at this graph for both the Serilog and Diagnostics deployments you have currently running, and then at the Serilog deployment that you are changing today? In my case, I see degradation as soon as I start going over 1.

[screenshot]

ghost commented 2 years ago

Hi @amanda-tarafa I just pushed the change recently; I'll get back with observations when I have some.

Regarding throttling, keep in mind I have 16 set as the limit on a 2-core machine, so I do not see how I can be subject to limit-based throttling.

Also, before I pushed the Serilog change just now, I had 0 recycles for most deployments except for one (which uses the GCP libs), which had a few, but way fewer than it used to have with the timed buffer.

Is the Serilog pod being recycled at all? Are health checks there still failing occasionally, even if not 3 times in a row?

I haven't noticed a recycle on the pods that use Serilog.

When you moved to Diagnostics but cleared the default logging providers

If by this you mean when I commented out the following, then I can say I haven't noticed a positive change.

 //  logging.AddConfiguration(context.Configuration.GetSection("Logging"));
//   logging.AddDebug();
//   logging.AddEventSourceLogger();

I could probably confirm my heap fragmentation/allocation hypothesis using metrics: correlate recycles with GC events or fragmentation numbers; but I need time to export the metrics and then figure out a way to correlate them.
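
One possible way to export those numbers (a sketch only; the 30-second interval and writing to stdout are arbitrary placeholders, and GC.GetGCMemoryInfo() is available on the .NET 5 runtime used here):

using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: periodically sample GC collection counts and heap fragmentation so
// they can later be correlated with pod recycles. Writing to stdout is just a
// placeholder; any metrics exporter would do.
public static class GcMetricsSampler
{
    public static async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var info = GC.GetGCMemoryInfo();
            Console.WriteLine(
                $"gen0={GC.CollectionCount(0)} gen1={GC.CollectionCount(1)} gen2={GC.CollectionCount(2)} " +
                $"heapBytes={info.HeapSizeBytes} fragmentedBytes={info.FragmentedBytes}");
            try
            {
                await Task.Delay(TimeSpan.FromSeconds(30), token);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
    }
}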

amanda-tarafa commented 2 years ago

Thanks for the push, let's see what happens.

And this is what I was interested in: seeing whether it had a positive effect when you cleared all the default logging providers, i.e. this:

hostBuilder.ConfigureLogging((context, logging) =>
    {
        logging.ClearProviders();
    })
.ConfigureWebHostDefaults(webBuilder =>
{
    var logName = "....";
    // Also set the no buffer option so that tracing is attempted immediately.
    var bufferOptions = BufferOptions.TimedBuffer(TimeSpan.FromSeconds(5));

    webBuilder.UseGoogleDiagnostics(
            projectId: ".....",
            serviceName: "....",
            serviceVersion: "vNext",
            loggerOptions: LoggerOptions.Create(
                logLevel: LogLevel.Information,
                loggerDiagnosticsOutput: Console.Out,
..........

Actually, there might be a faster way to test whether you are being throttled or not. If you are willing, we can disable throttling altogether (I just found out about this) and check what happens with the worst offender of all.

  1. Set the kubelet option cpuCFSQuota to false for one of your nodes.
  2. Deploy your worst offender there with the original configuration, i.e. using the Diagnostics (GCP) libraries without clearing the providers, etc.
  3. If it works well, or even if it works better, we will know you were being throttled.
  4. You can then check the metric I shared earlier to see your CPU consumption peaks, set the pod limits to maybe double that, and restore the config option. (I will do this as well in my tests and report back.)
ghost commented 2 years ago

I do not see how the pod CPU limit can play a part, since I am using 16 CPU as the limit.

The CPU allocatable utilization metric never went close to 1.

[screenshot]

Since making the Serilog change this morning, no recycles, while I had recycles on the pod which uses the GCP libs directly.

If it helps you, I am on GCP support, so you might be able to look at the cluster/metrics directly.

amanda-tarafa commented 2 years ago

If it helps you, I am on GCP support, so you might be able to look at the cluster/metrics directly.

This will greatly help; I could see all the metrics and I could also set up my environment to closely mimic yours. If you can open a case, link to this issue and ask them to assign it to me, that'd be great.

Again, what I'm currently seeing on my side has nothing to do with what you are seeing, and I'm indeed affected by throttling, as I was able to confirm by doing steps 1 to 4 from my last comment; I ran this twice, with the same result both times. If I disable throttling or increase my allocatable CPU, all the issues I see disappear.

If you are able and want to try the Diagnostics (GCP) libraries directly with throttling disabled and check for improvements, that would be great as well. I do understand from your graph, though, that you haven't reached the conditions for throttling to happen, and that makes my theory less likely in your case.

16 CPU as the limit

I know you've said this many times, but I'm guessing this is the total CPUs for your cluster? Notice that the limit is in milliCPUs, so for it to really be 16 CPUs, in the image below, under CPU allocatable you would have to see 16000 mCPU. Is that what you see? [screenshot]

ghost commented 2 years ago

No, the deployments are configured with 16 CPU as the limit and 100m as the request; and today I am running these pods on nodes with 2 CPUs.

[screenshot]

I'll open a case and you can look for yourself.

amanda-tarafa commented 2 years ago

I can confirm that the case is being routed to me, I'll start looking in detail tomorrow.

ghost commented 2 years ago

Thank you, Amanda.

ghost commented 2 years ago

@amanda-tarafa current status:

With a node pool that does not enforce limits (cpuCFSQuota = false, confirmed from logs) and with my original code (timed buffer and with the Diagnostics loggers as well), I get tons of recycles.

The deployments which use Serilog have 0 recycles to date.

The last test I am doing now is to keep Diagnostics logging with a timed buffer.

amanda-tarafa commented 2 years ago

With a node pool that does not enforce limits (cpuCFSQuota = false, confirmed from logs) and with my original code (timed buffer and with the Diagnostics loggers as well), I get tons of recycles.

This fully rules out throttling then. I still don't know why I can't reproduce, but that's beside the point.

So, I'm proposing we do as follows, if you are able:

If you are OK with this, let me know, and my advice would be that, in preparation, you upgrade to V4.30 and make the necessary changes to swap out the obsolete options.