googleapis / google-cloud-dotnet

Google Cloud Client Libraries for .NET
https://cloud.google.com/dotnet/docs/reference/
Apache License 2.0

Thread starvation on ASP.NET Core from logging #7177

Closed: ghost closed this issue 2 years ago

ghost commented 3 years ago

Environment details

Steps to reproduce

  1. use the official way to hook up an ASP.NET 5 app to Google diagnostics
  2. log intensively

Hi fine folks. Using the official ASP.NET Core diagnostics package, I'm getting thread starvation due to the way logging uses synchronous semaphores to protect the buffer.

The simple proof is the screenshot below, taken in Visual Studio from a dump I collected in production.

[screenshot]

Diagnostics was set up like this:

  webBuilder.UseGoogleDiagnostics(
            projectId: "",
            serviceName: "",
            serviceVersion: "",
            loggerOptions: LoggerOptions.Create(
                logLevel: LogLevel.Information,
                loggerDiagnosticsOutput: Console.Out,
                logName: logName,                         
                monitoredResource: new Google.Api.MonitoredResource
                {
                    Type = "k8s_container",
                    Labels =
                    {
                            { "project_id", ""},
                            { "location", "us-east1-b"},
                            { "cluster_name", ""},
                            { "namespace_name", "default"},
                            { "pod_name", Environment.MachineName }
                    }
                }));

I think the bug is related to the SemaphoreSlim used to protect the List that acts as a buffer. It is used in the Receive method, which is invoked from the common GoogleLogger.

I know it's protecting a small operation over an in-memory data structure, but it does mess up my process. Why aren't the .NET concurrency-friendly collections used instead of manual synchronization over non-concurrent collections?
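
For illustration, here is a rough paraphrase of the pattern I mean (not the library's actual code; the type and member names are made up):

using System.Collections.Generic;
using System.Threading;

// Sketch only: every log call waits synchronously on a SemaphoreSlim before
// appending to a plain List<T> buffer, so under heavy logging many request
// threads can end up blocked here.
public class SemaphoreGuardedBuffer<T>
{
    private readonly SemaphoreSlim _mutex = new SemaphoreSlim(1, 1);
    private readonly List<T> _entries = new List<T>();

    // Invoked on the logging hot path for every entry.
    public void Receive(T entry)
    {
        _mutex.Wait(); // Synchronous wait: blocks the calling thread.
        try
        {
            _entries.Add(entry);
        }
        finally
        {
            _mutex.Release();
        }
    }
}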

Let me know if you need more info or if you believe there's something wrong with how I use the GCP libraries.

Thanks for looking into this!

amanda-tarafa commented 3 years ago

I will look at this on Monday.

amanda-tarafa commented 3 years ago

Also, if you can, please take a look at this comment on another issue. It's the only known cause of thread starvation, but it's related to how the Google credentials are obtained: https://github.com/googleapis/google-api-dotnet-client/issues/1807#issuecomment-818861719

ghost commented 3 years ago

Thanks for looking into this @amanda-tarafa

I only use 2 GCP services from my app: the diagnostics suite and object storage. For object storage I have async initialization (StorageClient.CreateAsync()) in Main, between Build and Run.

For now I've removed GCP diagnostics to keep the service alive, but I would like to add it back. I like it a lot.

Before switching it out, I analyzed a few other dumps, and all were stuck in logging, waiting for that semaphore.

I'm also setting the minimum worker threads via the ThreadPool API SetMinThreads (not touching IOCP threads, since I'm on Linux).
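
For reference, a minimal sketch of that thread pool tweak (the numbers are illustrative, not my actual values):

using System.Threading;

// Read the current minimums, then raise only the worker thread minimum;
// IOCP (completion port) threads are left as they are.
ThreadPool.GetMinThreads(out int workerMin, out int iocpMin);
ThreadPool.SetMinThreads(200, iocpMin); // 200 is an illustrative value.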

amanda-tarafa commented 3 years ago

Can I ask whether what you are seeing is the same thread(s) never getting a chance to log, or just many different threads competing for a single resource? Can you see in the dumps how long a thread has been stuck there? Also, when you say "log intensively", can you give numbers? How many threads are trying to log at the same time, and at what frequency? I want to try and reproduce.

I'm pretty sure there's no deadlocking bug or anything similar, just that this is a "scarce" resource. I'll look at changing the implementation to use one of the concurrent collections, and that might improve things some, but still, this will be a bottleneck and I'm not sure how we can get around that (without major library rewriting). Notice that making the operations async is not an option because the Microsoft.Extensions.Logging.ILogger interface does not define an async overload.

ghost commented 3 years ago

Hi @amanda-tarafa

I only have mini dumps, as full dumps killed the pod; I'll try to analyze them later when I get some time. What I have observed is:

  1. pods with high thread count, up to tens of thousands (I know it's crazy, but I've seen it); to be honest, for me even a few hundred is a signal that something is wrong. I've seen the service become seriously degraded after 1k.
  2. I think it's multiple threads that cannot log, not a few well-known threads. All my code is async, not manipulating threads explicitly.

Regarding numbers, I do expect to have a few thousand log writes per second.

The execution environment is GKE with Ubuntu OS; I am not exactly sure if this makes any difference.

amanda-tarafa commented 3 years ago

Thanks, I'll try to reproduce and work on using the concurrent collections to see if that makes any difference. I'll report here when I know more. Anything particular about your GKE configuration?

ghost commented 3 years ago

Nothing I can think of; latest GKE, Ubuntu, some daemon sets, sometimes extensive high CPU usage on nodes, max 32/64 pods per node, though I rarely see more than a few tens of pods. I'm on GCP support; I can open a case and share private info there if that helps.

amanda-tarafa commented 3 years ago

Thanks, and good to know we can get more detailed info if we need to. Let me try with what you have shared first and we can go from there.

amanda-tarafa commented 3 years ago

OK, I've managed to reproduce this. I've now set up some benchmarking so it'll be easier to see if using concurrent collections results in any improvements. If it doesn't, we'd have to think of some more drastic changes, but let's get there first.

ghost commented 3 years ago

How did you repro this? Just by logging at high scale?


amanda-tarafa commented 3 years ago

Yes, I can share the full code later, but basically:

amanda-tarafa commented 3 years ago

OK, after many trials, here are my findings:

So, again, this pretty much confirms that there's no issue with Google.Cloud.Diagnostics.AspNetCore*. I will create an issue for .NET 5 later today or tomorrow so they can look into it.

Can you please confirm on your end?

ghost commented 3 years ago

Thanks for this @amanda-tarafa !

I'll run an experiment and come back to you. It would be good to get an RCA for this; it is curious.

amanda-tarafa commented 3 years ago

Yes, I'll try to get to the bottom of it. I think it's either one of the default loggers or the agent running on the pod that collects stdout logs. I think I can easily answer that with the setup I already have.

amanda-tarafa commented 3 years ago

OK, I've gotten as far as I can go. The culprit is the Cloud Logging agent that runs by default on GKE, which captures stdout and stderr and sends them to Google Cloud Logging. Disabling the ASP.NET Core default loggers works because then there's nothing writing to stdout and stderr, so there's nothing for the agent to capture.

The trial that easily confirms that the performance issues lie with the agent is as follows.

ASP.NET Core app started:

public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureWebHostDefaults(webBuilder =>
        {
            webBuilder
            .UseStartup<Startup>()
            .ConfigureLogging(builder => builder.ClearProviders()); // Let's make sure there's no logging provider at all, so that we know they don't interfere.
        });

And then a controller action:

[HttpGet]
public long Get()
{
    var stopWatch = Stopwatch.StartNew();
    Console.WriteLine("Logging to test the agent");
    stopWatch.Stop();
    var loggingTime = stopWatch.ElapsedMilliseconds;

    return loggingTime;
}

With this app deployed in a default GKE cluster, and with a client that generates about 7k requests per second, we get the following stats.

And doing some basic structured logging, these are the stats:

In turn, if the agent itself is disabled (only for Workloads; it can remain enabled for System), then any combination of writing directly to stdout, ASP.NET Core logger providers, and Google.Cloud.Diagnostics.AspNetCore* works perfectly (0 requests fail) for a volume of 7k requests per second.

I'll create an issue for the Logging agent team and report back here when I know more.

For now, you'll either have to disable the agent (just for Workloads) or disable the ASP.NET Core Console logger.

ghost commented 3 years ago

Hi @amanda-tarafa I just pushed the code to prod yesterday and haven't observed any issues. I'll keep monitoring and let you know.

amanda-tarafa commented 3 years ago

That's good to know @CatalinAdlerDF. Do let me know if you see any issues, and I'll report back here when I hear from the Logging Agent team.

amanda-tarafa commented 3 years ago

@CatalinAdlerDF any news on your experiments?

If everything is fine, I'd like to close this issue as there's not much we can do on the repo. Of course you can leave a comment if the issues reappear or create a new issue.

I will come back here and update when I have more information from the Logging Agent team.

ghost commented 3 years ago

Hi Amanda,

I'll close this for now, though I still have some concerns about thread starvation. I need to repro and take a snapshot dump, and in case it's related to GCP, I'll reopen this.

I do suggest you document this somewhere.

Thank you for your assistance!

amanda-tarafa commented 3 years ago

Thanks!

Yes, I'll add some notes on this to the Google.Cloud.Diagnostics docs once I know more from the agent team. This might very well be due to a bug that they fix, in which case there's no need to document it.

Again, regardless of the outcome, once I know more I'll come back here.

ghost commented 2 years ago

@amanda-tarafa I still get significant blockage on logging: 90 threads waiting for that semaphore.

[screenshot]

amanda-tarafa commented 2 years ago

Can you share, just so I can try and reproduce:

ghost commented 2 years ago

Which of the workarounds are you using: a) did you disable the logging agent on the cluster, or b) did you clear the .NET logging providers?

Cleared logging providers.

What's the volume of logging requests, still around 7k per second?

Way small, maybe 100.

Are you aware of any upgrades on your cluster

I did upgrade to latest GKE.

Nothing major happened, just maybe a little bit more load. The current mitigation has been to separate load by endpoint: 1 endpoint -> 1 k8s deployment, i.e. bulkhead fault isolation. I have a few endpoints which appear to be causing this, but they are also quite simple and are some of the most active endpoints, by a significant margin.

The experiment I am running now is to switch to Serilog for only the deployment which is most affected (it had something like 30 recycles in 2h).

amanda-tarafa commented 2 years ago

Way small, maybe 100. Nothing major happened, just maybe a little bit more load.

I'm a little confused by these two statements. Before we were working on the assumption of several 1000s logging requests per second (I ran my experiments with ~7k logging requests per second). Now you say it's maybe 100 logging requests per second but that this is more load than before?

The experiment I am running now is to switch to Serilog for only the deployment

I'm not entirely certain what you mean by this? Do you have a separate app that deploys the main app, and now the deployment app is using serilog and the main app is using Diagnostics?

BTW, that the deployment phase is the most affected wouldn't surprise me, given my conclusion that this is the logging agent, especially if your workaround has been to clear the .NET providers. Many underlying systems may be writing to the console at deployment time (by removing .NET providers it's just the app that's not writing to the console, but everything else is). Since the agent remains active, it's still consuming resources, which ends up starving everything else. Maybe run an experiment by leaving in the providers and deactivating the agent?

Next week, when you have clarified these remaining details, I'll run some more tests. And I'll also follow up with the logging agent team.

amanda-tarafa commented 2 years ago

One last question on your comments:

I did upgrade to latest GKE.

So, you saw the blockage again after you upgraded to latest GKE? Before that and after implementing the workaround, everything was fine?

ghost commented 2 years ago

I'm a little confused by these two statements.

Apologies. Inbound HTTP requests are maybe in the hundreds; log requests per second are probably a lot, but I did not measure recently.

I'm not entirely certain what you mean by this? Do you have a separate app that deploys the main app, and now the deployment app is using serilog and the main app is using Diagnostics?

For the k8s deployment where I have observed most of the problems, I am switching from logging to Stackdriver using your libs to using Serilog with a Stackdriver sink. If I notice a significant difference in behavior, then the culprit is the GCP lib.

So, you saw the blockage again after you upgraded to latest GKE? Before that and after implementing the workaround, everything was fine?

I cannot confirm or deny this. I am just putting in more effort these days to fix this.

amanda-tarafa commented 2 years ago

Acked on all, thanks!

If you find Serilog works significantly better, can you please share your Serilog configuration so I can run some tests on that?

Note that a couple weeks ago I ran a simple test (re-sharing):

[HttpGet]
public long Get()
{
    var stopWatch = Stopwatch.StartNew();
    Console.WriteLine("Logging to test the agent");
    stopWatch.Stop();
    var loggingTime = stopWatch.ElapsedMilliseconds;

    return loggingTime;
}

This still yielded pretty poor results for ~7k logging requests per second:

As soon as I deactivated the logging agent, all requests were successful and 99% of requests were done in less than 2ms. Since the Diagnostics library (or any other logging library) was not involved here, this test makes me almost 100% certain that there's a performance issue with the logging agent. With the logging agent disabled, I could directly use Console.WriteLine, the MS Console logger and the Diagnostics library all together with no issue at all.

One thing you can check as well: are any of the more affected services using Console.WriteLine at all anywhere? Maybe some leftover debug code or similar? Or maybe some dependency that logs directly to the console? Again, one very quick way to test for this is to log using all default .NET providers and Diagnostics, but with the logging agent disabled. I'm pretty certain you won't see any issues there.

ghost commented 2 years ago

First overnight observation: yesterday evening (around 5pm) I switched to using Serilog with a Stackdriver sink for the top offender (the deployment which suffered the most).

So far, no issues when switching to Serilog, while I am still observing issues with the deployments left using the GCP libs directly.

So the situation is: lots of other deployments use the GCP lib directly and 1 uses Serilog with a Stackdriver sink. The direct GCP lib code is (apologies for the formatting):

hostBuilder.ConfigureLogging((context, logging) =>
    {
        logging.ClearProviders();
    })
.ConfigureWebHostDefaults(webBuilder =>
{
    var logName = "....";
    // Also set the no buffer option so that tracing is attempted immediately.
    var bufferOptions = BufferOptions.TimedBuffer(TimeSpan.FromSeconds(5));

    webBuilder.UseGoogleDiagnostics(
            projectId: ".....",
            serviceName: "....",
            serviceVersion: "vNext",
            loggerOptions: LoggerOptions.Create(
                logLevel: LogLevel.Information,
                loggerDiagnosticsOutput: Console.Out,
                logName: logName,
                bufferOptions: bufferOptions,
                monitoredResource: new Google.Api.MonitoredResource
                {
                    Type = "k8s_container",
                    Labels =
                    {
                        { "project_id", "...." },
                        { "location", "us-east1-b" },
                        { "cluster_name", "...." },
                        { "namespace_name", "default" },
                        { "pod_name", Environment.MachineName }
                    }
                }));

    webBuilder.UseStartup<Startup>();
});

The relevant Serilog code is:

hostBuilder
.UseSerilog((hostingContext, loggerConfiguration) =>
    {
        loggerConfiguration.ReadFrom.Configuration(hostingContext.Configuration);
    })
.ConfigureWebHostDefaults(webBuilder =>
{
    webBuilder.UseStartup<Startup>();
});

with the following config file

 "Serilog": {
    "Using": [ "Serilog.Sinks.GoogleCloudLogging" ],
    "MinimumLevel": "Information",
    "Enrich": [ "FromLogContext", "WithMachineName", "WithExceptionDetails" ],
    "WriteTo": [      
      {
        "Name": "GoogleCloudLogging",
        "Args": {
          "projectID": ".....",
          "useJsonOutput": "true",
          "useSourceContextAsLogName": false,
          "logName": ".....",
          "serviceName": ".....",
          "serviceVersion": "vNext",
          "useLogCorrelation": true
        }
      }
    ]
  }

I plan to switch the next top offender to using Serilog directly and figure out a way to automatically get minidumps when the thread count goes above a certain threshold.

I've reviewed the FlushableConsumerBase code and I think the way to repro the issue is high concurrency on writing log entries, in a memory-bound environment with a fragmented heap. IIRC the List doubles in size if the behind-the-scenes array cannot fit the item count. So if the runtime cannot quickly find a place to allocate a new huge array, it might trigger GC to make some room for it, which leads to pauses and maybe even failures (what if there is no room left?). And I think this is why you haven't exactly reproduced my scenario: in your sample there is no memory pressure.

Try to limit .NET to a small heap, create a use case of high concurrency of logging, make sure you log a lot at a high rate (multiple tasks, each in a tight loop maybe), and allocate randomly and absurdly.
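
Something along these lines (a rough sketch, not an exact repro; the task and iteration counts are arbitrary, and the console logger is just a stand-in for the Google Diagnostics logger):

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Sketch: many tasks log in tight loops while also allocating random-sized
// arrays to create heap pressure. Run it with a constrained heap to
// exaggerate the effect.
class LoggingPressureRepro
{
    static async Task Main()
    {
        using var factory = LoggerFactory.Create(b => b.AddConsole());
        var logger = factory.CreateLogger("repro");

        var tasks = new Task[32];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                var random = new Random();
                byte[][] garbage = null;
                for (int j = 0; j < 100_000; j++)
                {
                    // Tight-loop logging, comparable to "a few thousand log writes / sec".
                    logger.LogInformation("Entry {Number} from a tight loop", j);

                    // Random, oversized allocations to fragment and pressure the heap.
                    garbage = new byte[random.Next(1, 64)][];
                    garbage[0] = new byte[random.Next(1, 128 * 1024)];
                }
                GC.KeepAlive(garbage);
            });
        }
        await Task.WhenAll(tasks);
    }
}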

I think an easy fix is a concurrency-aware, write-optimized data structure, maybe ConcurrentQueue, or maybe not even a concurrency-aware structure. I can try to compile the code on my machine and, if successful, I can experiment.
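
As a sketch of the kind of change I mean (not a patch against the library; the names are illustrative):

using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch: writers enqueue without ever taking a lock; the timed/sized flush
// drains whatever is queued at that moment and sends it off.
public class LockFreeLogBuffer<T>
{
    private readonly ConcurrentQueue<T> _entries = new ConcurrentQueue<T>();

    // Called from the logging hot path; never blocks the calling thread.
    public void Receive(T entry) => _entries.Enqueue(entry);

    // Called by the flush path; drains the current contents of the queue.
    public IReadOnlyList<T> Drain()
    {
        var drained = new List<T>();
        while (_entries.TryDequeue(out var entry))
        {
            drained.Add(entry);
        }
        return drained;
    }
}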

amanda-tarafa commented 2 years ago

Thanks for the extensive report, I'll try with Serilog on Monday.

amanda-tarafa commented 2 years ago

I still haven't gotten around to running more tests; I will do so this week, including with Serilog, and report back (I'll share my tests and full results as well so you can try them on your own if you want). Still, I've been thinking about this over the weekend and wanted to ask for a little more information. Here's why:

Taking these things into account:

Basically, even if you don't agree with my reasoning, it would be very useful if you could share the information about service degradation rather than, or in addition to, the memory dump. Failed service requests and service request times would be great, for the four combinations: Diagnostics + .NET Logging Providers, Diagnostics - .NET Logging Providers, Serilog + .NET Logging Providers and Serilog - .NET Logging Providers.

Also, if you want to test the concurrent collection implementation, take a look at #7258. This works well, but does not improve things (as measured in failed service requests and logging times, in the presence of the logging agent), so we decided not to merge it.

ghost commented 2 years ago

Just a short answer for now. I use a liveness probe configured using AspNetCore.HealthChecks.* for MySQL and Redis. In k8s it's configured to hit the health endpoint every 60s with a 10s timeout.

livenessProbe:
  httpGet:
    path: /api/health
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10

K8s recycles the container when the probe fails (I think 3 times in a row). When a recycle occurs I get 502s on the client side, and unhappy customers.

I am observing the count of recycles.

I am also using Ben Adams' BlockingDetector for ASP.NET Core to log blocking operations.

This evidence suggests that threads are blocked and unable to handle incoming requests. Going down the rabbit hole, I ended up taking mini dumps when the thread count goes unusually high, and that is where I've observed the GCP libs being the culprit. I have empirical evidence that when the thread count goes up too much, I get liveness probe failures and recycles. I do not have load high enough to justify 100 threads (and I've actually seen 3-400 easily); the top service had under 100 inbound HTTP req/s, and all the code that fans out processing to multiple concurrent tasks in the HTTP handlers has been scoped down on degree of parallelism, to something like a DOP of 1-5, to control load on the system (MySQL fails before .NET does).

Till now, I have 2 deployments moved from the GCP libs to the Serilog setup, and since the change, 0 recycles, meaning 0 liveness probe failures. The services which still use the GCP lib directly are still suffering recycles as I type this.

Another thing I've noticed which is weird is a high degree of heap fragmentation, even 60%. I haven't seen a high count of GC events though. I am not closely monitoring these metrics, to be honest, but I think they might fit the List growth algorithm theory well.

For the Serilog tests, did you also clear the other .NET logging providers?

You have the Serilog code above; it does nothing but read from the config file which I posted above.

The issue is not how many threads are waiting, the issue is how long they wait for, and what that does to the service response times overall.

I partially agree with you; the trigger is that the wait time is too long, but the RCA is why we are even blocking threads. It's just too easy to get into problems when you have blocking on hot paths (too many stories similar to this one to count), and I think the better solution is not to tune the blocking work but to remove it altogether; it just avoids risk.

I don't think I have the time to play around with various solutions for this, but my thoughts are around why the lock is even needed (because in the end it protects the item list itself); maybe a volatile ref or a lock-free solution would be better.

I'll make one more change that might help prove this: move to a size-based buffer, on the idea that it will lead to less pressure on the memory subsystem, so maybe fewer waits.

Update 1: I realize this is not a short answer :D

amanda-tarafa commented 2 years ago

I was writing the following reply when it occurred to me to check something. Serilog ignores all other providers by default. You can see their documentation on enabling Microsoft.Extensions.Logging.ILoggerProviders. It would be super useful if you could run Serilog with writeToProviders set to true and see whether that makes the services degrade.

I'm still finishing my reply below because it still holds:

Thanks for the not-so-short answer :). All info helps, and is needed, as what I can reproduce is different from what you are seeing.

First thing is, this library is old, and definitely, if we were starting from scratch, some things would be done differently (which probably includes not using the lock). But we now have to maintain the existing code without breaking existing customers, while adding new features, etc., so code that "works" stays as is. Please rest assured that if there's an issue with the Diagnostics library we will do our best to fix it. I'm more than happy to merge #7258, but I honestly think there's an underlying issue here that's bringing out the worst part of locking.

Basically I want to find out why the wait time for obtaining the lock is so long, as it shouldn't be, because the operations inside the lock are very simple; unless there's no CPU available, or similar. In my tests, the thing causing the long wait times was the logging agent. But that doesn't seem to be your case, and I still want to know why before making changes to the library.

As for what you are seeing in your probes, that may still be consistent with Diagnostics blocking not being the issue, but just a symptom. Let's say there's something consuming lots of CPU (can you check CPU usage?).

So again, ideally we could see request times for the 4 combinations:

When using Diagnostics you also have tracing enabled, through the library, so you might have a sampling of requests already, with how long they took, or even whether they timed out. And since you are running on k8s, I think you will have some tracing information regardless of whether you are using Diagnostics or not, so for Serilog there might be something you could look at already as well.

I'll still run the tests on my side and report back. And we'll get there.

ghost commented 2 years ago

Hi @amanda-tarafa thanks for your similar-in-size answer. Before I act on it, some more observations: after I changed the buffer type to size-based, I've barely noticed any recycles (so few that I ignore them), so it does appear to have had an impact.

CPU is not used at a high rate, as far as I know.

To be honest, I got confused by your comments on Diagnostics / .NET logging providers. I need some time to digest them.

amanda-tarafa commented 2 years ago

after I changed the buffer type to size-based, I've barely noticed any recycles (so few that I ignore them), so it does appear to have had an impact

The simple explanation for this is that the time-based buffer is of course triggered by a timer; the timer triggers every 5 seconds by default. This grabs the lock to swap the list reference, releases the lock, and after releasing the lock writes the entries to the server. 5 seconds should be more than enough to swap a reference, but on a degraded system, threads might be getting stuck here (as well as in logging). This would also account for the number of stuck threads you see not matching the load you have; some of those stuck threads are the timer ones. This doesn't happen with the size-based buffer because the list reference is swapped on a request thread. We've seen this happen on environments that throttle CPU, like Cloud Run, and we've added a note (second note top down) to our docs about that. I wonder, have you set CPU limits on your pods? I didn't think of this earlier, but if you set low CPU limits on your pods, then that might very well be part of the problem. There's a Medium article describing some of this; in particular look for "Checking the throttling rate of your pods". If throttling is high enough, then that would explain (almost) everything.
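
In rough pseudo-code, the flow I'm describing looks like this (a simplified sketch, not the library's exact implementation):

using System;
using System.Collections.Generic;
using System.Threading;

// Sketch: a timer fires periodically, holds the lock only long enough to swap
// the buffer reference for a fresh list, and then sends the swapped-out
// entries to the server outside the lock.
public class TimedBufferSketch
{
    private readonly object _lock = new object();
    private List<string> _buffer = new List<string>();
    private readonly Timer _timer;

    public TimedBufferSketch(TimeSpan period) =>
        _timer = new Timer(_ => Flush(), null, period, period);

    // Logging hot path: append under the lock.
    public void Receive(string entry)
    {
        lock (_lock)
        {
            _buffer.Add(entry);
        }
    }

    private void Flush()
    {
        List<string> toSend;
        lock (_lock) // Held only for the reference swap.
        {
            toSend = _buffer;
            _buffer = new List<string>();
        }
        Send(toSend); // The network write happens outside the lock.
    }

    private void Send(List<string> entries)
    {
        // Placeholder for writing the batch to the logging backend.
    }
}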

It could even explain why removing the .NET logging providers works for me but not for you. My application is very light and I removed the "only other" thing that might be consuming CPU and working towards those limits, whereas your application is heavier and you continue to use MySQL, Storage, etc.

(let me know if you want me to go into more detail about the .NET Logging providers, etc.)

ghost commented 2 years ago

I don't think it's like that. I think it's more like: with a time-based buffer, if you have tons of logs in between timer callbacks, then you get into the List growth algorithm, which is double-in-size, an operation that depends on the state of the heap; I used the 5s buffer. I have 16 CPU as a limit. But AFAIK this is not how k8s uses the limits; the limits are not hard limits, they just affect the priority of the kill algorithm: pods which use more than their defined limits will be killed before pods which are under their limits.

amanda-tarafa commented 2 years ago

Hmmm, yes, that is happening, but I didn't see any of that being a problem in my tests, with 7k log requests/second, which means 35k log requests in between timers, which means that the list grows several times for me as well.

For CPU limits I didn't mean total CPUs, but allocation per node: [screenshot]

I'm not an expert in k8s, to be honest, but I will check throttling when running my tests and see if there's something there on my side.

ghost commented 2 years ago

which means that the list grows several times for me as well.

It needs to grow while the heap is under pressure; the point is that allocating a doubled-in-size array takes a long time in that state.

In your example on CPU limits, it reads like this: "213 mCPU have been reserved and 940 are available for reservation". With how containers work, that has nothing to do with how much CPU a pod can actually use; you can reserve 1 CPU and use 4 if they are available.

I use e2 machines with a few CPUs today (2 or 4, I think).

amanda-tarafa commented 2 years ago

From the Kubernetes docs: How Pods with resource limits are run

  • The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time in microseconds that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.

... A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
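
(As a worked example of that conversion, with an illustrative value rather than one from this thread: a limit of 500m becomes 500 × 100 = 50,000 µs, i.e. 50 ms of CPU time per 100 ms window; a container that needs more CPU within a window is paused until the next window.)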

From Google docs on Kubernetes best practices (In the section Container settings -> CPU):

It’s when it comes to CPU limits that things get interesting. CPU is considered a “compressible” resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container. This means the CPU will be artificially restricted, giving your app potentially worse performance! However, it won’t be terminated or evicted. You can use a liveness health check to make sure performance has not been impacted.

So, I'm pretty sure your services' threads are being throttled. And that actually matches my theory and your observation:

after I changed the buffer type to size-based, I've barely noticed any recycles (so few that I ignore them), so it does appear to have had an impact.

If it was the list resizing on an under-pressure heap (the outstanding question being why the heap is under pressure in the first place) that makes the lock operations take long and the threads get stuck on the locks, then changing to size-based would have entirely fixed the issues, i.e. no recycles at all. Your liveness health check is still failing occasionally because there's something else going on, and that is threads being throttled, because you are probably at times consuming more CPU than what is specified in the limit. And precisely the liveness health check is the one that's recommended for checking whether you are being throttled and your performance impacted.

In my case, disabling the .NET log providers or the logging agent gave me back enough CPU so that I wasn't throttled again. In your case, since you continue to do some other heavy operations, it's not enough to clear the providers; you may need to set the limits higher.

Mind you, I can very much see how, given that threads are being throttled, Diagnostics would make things worse, especially the timed buffer, because of all the timer threads that would get throttled themselves and end up starving the thread pool. But there needs to be something else going on. This won't happen on its own. The list sizing, at least with the volumes we are talking about, can hardly be the main culprit here, and probably won't even lead on its own to the heap fragmentation you are seeing.

Sorry to keep insisting, I just want things to really add up, and right now, they don't fully for me.

A few more questions:

hostBuilder
.UseSerilog((hostingContext, loggerConfiguration) =>
    {
        loggerConfiguration.ReadFrom.Configuration(hostingContext.Configuration);
    }, writeToProviders: true)
.ConfigureWebHostDefaults(webBuilder =>
{
    webBuilder.UseStartup<Startup>();
});
ghost commented 2 years ago

A quick note: k8s has moved away from using Docker as a container engine. The second observation is maybe valid, though I am somewhat doubtful. I am pretty sure that the container is not killed when going over the limit; it depends on context.

Though AFAIK the nodes where my code is executing are not seeing lots of CPU usage; I was not able to observe high CPU usage, just high heap fragmentation.

ghost commented 2 years ago

why the heap is under pressure in the first place

Regular allocations during request processing and whatever other things are happening there. Keep in mind .NET has heuristics to size the heap based on container requests and limits.

would have entirely fixed the issues

I don't think 1-2 recycles in a period where I used to have tens count for much. There is other code there as well; it can hit another problem as well.

Is the Serilog pod being recycled at all?

No recycles on anything moved to Serilog, and only 2 recycles when using the GCP lib; previously I had tens of recycles per day.

Can you configure serilog as follows and report back on recycles and/or health check failures?

I'll make the change today and we'll have some observations by tomorrow.

amanda-tarafa commented 2 years ago

A quick note: k8s has moved away from using Docker as a container engine. The second observation is maybe valid, though I am somewhat doubtful. I am pretty sure that the container is not killed when going over the limit; it depends on context.

Yep, I don't think the container is being killed; I think threads are being throttled, which leads to health check failures, which leads to recycles.

only 2 recycles when using the GCP lib

Is this with the timed buffer or the sized buffer? Are the .NET logging providers cleared or not? I'm very interested in knowing whether the Diagnostics configuration that you pasted in this comment https://github.com/googleapis/google-cloud-dotnet/issues/7177#issuecomment-955169695 made any significant difference with respect to not having cleared the default providers. For instance, after that change you mentioned that you had some top offenders; does this mean that before that change, more services were affected?

Though AFAIK the nodes where my code is executing are not seeing lots of CPU usage; I was not able to observe high CPU usage, just high heap fragmentation.

And you wouldn't notice high CPU usage in the "normal" sense; the limit is time-based and relative to the allocatable CPU. Can you take a look at this graph for both the Serilog and Diagnostics deployments you have currently running, and then at the Serilog deployment that you are changing today? In my case, I see degradation as soon as I start going over 1.

[screenshot]

ghost commented 2 years ago

Hi @amanda-tarafa I just pushed the change recently; I'll get back with observations when I have some.

Regarding throttling, keep in mind I have 16 set as the limit on a 2-core machine, so I do not see how I can be subject to limit-based throttling.

Also, before I pushed the Serilog change just now, I had 0 recycles for most deployments except for one (which uses the GCP libs), which had a few, but way fewer than it used to have with the timed buffer.

Is the Serilog pod being recycled at all? Are health checks there still failing occasionally, even if not 3 times in a row?

I haven't noticed a recycle on the pods that use Serilog.

When you moved to Diagnostics but cleared the default logging providers

If by this you mean when I commented out the following, then I can say I haven't noticed a positive change.

 //  logging.AddConfiguration(context.Configuration.GetSection("Logging"));
//   logging.AddDebug();
//   logging.AddEventSourceLogger();

I could probably confirm my heap fragmentation/allocation hypothesis using metrics: correlate recycles with GC events or fragmentation numbers; but I need time to export the metrics and then figure out a way to correlate them.
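
One possible way to export those numbers (a sketch only; the 30-second interval and writing to stdout are arbitrary placeholders, and GC.GetGCMemoryInfo() is available on the .NET 5 runtime used here):

using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: periodically sample GC collection counts and heap fragmentation so
// they can later be correlated with pod recycles. Writing to stdout is just a
// placeholder; any metrics exporter would do.
public static class GcMetricsSampler
{
    public static async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var info = GC.GetGCMemoryInfo();
            Console.WriteLine(
                $"gen0={GC.CollectionCount(0)} gen1={GC.CollectionCount(1)} gen2={GC.CollectionCount(2)} " +
                $"heapBytes={info.HeapSizeBytes} fragmentedBytes={info.FragmentedBytes}");
            try
            {
                await Task.Delay(TimeSpan.FromSeconds(30), token);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
    }
}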

amanda-tarafa commented 2 years ago

Thanks for the push, let's see what happens.

And this is what I was interested in: seeing whether it had a positive effect when you cleared all the default logging providers, i.e. this:

hostBuilder.ConfigureLogging((context, logging) =>
    {
        logging.ClearProviders();
    })
.ConfigureWebHostDefaults(webBuilder =>
{
    var logName = "....";
    // Also set the no buffer option so that tracing is attempted immediately.
    var bufferOptions = BufferOptions.TimedBuffer(TimeSpan.FromSeconds(5));

    webBuilder.UseGoogleDiagnostics(
            projectId: ".....",
            serviceName: "....",
            serviceVersion: "vNext",
            loggerOptions: LoggerOptions.Create(
                logLevel: LogLevel.Information,
                loggerDiagnosticsOutput: Console.Out,
..........

Actually, there might be a faster way to test whether you are being throttled or not. If you are willing, we can disable throttling altogether (I just found out about this) and check what happens with the worst offender of all.

  1. Set the kubelet option cpuCFSQuota to false for one of your nodes.
  2. Deploy your worst offender there with the original configuration, i.e. using the Diagnostics (GCP) libraries without clearing the providers, etc.
  3. If it works well, or even if it works better, we will know you were being throttled.
  4. You can then check the metric I shared earlier to see your CPU consumption peaks, set the pod limits to maybe double that, and restore the config option. (I will do this as well in my tests and report back.)
ghost commented 2 years ago

I do not see how the pod CPU limit can play a part, since I am using 16 CPU as the limit.

The CPU allocatable utilization metric never went close to 1.

[screenshot]

Since making the Serilog change this morning, no recycles, while I had recycles on the pod which uses the GCP libs directly.

If it helps you, I am on GCP support, so you might be able to look at the cluster/metrics directly.

amanda-tarafa commented 2 years ago

If it helps you, I am on GCP support, so you might be able to look at the cluster/metrics directly.

This will greatly help; I could see all the metrics and I could also set up my environment to closely mimic yours. If you can open a case, link to this issue and ask them to assign it to me, that'd be great.

Again, what I'm currently seeing on my side has nothing to do with what you are seeing, and I'm indeed affected by throttling, as I was able to confirm by doing steps 1 to 4 from my last comment; I ran this twice, with the same result both times. If I disable throttling or increase my allocatable CPU, all the issues I see disappear.

If you are able and want to try the Diagnostics (GCP) libraries directly with throttling disabled and check for improvements, that would be great as well. I do understand from your graph, though, that you haven't reached the conditions for throttling to happen, and that makes my theory less likely in your case.

16 CPU as the limit

I know you've said this many times, but I'm guessing this is the total CPUs for your cluster? Notice that the limit is in milliCPUs, so for it to really be 16 CPUs, in the image below, under CPU allocatable you would have to see 16000 mCPU. Is that what you see? [screenshot]

ghost commented 2 years ago

No, the deployments are configured with 16 CPU as the limit and 100m as the request; and today I am running these pods on nodes with 2 CPUs.

[screenshot]

I'll open a case and you can look for yourself.

amanda-tarafa commented 2 years ago

I can confirm that the case is being routed to me, I'll start looking in detail tomorrow.

ghost commented 2 years ago

Thank you, Amanda.

ghost commented 2 years ago

@amanda-tarafa current status:

With a node pool that does not enforce limits (cpuCFSQuota = false, confirmed from logs) and with my original code (timed buffer and with the Diagnostics loggers as well), I get tons of recycles.

The deployments which use Serilog have 0 recycles to date.

The last test I am doing now is to keep Diagnostics logging with a timed buffer.

amanda-tarafa commented 2 years ago

With a node pool that does not enforce limits (cpuCFSQuota = false, confirmed from logs) and with my original code (timed buffer and with the Diagnostics loggers as well), I get tons of recycles.

This fully rules out throttling then. I still don't know why I can't reproduce, but that's beside the point.

So, I'm proposing we do as follows, if you are able:

If you are OK with this, let me know, and my advice would be that, in preparation, you upgrade to V4.30 and make the necessary changes to swap out the obsolete options.