
Service Fabric Stateful Service in Azure - RAM usage keeps growing. Possible memory leak #584

Closed Triplebg closed 3 years ago

Triplebg commented 5 years ago

I am running a Service Fabric application on a cluster in Azure. The cluster has two scale sets:

There are two types of services in the application

For my tests I am using Application Insights and Service Fabric Analytics to track performance. I am observing the following parameters:

Metrics for the stateful scale set: Percentage CPU, Disk Read/Write operations/sec

Application Insights: Server Response Time - which corresponds to the execution time of the method that receives the statuses in the stateful StatusConsumer.

Service Fabric Analytics: Performance Counters with log analytics agent on the stateful nodes - used to observe the RAM usage of the nodes.

Each simulated system sends its status every 30 seconds.

At the beginning of the test the RAM usage is around 40% on each node, the average server response time is around 15ms, CPU usage is around 10%, and read/write operations are under 100/sec.

Immediately after the start of the test the RAM usage starts slowly building up, but there is no difference in the other observed metrics.

After about an hour of simulating 1000 systems the RAM usage is around 90-95% and problems start to show in the other metrics - the average server response time peaks at around 5-10 seconds and disk read/write operations reach around 500/sec.

This continues for 1-3 minutes, then the RAM usage drops and everything goes back to normal. [Images: RAM_Usage, ServerResponseTime] On the images you can see that the RAM peak corresponds to the server response time peak. At the end of the RAM graph the usage is flat, to show the behaviour without any simulated systems.

The number of systems simulated only affects the time it takes for the RAM to reach critical levels - in one of the tests 200 systems were simulated and the RAM usage rose more slowly.

As for the code: in the first tests the code was more complicated, but to find the cause of the problem I started removing functionality, still with no improvement. The only time the RAM usage did not rise was when I commented out all the code, so that the body of the method that receives the status contained only a try/catch block and a return. Currently the code in the StatusConsumer is this:

public async Task PostStatus(SystemStatusInfo status)
{
    try
    {
        Stopwatch stopWatch = new Stopwatch();

        // Get (or create) the reliable dictionary that holds the last consumed status.
        IReliableDictionary<string, SystemStatusInfo> statusDictionary =
            await this.StateManager.GetOrAddAsync<IReliableDictionary<string, SystemStatusInfo>>("status");

        stopWatch.Start();

        using (ITransaction tx = this.StateManager.CreateTransaction())
        {
            await statusDictionary.AddOrUpdateAsync(tx, "lastConsumedStatus", (key) => status, (key, oldValue) => status);
            await tx.CommitAsync();
        }

        stopWatch.Stop();

        // Only log requests that took longer than 4 seconds.
        if (stopWatch.ElapsedMilliseconds / 1000 > 4)
        {
            Telemetry.TrackTrace($"Queue Status Duration: {stopWatch.ElapsedMilliseconds / 1000} for {status.SystemId}", SeverityLevel.Critical);
        }
    }
    catch (Exception e)
    {
        Telemetry.TrackException(e);
    }
}

After connecting to the nodes with Remote Desktop, in Task Manager I can see that when the RAM usage is around 85%, the "Memory" of the SystemStatusConsumer process, which hosts the microservice's instances, is not more than 600 MB. That is the highest consumption of any process, but it is still not that high - the node has 8 GB of RAM. However, I don't know if this is useful information in this case.

How can I diagnose and/or fix this?

evadamsms commented 5 years ago

My suspicion is that you're seeing more or less expected behavior for .NET applications running with server GC, which does less frequent collections. This results in higher average memory usage, and each collection is more expensive. To dig deeper you can take a look at the .NET performance counters (especially the memory/GC counters) to see if these time periods coincide with GC collections.
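If it helps, a minimal sketch of logging the active GC mode and per-generation collection counts from inside the service, so they can be correlated with the RAM peaks (the TelemetryClient here stands in for whatever the Telemetry helper above wraps):

using System;
using System.Runtime;                    // GCSettings
using Microsoft.ApplicationInsights;     // TelemetryClient - assumed; the thread's Telemetry class is a custom wrapper

public static class GcDiagnostics
{
    private static readonly TelemetryClient telemetry = new TelemetryClient();

    // Call this periodically (e.g. from RunAsync) and compare the timestamps
    // with the RAM and response-time peaks.
    public static void LogGcStats()
    {
        telemetry.TrackTrace(
            $"ServerGC={GCSettings.IsServerGC}, LatencyMode={GCSettings.LatencyMode}, " +
            $"Gen0={GC.CollectionCount(0)}, Gen1={GC.CollectionCount(1)}, Gen2={GC.CollectionCount(2)}, " +
            $"ManagedBytes={GC.GetTotalMemory(false)}");
    }
}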

refs:

MedAnd commented 5 years ago

Try setting CheckpointThresholdInMB (amount of log file space after which the state is checkpointed) to 1MB, the default is 50MB.

Also, never use server GC, as per "High memory usage when calling a Stateful Service when using GCServer true".
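For reference, a rough sketch of where those two knobs live - CheckpointThresholdInMB in the stateful service's Settings.xml (the section name "ReplicatorConfig" is the conventional one and should match whatever the service passes to its replicator settings), and the GC mode in the service host's App.config:

<!-- Settings.xml of the stateful service (sketch) -->
<Settings xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <Section Name="ReplicatorConfig">
    <Parameter Name="CheckpointThresholdInMB" Value="1" />
  </Section>
</Settings>

<!-- App.config of the service host process - workstation GC, as suggested above -->
<configuration>
  <runtime>
    <gcServer enabled="false" />
  </runtime>
</configuration>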

Triplebg commented 5 years ago

I was using the default mode of garbage collection which I understand is workstation. To be sure I explicitly set it to false (<gcServer enabled="false"/>) in the App.config file of my stateful service. There was no difference in the results. I started observing more performance counters with Service Fabric Analytics:

I also tried setting CheckpointThresholdInMB to 1 in the Settings.xml of the stateful service and there was still no noticeable difference - the RAM usage is still the same as the screenshot in the first comment. I also changed MaxAccumulatedBackupLogSizeInMB to 1 (default 800) for the stateful service and set KtlLogger/WriteBufferMemoryPoolMinimumInKB and KtlLogger/WriteBufferMemoryPoolMaximumInKB to 16384 in the cluster configuration.

With all these changes there is no improvement.

MedAnd commented 5 years ago

Next steps would be to take a process dump and open an official ticket... this could be related to resource-utilization bugs fixed in the upcoming 6.5 release.

Out of interest, with CheckpointThresholdInMB set to 1, try calling GC.Collect within the Stateful Service every (x) iterations or every (y) interval.

Triplebg commented 5 years ago

I took several process dumps of the stateful services' host process. The dumps were taken at different points in time - one after creating all services, one when the memory consumption was medium (around 75%), one at high memory consumption (around 90%), and one after the memory dropped. Importing them into dotMemory showed almost no difference in memory usage between the dumps. I also attached the profiler to the live process on one of the nodes and observed the real-time memory usage. It was always around 550 MB total (managed + unmanaged memory), with some highs and lows from 580 to 520 MB, but they were not related to the overall memory usage of the node. From the graph it seemed that the garbage collector worked normally and regularly.

In one of my tests I called GC.Collect every 10 requests with CheckpointThresholdInMB set to 1, as suggested. After finding out this had no effect I tried calling it on every request, but only the CPU usage went up.

I also upgraded all NuGet packages related to Service Fabric to the latest at the moment (6.4.664) and manually set the Fabric runtime of the cluster to 6.4.664. There was no improvement. Up to this point the OS of the scale sets was Windows Server 2012 Datacenter. I tried changing it to Windows Server 2016 Datacenter with no positive effect. I am going to try the other available OS versions, except those including containers.

At this point I am pretty sure it is not a memory leak, at least not in the SystemStatusConsumer process. I am also running out of ideas. I will proceed with opening an official ticket as suggested.

Am I right to assume that after all the services are created and are receiving their statuses, the memory usage should stabilize (or at least stabilize after some transaction log fills up)? After all, the amount of data in the services does not change - the data is just updated with its latest version.

Is it possible that the reason for this behaviour is in the way I'm creating the services, or some other aspect of their implementation? Could the reason be that I create one service for each system?

MedAnd commented 5 years ago

It would be good to try the just-released SF 6.5, as it fixes known and serious issues such as "Stateful Service high and increasing CPU usage".

PS. You might have to wait a few days though as the run-time is still rolling out across Azure...

derekeom commented 5 years ago

It looks like you're not quite certain which process or processes are increasing the overall RAM usage. Perhaps you can try taking a snapshot of all processes and their memory usage when it's low vs. when it's high?

MedAnd commented 5 years ago

I wonder if this issue could be FabricClient related... it seems other customers are observing similar behavior, as per "Memory Leak from FabricClient?"

Triplebg commented 5 years ago

I'm not using FabricClient in the stateful service's project. The only place I am using it is in the WebAPI stateless service, which runs on a different scale set. Can it still cause a problem?

To get records of the memory usage of all processes I used PowerShell's Get-Process with the WorkingSet64 and PagedMemorySize64 properties. Comparing the results at 99% RAM / 98.5% committed bytes in use with the results at 27% RAM / 37% committed bytes in use (after the memory usage dropped) showed that the most significant decreases in WorkingSet64 were in SystemStatusConsumer (around 100MB) and in the FabricDCA processes. The most significant decreases in PagedMemorySize64 were in SystemStatusConsumer (around 320MB) and in FileStoreService (around 90MB). A few processes decreased by around 30 to 50MB, but most of them were around 1-10MB. If there is a better method/tool for taking a snapshot of all processes and comparing them, I'll be very thankful if you could share it!

An interesting thing for me was that after adding up all WorkingSet64 results, the total was much less than the actual RAM usage. This was also true for PagedMemorySize64.
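For reference, a rough C# equivalent of that Get-Process snapshot (plain System.Diagnostics, nothing Service Fabric specific), writing one CSV per run so a low-usage and a high-usage snapshot can be diffed:

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class ProcessSnapshot
{
    static void Main()
    {
        // One line per process: name, pid, working set and paged memory in bytes.
        var lines = Process.GetProcesses()
            .OrderByDescending(p => p.WorkingSet64)
            .Select(p => $"{p.ProcessName},{p.Id},{p.WorkingSet64},{p.PagedMemorySize64}");

        File.WriteAllLines(
            $"procs-{DateTime.UtcNow:yyyyMMdd-HHmmss}.csv",
            new[] { "Name,Pid,WorkingSet64,PagedMemorySize64" }.Concat(lines));
    }
}

Note that summing per-process working sets will not account for kernel allocations such as the non-paged pool, which may be why the totals come in below the node's actual RAM usage.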

To be clear, the graph in the first comment is actually for "% Committed Bytes In Use". The actual RAM usage increases a little faster - when "% Committed Bytes In Use" is at 90%, the RAM usage of the node is already at 99%.

PS: SF 6.5 is not available yet, so I haven't done tests with it.

Triplebg commented 5 years ago

Observing the non-paged pool with Task Manager and RAMMap showed that most of the RAM is used by it. When the RAM is at 99%, 6.5GB (of 8GB total) is taken by the non-paged pool. Using poolmon, the tags that show the most bytes used are:

After the RAM usage decreased by itself, the non-paged pool and the numbers shown in poolmon for the 3 tags dropped significantly.

Immediately after stopping the simulator that sends the requests, overall RAM usage drops to 35% and the non-paged pool drops to 1.3GB. The memory usage shown in poolmon is:

MedAnd commented 5 years ago

It would be good to post a full list of processes and their respective memory utilization stats, including Service Fabric processes such as the ones above. In addition, if you track all the recommended .NET performance counters, especially GC, this would also be helpful to post:

evadamsms commented 5 years ago

Thanks for the additional data!

* KLog - around 4.3GB

I'm curious, what is the high-level workflow for simulating 1000 systems? I'm wondering if you're creating new applications/services and deleting the old between each simulation, and if so, what method you use to do so. For context, each replica holds some resources in the Ktl Log driver, which includes nonpaged memory allocations. If the replicas aren't cleaned up normally (e.g. if they are force-removed) the driver resources can be leaked until a later point at which they're cleaned up, e.g. on machine reboot.

Triplebg commented 5 years ago

Actually, for almost every test I create a new cluster and delete it when I'm done. Only one application runs on this cluster. Very rarely, I make changes to the code and publish the application (without upgrading it) to a cluster that already has the application running on it.

In one "test" I create one service for each system after receiving the first request posting its status. After this the service should stay alive until the cluster is deleted even if there are no statuses for it and it sits idle.

When I refer to the simulator or simulation I mean the tool we have built to send the requests posting the statuses of the systems. Sorry if this is misleading. When I stop it, I just stop the requests going to the cluster; I don't change the cluster or the applications/services living in it.

The build-up to 4.3GB happens about an hour and a half after starting to simulate the systems. When the memory usage is at 95% it drops, sometimes to 40% total and sometimes to 60% or more. This happens with no action on my side. At this point I can see that the amount of non-paged pool bytes has dropped significantly. I can also reproduce this by stopping the simulator, so that there are no requests posting the statuses. On each request I update the current status stored in a reliable dictionary; I don't add new data.

Also every time I create a cluster I set the following settings via resources.azure.com: KtlLogger:

Diagnostics:

Settings in the Settings.xml file of the problematic stateful reliable service:

Triplebg commented 5 years ago

Today I made tests with the newest SF runtime version:

The problem still occurs. Is there anything else I can try?

evadamsms commented 5 years ago

Ferrying this response along for Sunil, due to a delay in getting his GitHub account added to the organization:

Also every time I create a cluster I set the following settings via resources.azure.com: KtlLogger: SharedLogSizeInMB = 4096 WriteBufferMemoryPoolMinimumInKB = 16384 WriteBufferMemoryPoolMaximumInKB = 16384

[Ref: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-fabric-settings#ktllogger]

You also need to set AutomaticMemoryConfiguration to 0 (the default is 1) for the above changes to take effect. I think it should help.
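For reference, a sketch of how that fragment might look in the cluster resource's fabricSettings on resources.azure.com, using the values mentioned earlier in this thread:

"fabricSettings": [
  {
    "name": "KtlLogger",
    "parameters": [
      { "name": "AutomaticMemoryConfiguration", "value": "0" },
      { "name": "SharedLogSizeInMB", "value": "4096" },
      { "name": "WriteBufferMemoryPoolMinimumInKB", "value": "16384" },
      { "name": "WriteBufferMemoryPoolMaximumInKB", "value": "16384" }
    ]
  }
]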

But in case you hit this again, would it be possible to share a kernel memory dump from the machine when the memory usage is high? This will require you to force-crash the machine. If you choose to do that, please also configure the driver verifier special pool and pool tracking options for ktllogger.sys beforehand: verifier.exe /flags 0x00100009 /driver ktllogger.sys. Please note that changing the verifier settings may require a reboot.

Triplebg commented 5 years ago

Sorry for the late response!

I've set AutomaticMemoryConfiguration to 0 as advised and I have been running tests. I also removed CheckpointThresholdInMB and MaxAccumulatedBackupLogSizeInMB from the Settings.xml file of the service.

I have tested my solution with different values for WriteBufferMemoryPoolMinimumInKB and WriteBufferMemoryPoolMaximumInKB. The good news is that after an initial period of about an hour or so, the non-paged pool size stabilizes and stays flat until the end of my test. Its size varies depending on the values of the above-mentioned settings.

However, the memory consumption of the process that hosts the stateful services now grows constantly and sometimes reaches >2.5GB on every node after several hours. It does not stop growing; it just sometimes drops by about 10% and continues growing until I delete the cluster. Debugging its dump file with "!dumpheap -stat" shows the following or similar:

Count       TotalSize        Class Name
632797      106309896        SystemStatusInfo
833267      146654992        DeviceStatusInfo
4661662     287967186        String
2829337     1729883058       Free

SystemStatusInfo and DeviceStatusInfo are the classes I use for storing the information I need in the reliable collections. Actually, in a service I have one instance of the SystemStatusInfo class in my reliable dictionary and up to 3 (but most of the time only 1, which is dequeued almost immediately) in my reliable queue. Each SystemStatusInfo instance contains an array with several instances of DeviceStatusInfo, depending on the number of devices in a system. So I don't know why the count of the objects is so high in a process that should "hold" 750 replicas. Also, as you can see, most of the memory is "Free".
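For context, the shape being described is roughly the following (a sketch only; apart from SystemId and the DeviceStatusInfo array, the members and attributes are assumptions):

using System.Runtime.Serialization;

// Sketch of the data shape described above: one SystemStatusInfo per system,
// each holding an array of per-device statuses. DeviceId/StatusCode are illustrative.
[DataContract]
public class DeviceStatusInfo
{
    [DataMember] public string DeviceId { get; set; }
    [DataMember] public int StatusCode { get; set; }
}

[DataContract]
public class SystemStatusInfo
{
    [DataMember] public string SystemId { get; set; }
    [DataMember] public DeviceStatusInfo[] Devices { get; set; }
}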

Before changing AutomaticMemoryConfiguration to 0, the host process never showed such high memory consumption. Most of the time I observed it, the values were between 500MB and 600MB!

MedAnd commented 5 years ago

Ping - any update / progress on this issue?

githubska commented 5 years ago

Hi, sorry for the delay. A big chunk of the memory is marked as free. Garbage collection should happen on this memory at some point. Does it stay like that forever, or is it ultimately freed?

Triplebg commented 5 years ago

Yes, it is freed, but after a long period of time - around 2 hours. Is this normal?

Regarding the other types of objects that build up - SystemStatusInfo, which contains DeviceStatusInfo - I noticed that most of them are contained in System.Fabric.Store.TInsertedItem and System.Fabric.Store.TDeletedItem.

derekeom commented 4 years ago

The cause may be:

If that's the case, we can try to increase the checkpoint rate with several options and see if any of them help (probably try them one at a time and see what works and what doesn't):

You should also avoid calling GC.Collect(), as it would probably push the objects to the gen2 heap faster, where I think GC compaction is slower.

MedAnd commented 4 years ago

My project experienced these undocumented memory behaviours, and eventually we used a combination of CheckpointThresholdInMB set to 1MB and knowledge of which SF APIs do not cache data to control unbounded memory consumption under high load. Hopefully Reliable Collections will allow configuration of this runtime behaviour in a future release...

Triplebg commented 4 years ago

Yes, I have already set CheckpointThresholdInMB to 1MB, and it improves the memory consumption but does not solve the problem. I have also tried setting LogTruncationIntervalSeconds to 60, but it had no effect. I'll try different values and come back as soon as I have results.

Sorry if I didn't understand correctly, but is checkpointing related to the build-up of memory marked as "Free" (currently this is the larger portion of the memory used by the process)? Or is it related only to the System.Fabric.Store.TInsertedItem and System.Fabric.Store.TDeletedItem objects?

Also, which SF APIs do not cache data? I could try using them and see if there is any difference.

MedAnd commented 4 years ago

@Triplebg - from my understanding of how the internals work, most ReliableCollection APIs such as Add, Get, etc. consume memory as part of caching... this memory is then held onto by Service Fabric and not released until checkpointing... some memory seems to never be released. By switching one of our calls from Get to Key Exists (which does not cache), we reduced memory consumption from 10GB+ to the 3GB range... still a long way from ideal though...
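To make that Get vs. "key exists" substitution concrete, here is a sketch (reusing the statusDictionary from the PostStatus method earlier in the thread; whether ContainsKeyAsync avoids the cache in a given runtime version is worth verifying):

using (ITransaction tx = this.StateManager.CreateTransaction())
{
    // Value read - materializes the value and, per the description above, populates the cache:
    // ConditionalValue<SystemStatusInfo> result =
    //     await statusDictionary.TryGetValueAsync(tx, "lastConsumedStatus");

    // Existence check only - no value is read back:
    bool exists = await statusDictionary.ContainsKeyAsync(tx, "lastConsumedStatus");

    await tx.CommitAsync();
}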

At some stage the following will be added to the official Service Fabric documentation:

ReliableCollection APIs that cache data, memory implications & relevant ReplicatorConfig settings

Triplebg commented 4 years ago

Sorry for the late response. After talking to Azure Support and multiple tests I drastically reduced the memory consumption of the services.

The main thing I learned from the communication with the support was that it is really not a good idea to have a large number of services, each containing a small amount of data! Memory dumps of the application showed that each service had roughly 20KB of actual data and 700KB of logs of changes in the Reliable Collections accumulated by Service Fabric. These may not be exact numbers, but the difference was huge.

To reduce the number of services I combined the processing and saving of multiple systems' statuses into one service by using a kind of partitioning. I also tried using Actors. Both approaches worked well.
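One possible shape for that partitioning, sketched under the assumption of a single Int64-range-partitioned stateful service reached via remoting (the IStatusConsumer interface and the service URI are illustrative, not from this thread):

using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Client;            // ServicePartitionKey
using Microsoft.ServiceFabric.Services.Remoting;          // IService
using Microsoft.ServiceFabric.Services.Remoting.Client;   // ServiceProxy

// Hypothetical remoting contract implemented by the combined StatusConsumer service.
public interface IStatusConsumer : IService
{
    Task PostStatus(SystemStatusInfo status);
}

public static class StatusRouter
{
    private static readonly Uri ServiceUri = new Uri("fabric:/MyApp/StatusConsumer");

    public static IStatusConsumer GetConsumer(string systemId)
    {
        // Map the system id onto the Int64 partition key space; a stable hash
        // should be used in practice (GetHashCode is only illustrative).
        long partitionKey = systemId.GetHashCode();
        return ServiceProxy.Create<IStatusConsumer>(ServiceUri, new ServicePartitionKey(partitionKey));
    }
}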

There are several other settings I used to reduce the memory consumption, but the big difference came from changing the architecture of the services themselves. In the Settings.xml of the service itself:

In the code of the service itself:

Cluster Settings:

Now, since I have multiple systems per service - sometimes up to 1500 - the StateManager.GetOrAddAsync method sometimes throws timeout exceptions when I try to get the reference to the Reliable Collection I need. How should I use it for frequent requests? Is it safe to create the Reliable Collection in RunAsync or somewhere else and then keep its reference in a property of the service class? Can I then just use the property instead of calling StateManager.GetOrAddAsync every time?
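For what it's worth, the pattern described in that last question would look roughly like this (a sketch only; the Service Fabric docs indicate that a reliable collection reference obtained from GetOrAddAsync can be cached for the replica's lifetime, but that is worth confirming for this scenario):

// Sketch: resolve the reliable dictionary once and cache the reference,
// instead of calling StateManager.GetOrAddAsync on every request.
private IReliableDictionary<string, SystemStatusInfo> statusDictionary;

protected override async Task RunAsync(CancellationToken cancellationToken)
{
    this.statusDictionary = await this.StateManager
        .GetOrAddAsync<IReliableDictionary<string, SystemStatusInfo>>("status");

    // ... existing RunAsync logic ...
}

public async Task PostStatus(SystemStatusInfo status)
{
    // If requests can arrive before RunAsync has run, a null check or lazy
    // initialization of statusDictionary is still needed here.
    using (ITransaction tx = this.StateManager.CreateTransaction())
    {
        await this.statusDictionary.AddOrUpdateAsync(
            tx, "lastConsumedStatus", key => status, (key, oldValue) => status);
        await tx.CommitAsync();
    }
}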

craftyhouse commented 3 years ago

Closing this, as it looks more like questions at this point rather than an issue/bug. Please post your questions to https://docs.microsoft.com/en-us/answers/topics/azure-service-fabric.html