microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

Stateless Service High Memory Consumption #714

Open aliazamrana opened 6 years ago

aliazamrana commented 6 years ago

I have a 5-node cluster with a .NET API stateful service that is consumed by my dashboard and multiple guest executables. To get the URL of that service, I have a stateless service running on all 5 nodes. It is also a .NET API, and its sole purpose is to return the URL of the stateful service. After only 3-4 days of running, this stateless service uses a huge amount of RAM on the nodes, which keeps the guest executables from running properly. It takes more than 2600 MB on each node.

I am not sure why the service is taking this much memory. This is the API controller I am using in the service:

        [ResponseCache(Location = ResponseCacheLocation.None, NoStore = true)]
        [HttpGet]
        public async Task<string> Get()
        {
            string endPoint="";
            var fabricClient = new FabricClient();
            var apps = await fabricClient.QueryManager.GetApplicationListAsync();
            foreach (var app in apps)
            {
                System.Diagnostics.Debug.WriteLine($"Discovered application:'{app.ApplicationName}");

                var services = await fabricClient.QueryManager.GetServiceListAsync(app.ApplicationName);
                foreach (var service in services)
                {
                    System.Diagnostics.Debug.WriteLine($"Discovered Service:'{service.ServiceName}");

                    var partitions = await fabricClient.QueryManager.GetPartitionListAsync(service.ServiceName);
                    if (service.ServiceKind != System.Fabric.Query.ServiceKind.Stateful )
                    {
                        continue;
                    }
                    else if (!service.ServiceTypeName.Contains("TradingController"))
                    {
                        continue;
                    }
                    foreach (var partition in partitions)
                    {
                        System.Diagnostics.Debug.WriteLine($"Discovered Service Partition:'{partition.PartitionInformation.Kind} {partition.PartitionInformation.Id}");

                        ServicePartitionKey key=new ServicePartitionKey();

                        switch (partition.PartitionInformation.Kind)
                        {
                            case ServicePartitionKind.Singleton:
                                key = ServicePartitionKey.Singleton;
                                break;
                            case ServicePartitionKind.Int64Range:
                                var longKey = (Int64RangePartitionInformation)partition.PartitionInformation;
                                key = new ServicePartitionKey(longKey.LowKey);
                                break;
                            case ServicePartitionKind.Named:
                                var namedKey = (NamedPartitionInformation)partition.PartitionInformation;
                                key = new ServicePartitionKey(namedKey.Name);
                                break;
                            default:
                                break;

                        }

                        var resolver = new ServicePartitionResolver();

                        var resolved = await resolver.ResolveAsync(service.ServiceName, key, CancellationToken.None);
                        if (service.ServiceKind == System.Fabric.Query.ServiceKind.Stateful)
                        {
                            endPoint = resolved.Endpoints.FirstOrDefault().Address;                           
                        }
                    }
                }
            }

            int index = endPoint.IndexOf("http");
            string url = endPoint.Substring(index);

            var charsToRemove = new string[] { "}", "\\", "\"" };

            foreach (var c in charsToRemove)
            {
                url = url.Replace(c, string.Empty);
            }

            return url;
        }

This controller is hit continuously. I added the no-cache attribute to test whether the issue came from too many responses being cached, but it did not help: memory consumption is still high.

I hope this issue gets answered and not just left over like many others I posted.

ashishnegi commented 6 years ago

@aliazamrana You will need to take a memory dump of your app and see which objects are taking the memory. Let's start the investigation from there.

aliazamrana commented 6 years ago

@ashishnegi I can try, but the process is using so much RAM that it gets stuck partway through taking the dump. I can get a mini dump via Process Explorer, though.

ashishnegi commented 6 years ago

@aliazamrana Let's start with that. Try attaching to the process in Visual Studio or a similar tool, or take dumps and analyze them there, to see which objects are taking the most memory.

aliazamrana commented 6 years ago

image @ashishnegi This is what I get, although my service's only purpose is to handle web requests. This is after I tried to update my controllers to overcome any strong references:

        [ResponseCache(Location = ResponseCacheLocation.None, NoStore = true)]
        [HttpGet]
        public async Task<string> Get()
        {
            string endPoint="";
            //var fabricClient = new FabricClient();
            using (var fabricClient = new FabricClient())
            {
                var apps = await fabricClient.QueryManager.GetApplicationListAsync();
                foreach (var app in apps)
                {
                    System.Diagnostics.Debug.WriteLine($"Discovered application:'{app.ApplicationName}");

                    var services = await fabricClient.QueryManager.GetServiceListAsync(app.ApplicationName);
                    foreach (var service in services)
                    {
                        System.Diagnostics.Debug.WriteLine($"Discovered Service:'{service.ServiceName}");

                        var partitions = await fabricClient.QueryManager.GetPartitionListAsync(service.ServiceName);
                        if (service.ServiceKind != System.Fabric.Query.ServiceKind.Stateful)
                        {
                            continue;
                        }
                        else if (!service.ServiceTypeName.Contains("TradingController"))
                        {
                            continue;
                        }
                        foreach (var partition in partitions)
                        {
                            System.Diagnostics.Debug.WriteLine($"Discovered Service Partition:'{partition.PartitionInformation.Kind} {partition.PartitionInformation.Id}");

                            ServicePartitionKey key = new ServicePartitionKey();

                            switch (partition.PartitionInformation.Kind)
                            {
                                case ServicePartitionKind.Singleton:
                                    key = ServicePartitionKey.Singleton;
                                    break;
                                case ServicePartitionKind.Int64Range:
                                    var longKey = (Int64RangePartitionInformation)partition.PartitionInformation;
                                    key = new ServicePartitionKey(longKey.LowKey);
                                    break;
                                case ServicePartitionKind.Named:
                                    var namedKey = (NamedPartitionInformation)partition.PartitionInformation;
                                    key = new ServicePartitionKey(namedKey.Name);
                                    break;
                                default:
                                    break;
                                    //throw new ArgumentOutOfRangeException("partition.PartitionInformation.Kind");
                            }

                            var resolver = new ServicePartitionResolver();

                            var resolved = await resolver.ResolveAsync(service.ServiceName, key, CancellationToken.None);
                            //foreach (var endpoint in resolved.Endpoints)
                            //{
                            //    System.Diagnostics.Debug.WriteLine($"Discovered Service Endpoint:'{endpoint.Address}");
                            //}

                            if (service.ServiceKind == System.Fabric.Query.ServiceKind.Stateful)
                            {
                                endPoint = resolved.Endpoints.FirstOrDefault().Address;
                            }
                        }
                        partitions = null;
                    }
                    services = null;

                }
                apps = null;
            }

            int index = endPoint.IndexOf("http");
            string url = endPoint.Substring(index);

            var charsToRemove = new string[] { "}", "\\", "\"" };

            foreach (var c in charsToRemove)
            {
                url = url.Replace(c, string.Empty);
            }
            charsToRemove = null;
            GC.Collect();
            return url;
        }

And this is the POST controller, which I guess is also affected somehow:

        [ResponseCache(Location = ResponseCacheLocation.None, NoStore = true)]
        [Route("Redirect/{*endurl}")]
        [HttpPost]
        public async Task<JsonResult> Post(string endurl,[FromBody] JToken mT4Result)
        {
            var urlController = new URLFinderController();
            string url = await urlController.Get();
            urlController = null;
            url += "/" + endurl;
            JObject jObject = JObject.FromObject(mT4Result);
            var client = new RestClient(url);
            //var request = new RestRequest(endurl, Method.POST);
            var request = new RestRequest(Method.POST); //RestRequest(Method.POST);
            //request.Timeout = 

            request.RequestFormat = DataFormat.Json;
            request.AddBody(jObject.ToString(Newtonsoft.Json.Formatting.None));
            var reply = client.ExecuteAsPost(request, "Post");

            //request.AddBody(jObject);
            client = null;
            request = null;
            var content = reply.Content.Clone();
            reply = null;
            GC.Collect();
            return new JsonResult(content);
        }

ashishnegi commented 6 years ago

although this is after I tried to update my controllers to overcome any strong references

@aliazamrana If I understand the memory snapshot correctly, these objects account for only about 4 MB. Do you mean that memory usage decreased after you took care of the strong references? Is this the right dump, from the process that is taking 2 GB of RAM? It is still not clear which objects are taking the memory.

You can try profiling with and without the Service Fabric code to rule out other issues. Let the service reach gigabytes of memory before taking the dump.

One optimization: You should not create using (var fabricClient = new FabricClient()) again. Just store this in some static variable.
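
A minimal sketch of that suggestion, assuming the service already references the Service Fabric packages that provide `System.Fabric.FabricClient` and `Microsoft.ServiceFabric.Services.Client.ServicePartitionResolver`. The `FabricClients` class and its member names are hypothetical and only illustrate the idea of creating the client once per process instead of once per request.

```csharp
using System;
using System.Fabric;
using Microsoft.ServiceFabric.Services.Client;

// Hypothetical holder class (not from the original post): one FabricClient
// and one resolver shared by every request handled in this process.
public static class FabricClients
{
    // Lazy<T> is thread-safe by default, so concurrent first requests
    // still end up with a single FabricClient instance.
    private static readonly Lazy<FabricClient> Client =
        new Lazy<FabricClient>(() => new FabricClient());

    // Shared client; do not wrap it in a using block at the call site.
    public static FabricClient Shared => Client.Value;

    // GetDefault() returns the process-wide default resolver, so the
    // controller no longer needs "new ServicePartitionResolver()" per partition.
    public static ServicePartitionResolver Resolver =>
        ServicePartitionResolver.GetDefault();
}
```

In the controller, the calls would then look like `await FabricClients.Shared.QueryManager.GetApplicationListAsync()` and `await FabricClients.Resolver.ResolveAsync(service.ServiceName, key, CancellationToken.None)`. Whether this removes the growth seen in this issue is not confirmed by the thread; it only avoids repeatedly creating and tearing down clients.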

aliazamrana commented 6 years ago

@ashishnegi This snapshot was taken after one night of running the updated code. Right after the cluster deployment, once the services were running and everything had started, this service used 125-180 MB, which I guess is normal. After just 14 hours, memory consumption was 775 MB, and the snapshot is from that time.

Sorted on Size image

Sorted on Inclusive Size image

Sorted on Counts image

And about this

One optimization: You should not create using (var fabricClient = new FabricClient()) again. Just store this in some static variable.

I tried that, but the using block was somehow more efficient and used a little less memory when I was debugging on a local cluster, so I went with that approach. I am not sure why you suggest it is not a good approach. If I am not wrong, this way there is no strong reference to the object. With a static object, if a service goes down and the object is created again, I am not sure the garbage collector will be able to remove that strong reference. Anyway, I will try your suggestion for another day to see what happens.

ashishnegi commented 6 years ago

To me, these numbers are still lower than what I expected. Can you constrain the app to use only 400 MB of RAM and increase the load until it reaches that limit? If it recovers memory, the growth is just long-lived objects that the GC had not collected yet. If not, then yes, something is holding that memory.

If the memory is not managed, this might be a leak in native code. Can you run a memory analysis of the native code? Finding out what percentage of memory is managed versus native, and how that changes over time, will help as well.
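
One rough way to track the managed versus native split over time is to log a few counters from inside the service. The sketch below uses only standard .NET APIs (`GC.GetTotalMemory` and `System.Diagnostics.Process`); the `MemoryBreakdown` class and its member names are hypothetical. `GC.GetTotalMemory` is only an estimate of the managed heap, so treat the numbers as a trend indicator rather than an exact accounting.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not from the original post): samples managed heap
// size next to the process-wide memory counters.
public static class MemoryBreakdown
{
    public struct Sample
    {
        public long ManagedBytes;    // estimate of the managed (GC) heap
        public long PrivateBytes;    // memory committed privately by the process
        public long WorkingSetBytes; // physical memory currently in use
    }

    public static Sample Take()
    {
        using (Process process = Process.GetCurrentProcess())
        {
            return new Sample
            {
                // false = do not force a collection before measuring
                ManagedBytes = GC.GetTotalMemory(false),
                PrivateBytes = process.PrivateMemorySize64,
                WorkingSetBytes = process.WorkingSet64
            };
        }
    }

    public static string Describe(Sample s)
    {
        const double MB = 1024.0 * 1024.0;
        return $"managed={s.ManagedBytes / MB:F1} MB, " +
               $"private={s.PrivateBytes / MB:F1} MB, " +
               $"workingSet={s.WorkingSetBytes / MB:F1} MB";
    }
}
```

Logging `MemoryBreakdown.Describe(MemoryBreakdown.Take())` every few minutes would show whether the managed figure stays flat while the private bytes keep climbing, which would point toward native allocations rather than managed objects.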

aliazamrana commented 6 years ago

OK, I will try to limit the usage, but for now the process seems to stay at 780 MB.

aliazamrana commented 6 years ago

@ashishnegi This is the 2.91 GB dump file for the service process, and it still shows very little memory in use. I am not sure why the process is taking this much. image

ashishnegi commented 6 years ago

@aliazamrana Let's try a native (C++) dump analyzer to see a similar snapshot for native objects.

MedAnd commented 6 years ago

Hi @aliazamrana, out of interest, are you using server or desktop GC?

aliazamrana commented 6 years ago

@MedAnd Using a server

MedAnd commented 6 years ago

@aliazamrana - having experienced similar issues you might want to try workstation garbage collection (which is the default):

<gcServer enabled="false"/>

aliazamrana commented 6 years ago

@MedAnd I am currently developing the web services on Service Fabric. Is this something you enable on the development side, or is it done through the Azure management portal when configuring the cluster? I ask because I may need to forward this to a particular person.

MedAnd commented 6 years ago

@aliazamrana - this is done in your project, for example if your App.config looks something like:

image

try setting:

<gcServer enabled="false"/>

aliazamrana commented 6 years ago

@MedAnd It is a .NET Core web service, and I don't have any config files, although it has JSON configuration. I have tried converting this to JSON and placing it in my appsettings.json file to see if this works.

MedAnd commented 6 years ago

@aliazamrana - for .NET Core runtimeOptions, maybe try:

{ "configProperties": { "System.GC.Server": false } }
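
To check which GC mode the process actually runs with, a small startup log line can help. The sketch below relies on the standard `System.Runtime.GCSettings` API; the `GcModeReport` class name is hypothetical. As far as I know, the .NET Core runtime reads `System.GC.Server` from the generated `<app>.runtimeconfig.json` (typically produced from a `runtimeconfig.template.json` file or the `ServerGarbageCollection` project property), not from `appsettings.json`, so a check like this shows whether the setting took effect.

```csharp
using System;
using System.Runtime;

// Hypothetical helper (not from the original post): reports the GC mode
// the runtime selected for this process.
public static class GcModeReport
{
    public static string Describe()
    {
        // IsServerGC is true when server garbage collection is active.
        string mode = GCSettings.IsServerGC ? "server" : "workstation";
        return $"GC mode: {mode}, latency mode: {GCSettings.LatencyMode}, " +
               $"processors: {Environment.ProcessorCount}";
    }
}
```

Writing `GcModeReport.Describe()` to the service log at startup, before and after the configuration change, gives a direct confirmation of which collector is in use.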

`

true

`