microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

FabricApplicationGateway Blows Up #701

Open ChesneyMark opened 6 years ago

ChesneyMark commented 6 years ago

The following problem has only happened with the Service Fabric runtime 6.3.162.9494. We have not experienced this with either 6.1.x or 6.2.x runtimes.

  1. Since upgrading, we find that, completely at random, messages sent through the reverse proxy will start receiving connection refused status codes. This escalates, with more and more messages being refused, until FabricApplicationGateway balloons in memory, going from about 300MB to over 11GB in a matter of seconds. It can occur on one or more nodes at the same time. The only resolution at the moment is to kill FabricApplicationGateway on all nodes in our cluster; once this is done everything goes back to normal and messages are sent through the reverse proxy correctly. (A sketch of a typical reverse proxy call is shown after this list.)
  2. The second issue, which exhibits the same behaviour, occurs when we upgrade our applications and microservices. On more than one occasion while upgrading our applications and services via PowerShell, we have seen FabricApplicationGateway balloon from several hundred MB to 11GB+, causing the upgrade to stall until, again, we kill FabricApplicationGateway on all nodes.
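For context, here is a minimal sketch of what a call through the reverse proxy looks like from one of our services; the application/service names, the path, and the default reverse proxy port 19081 are illustrative assumptions, not the customer's actual configuration:

```csharp
// Illustrative only: the reverse proxy URL format is
// http://localhost:<proxyPort>/<ApplicationName>/<ServiceName>/<api path>.
// Names, path, and port 19081 are assumptions for this sketch.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ReverseProxyCall
{
    static readonly HttpClient Client = new HttpClient();

    static async Task Main()
    {
        string url = "http://localhost:19081/MyApp/MyStatelessService/api/values";

        // Service Fabric's reverse proxy resolves the service endpoint and
        // forwards the request; the caller never sees the dynamic port.
        string body = await Client.GetStringAsync(url);
        Console.WriteLine(body);
    }
}
```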

I have attached a screenshot from Task Manager on one of the nodes; it is from scenario 1 described above.

kavyako commented 6 years ago

Thanks for reporting the issue!

Could you share some details about the nature of the requests flowing through the reverse proxy? Sample URL, GET/PUT, body size, etc.? What is the cluster size: number of nodes, applications, services, endpoints? Also, can you share a dump of FabricApplicationGateway.exe and trace logs? Opening a support ticket will help get these routed to us so we can troubleshoot.

thanks Kavya

ChesneyMark commented 6 years ago

Apologies for the late reply. Some of the information I will need to seek permission from our customer to release. Also, most of our traffic now bypasses the reverse proxy; to capture the trace we will need to change this back to the way it was, and again I will need their permission, as their environment has been much more stable since we bypassed the reverse proxy. The information I can give you is as follows:

- The cluster is a three-node on-premise cluster on an internal secured network running Windows Server 2016 with .NET 4.7.1.
- We have 17 stateless Microservices. Each service typically has a single instance running on each node.
- We have 3 stateful Microservices. One is a singleton instance; the other two are set up with 3 primary partitions and 2 secondary replicas.
- We have 2 Actor services. These are singleton instances using stateful storage.
  - One of these holds basic XML configuration; the other holds cached data shared by the different Microservices of the same application.
- All our Microservices are based on Kestrel and implement Web API interfaces (excluding our Actors) with dynamic port numbers. This means our listeners' URLs are long; I highlight this because I know it caused a memory leak in earlier 6.x versions.
  - All our Microservices support GET/POST/DELETE verbs with either XML or JSON messaging.
  - One Microservice also supports PUT for uploading img patch files into reliable storage. This upload occurs very rarely and was not used during any of the incidents with the application gateway. The img file size is approximately 80MB.
- Internally in Service Fabric, our Microservices send requests via the HttpClient object and, until recently, went through the reverse proxy so that Service Fabric would distribute the load.
  - They now use FabricClient to find the local instance and then send to it directly (a rough sketch of this pattern follows this list).
- We do have 1 Microservice that also implements a WebSocket server instance. It is set up with three partitions and currently has 900 WebSocket connections spread across the three nodes, using the Int64 ranged partitioning scheme to generate the partition keys.
  - These connections come from embedded devices that sit outside the cluster and still go through the reverse proxy.
- There are also Windows PCs that access the Microservices via HttpClient objects, using the Web API interfaces implemented in the Microservices.
  - There will be no more than 100 PCs accessing the cluster at any one time, and these are typically simple GETs and POSTs with small amounts of data in either XML or JSON form.
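To make the "find the local instance and send direct" point above concrete, here is a rough sketch of the pattern. The service name, partition key handling, and the naive parsing of the endpoint JSON are assumptions for illustration, not our production code:

```csharp
// Rough sketch: resolve a partition's endpoint with FabricClient and call it
// directly with HttpClient, instead of routing through the reverse proxy.
// Service name, partition key, and the endpoint parsing below are assumptions.
using System;
using System.Fabric;
using System.Net.Http;
using System.Threading.Tasks;

static class DirectCaller
{
    static readonly FabricClient Fabric = new FabricClient();
    static readonly HttpClient Http = new HttpClient { Timeout = TimeSpan.FromSeconds(30) };

    public static async Task<string> GetAsync(long partitionKey, string path)
    {
        var serviceUri = new Uri("fabric:/MyApp/MyStatefulService"); // hypothetical name

        // Resolve the partition that owns this key; the result carries the
        // current replica endpoints.
        ResolvedServicePartition partition =
            await Fabric.ServiceManager.ResolveServicePartitionAsync(serviceUri, partitionKey);

        // The endpoint address is a small JSON document listing listener URLs.
        // Real code should parse it properly; this extraction is naive and for
        // illustration only.
        string addressJson = partition.GetEndpoint().Address;
        int start = addressJson.IndexOf("http", StringComparison.OrdinalIgnoreCase);
        int end = addressJson.IndexOf('"', start);
        string baseUrl = addressJson.Substring(start, end - start).Replace("\\/", "/");

        return await Http.GetStringAsync(baseUrl.TrimEnd('/') + path);
    }
}
```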

I have made a few observations that might have some bearing on this issue.

  1. Firstly, the same customer has the exact same setup in two other data centres, again on secured networks, yet we have not seen this behaviour there. Both run the exact same version of the Service Fabric runtime, but they are on .NET 4.6.1 rather than .NET 4.7.1.
  2. Secondly, I have noticed that if the HttpClient object times out, it doesn't actually close the underlying connection; it simply cancels the asynchronous task. What impact would this have on the reverse proxy itself? (See the sketch after this list.)
     a. This particular cluster has otherwise been running fine, but the backend database needs some tuning, as it is causing the timeouts; this shouldn't impact Service Fabric itself.
  3. The last time FabricApplicationGateway went bonkers with memory usage, I killed the application instance on that node, then went to the other two nodes and did the same. When I went back to the first node, the gateway quickly ballooned again from a few hundred MB to several GB, and I simply had to kill it a second time. I will talk to our customer's head of operations this week to see what other information, especially traces, I can give you, but I hope this helps.
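Regarding point 2, here is a minimal sketch of the timeout behaviour I mean; the URL and timeout value are illustrative assumptions. When HttpClient.Timeout elapses, the awaited task is cancelled with a TaskCanceledException, but the underlying connection belongs to the handler's connection pool, and whether the proxy-side connection is torn down promptly is exactly the question:

```csharp
// Minimal sketch of the timeout observation: the call is cancelled on the
// client side, but the socket behind it is managed by the handler's pool.
// URL and timeout value are assumptions for illustration.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class TimeoutObservation
{
    static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(10) // deliberately shorter than the slow backend call
    };

    static async Task Main()
    {
        try
        {
            string body = await Client.GetStringAsync(
                "http://localhost:19081/MyApp/MyService/api/slow-query");
            Console.WriteLine(body.Length);
        }
        catch (TaskCanceledException)
        {
            // On .NET Framework 4.7.1 a timeout surfaces here; the task is
            // cancelled, but no explicit close of the underlying connection
            // happens at this point in the calling code.
            Console.WriteLine("Request timed out; task cancelled.");
        }
    }
}
```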