Open ChesneyMark opened 6 years ago
Thanks for reporting the issue!
Could you share some details about the requests flowing through the reverse proxy, for example a sample URL, the HTTP verbs used (GET/PUT), and body sizes? What is the cluster size: number of nodes, applications, services, and endpoints? Could you also share a dump of FabricApplicationGateway.exe and the trace logs? Opening a support ticket will help get these routed to us so we can troubleshoot.
thanks Kavya
Apologies for the late reply. I will need to seek permission from our customer before I can release some of this information. Also, most of our traffic now bypasses the reverse proxy, so to capture a trace we would need to change this back to the way it was. I will need their permission for that as well, as their environment has been much more stable since we bypassed the reverse proxy. The information that I can give you is as follows:
• The cluster is a three-node on-premise cluster on an internal secured network running Windows Server 2016 with .NET 4.7.1.
• We have 17 Stateless Microservices. Each service will typically have a single instance running on each node.
• We have 3 Stateful Microservices. One of these is a singleton instance; the other two are set up with 3 primary partitions and 2 secondary replicas.
• We have 2 Actor services. These are singleton instances using stateful storage.
o One of these is for basic Xml configuration, the other for cached data shared by the different Microservices for the same application.
• All our Microservices are based on Kestrel and implement Web API interfaces (excluding our Actors) with dynamic port numbers. This means our listener URLs are long. I highlight this because I know that in earlier 6.x versions this caused a memory leak.
o All our Microservices support GET/POST/DELETE verbs with either Xml or Json messaging.
o We have one Microservice that currently supports PUT verbs for uploading patch img files into reliable storage. This upload occurs very rarely, however, and was not used during any of the incidents with the application gateway. The img file size is approximately 80MB.
• Internally in Service Fabric, our Microservices send requests via the HttpClient object and up until recently were using the reverse proxy to do this so Service Fabric would distribute the load.
o Now they use FabricClient to find the local instance and then send the request directly.
• We do have 1 Microservice that also implements a WebSocket server instance. This Microservice is set up with three partitions and currently has 900 WebSocket connections spread across the three nodes, using the Int64 ranged partitioning scheme for generating the partition keys.
o These connections come from embedded devices that sit outside of the cluster and are still going through the reverse proxy.
• There are also Windows PCs that access the Microservices using the HttpClient objects, talking to our Microservices. These again simply use our Web API interfaces implemented in the Microservices.
o There will be no more than 100 PCs accessing the cluster at any one time, and these are typically simple GETs and POSTs with small amounts of data in the form of either XML or Json.
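To illustrate the Int64 ranged partitioning mentioned above, here is a minimal Python sketch (not our production code) of how a partition key maps to one of three partitions and how a request through the reverse proxy could be addressed. It assumes the full Int64 key space is split evenly across the partitions; the real ranges come from the service's configured low and high keys. The application name, service name, and path are hypothetical, and port 19081 is the usual reverse proxy default, which may differ in a given cluster.

```python
from urllib.parse import urlencode

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def partition_ranges(count, low=INT64_MIN, high=INT64_MAX):
    """Split [low, high] into `count` contiguous, near-equal inclusive ranges.

    Assumption: an even split. A real service uses whatever low/high keys
    and partition count it was created with.
    """
    total = high - low + 1
    base, extra = divmod(total, count)
    ranges = []
    start = low
    for i in range(count):
        # Spread any remainder over the first `extra` ranges.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def partition_index(key, ranges):
    """Return the index of the range that contains `key`."""
    for i, (lo, hi) in enumerate(ranges):
        if lo <= key <= hi:
            return i
    raise ValueError(f"key {key} is outside every partition range")


def reverse_proxy_url(app, service, path, key,
                      scheme="http", host="localhost", port=19081):
    """Build a reverse proxy URL that targets the partition owning `key`."""
    query = urlencode({"PartitionKey": key, "PartitionKind": "Int64Range"})
    return f"{scheme}://{host}:{port}/{app}/{service}/{path.lstrip('/')}?{query}"


if __name__ == "__main__":
    ranges = partition_ranges(3)
    for device_key in (INT64_MIN, 0, INT64_MAX):
        print(device_key, "-> partition", partition_index(device_key, ranges))
    # "MyApp", "DeviceGateway" and "api/status" are hypothetical names.
    print(reverse_proxy_url("MyApp", "DeviceGateway", "api/status", 0))
```

With three partitions, each connection's key falls into exactly one range, so the reverse proxy resolves and forwards each request to the single partition that owns that key.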
I have made a few observations that might have some bearing on this issue.
The following problem has only happened with the Service Fabric runtime 6.3.162.9494. We have not experienced this with either 6.1.x or 6.2.x runtimes.
I have attached a screenshot from task manager from one of the nodes. This screenshot is from scenario 1 described above.