microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License
3.03k stars 399 forks

Fabric.exe Continuously Using About 30% CPU on One Node Out of Five on Primary (Seed) Nodes After 7.1.417.9590 Update #983

Open brianstringfellow opened 4 years ago

brianstringfellow commented 4 years ago

After updating our Service Fabric runtime from 7.0.466.9590 to 7.1.417.9590, Fabric.exe continuously uses about 30% CPU on one of the five primary (seed) nodes. We observed the same CPU consumption on seven different clusters after each was updated. The affected node sometimes changes over time, for example when the underlying VM is restarted.

Example from task manager: image

CPU Usage Chart When Update Deployed: image

Is the higher CPU usage a known problem?

aricamf commented 4 years ago

I see exactly the same behavior on my side. Fabric.exe has been using around 20-40% CPU since I upgraded to 7.1.417.9590.

gkhanna79 commented 4 years ago

@brianstringfellow @aricamf Can you confirm if your cluster has applications that specify certificates for Endpoints? We had an issue in that space that has been fixed in the upcoming CU.

@dragav FYI

brianstringfellow commented 4 years ago

@gkhanna79 @dragav Yes, we have applications that specify certificates for endpoints. All of our frontend applications use port 443.

gkhanna79 commented 4 years ago

Do you have performance counters for the node when the CPU is high?

brianstringfellow commented 4 years ago

@gkhanna79 The only performance counters enabled are in the WadCfg:

"PerformanceCounters": {
  "PerformanceCounterConfiguration": [
    {
      "counterSpecifier": "\\Processor(_Total)\\% Processor Time",
      "sampleRate": "PT15S",
      "unit": "Percent"
    },
    {
      "counterSpecifier": "\\Memory\\Available MBytes",
      "sampleRate": "PT15S"
    }
  ]
}

It is interesting to see the higher CPU usage shift to other nodes over time. image
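
The counters above measure the whole machine, so they cannot attribute the load to Fabric.exe by themselves. A minimal sketch of a per-process entry that could be added to the same `PerformanceCounterConfiguration` array, assuming the standard Windows `Process` counter object and assuming the instance is named `Fabric` (the process name without the `.exe` extension); the sample rate simply mirrors the existing entries:

```json
{
  "counterSpecifier": "\\Process(Fabric)\\% Processor Time",
  "sampleRate": "PT15S",
  "unit": "Percent"
}
```

Note that `\Process(*)\% Processor Time` is reported relative to a single core, so values can exceed 100 on multi-core VMs and will not match the Task Manager percentage directly.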

jagilber commented 4 years ago

@brianstringfellow This may be related to a known issue: https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Known_Issues/Service%20Fabric%207.1%20High%20CPU%20Fabric.exe%20One%20Node.md