microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

[BUG] - CPU Resource Governance not being applied #1493

Closed ggrillo closed 7 months ago

ggrillo commented 7 months ago

Describe the bug: When implementing CpuPercent Resource Governance (RG) for all services deployed to a node, as described in the ASF RG docs, we do not see the governance being applied when node CPU usage is high.

Area/Component: Resource Governance

To Reproduce: Create a CpuPercent RG policy in the service's application manifest as shown below, hit several service endpoints to generate high CPU usage, and observe that the cluster does not apply the requested CpuPercent RG.

```xml
<Policies>
  <ServicePackageResourceGovernancePolicy CpuCoresLimit="1"/>
  <ResourceGovernancePolicy CodePackageRef="Code" CpuPercent="10" MemoryInMBLimit="1024" />
</Policies>
```

Expected behavior: If all the services in the cluster are deployed with a manifest similar to the one above (the ServiceManifestName is unique per service, but the CodePackageRef is not), I would expect the cluster to limit each service's CPU usage to ~10 percent, since they are all contending for CPU resources at the same time.

Observed behavior: The CPU limit is not applied consistently at ~10% across all services. It seems that the first few services to spike the CPU get well over 10%, and the remaining services then contend for CPU well under the 10% threshold.

Service Fabric Runtime Version: v10.0.1949

Environment:

  • Azure
  • OS: Windows 2019
  • Version: 2019.0.1712598402
  • All services are Windows .exe + dlls, we are not using Docker containers

Additional context: I was able to observe cluster CPU RG being applied when I used the "CpuCores" parameter, but this won't work for our deployments because it constrains service node placement. With many services running on the same node, deployments start to fail because no node is available (i.e., all the available CPU has been allocated).
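The placement failure with "CpuCores" comes down to simple arithmetic. A minimal sketch, assuming a hypothetical 4-core node and services that each reserve 1 core (the core counts are illustrative, not taken from our cluster, and `max_placeable` is a made-up helper, not a Service Fabric API):

```python
def max_placeable(node_cores: float, cores_per_service: float) -> int:
    """Count how many services fit on one node when each reserves a fixed number of cores."""
    if cores_per_service <= 0:
        raise ValueError("cores_per_service must be positive")
    return int(node_cores // cores_per_service)


# Hypothetical numbers: a 4-core node and 30 services reserving 1 core each.
services = 30
fit = max_placeable(node_cores=4, cores_per_service=1)
unplaced = max(0, services - fit)
print(f"{fit} services fit on the node, {unplaced} cannot be placed")
```

Once the reserved cores add up to the node's capacity, any further service has nowhere to go on that node, regardless of how little CPU the running services actually use.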

My questions are:

1. Is this the correct RG config for Windows services to implement CpuPercent resource governance?
2. Does CpuPercent or CpuShares RG work for services deployed as Windows executables?
3. The docs are somewhat unclear about what a "container" is, as CpuPercent and CpuShares (which I also tried) can only be applied to "containers". If this means Docker containers only, we have the answer to why it's not working.

Thanks very much, GregG


Assignees: /cc @microsoft/service-fabric-triage

ggrillo commented 7 months ago

The CPU resource governance is not being applied correctly because of the way we deploy our applications to the cluster. We have a 1:1 application/service deployment model: if we have 30 "services", we deploy 30 applications in the cluster, one application per service, as opposed to 1 application with 30 services.

The "CpuPercent" and "CpuShares" RG policies apply at the application/service level. If our deployment model were 1 application with all services under it, "CpuPercent" or "CpuShares" would limit the CPU resources for the services in that application.

Also note that "CpuPercent" and "CpuShares" are requests to limit CPU resources. If no other services in the application are competing for CPU, the cluster will allow the service to use as much CPU as is available, effectively leaving CPU governance to the operating system scheduler.
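To illustrate why a share-style limit only shows up under contention, here is a toy model of weighted, work-conserving sharing. It is only a sketch of the behavior described above, not how Service Fabric or the Windows scheduler is actually implemented. The `allocate_cpu` function, the service names, and all numbers are hypothetical, and weights are assumed to be positive.

```python
def allocate_cpu(weights: dict[str, float],
                 demand: dict[str, float],
                 capacity: float = 100.0) -> dict[str, float]:
    """Split CPU capacity by weight, capping each service at its demand.

    Capacity left unused by a service whose demand is below its weighted
    share is redistributed to the services that still want more.
    """
    alloc = {name: 0.0 for name in weights}
    active = {name for name in weights if demand[name] > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_weight = sum(weights[name] for name in active)
        share = {name: remaining * weights[name] / total_weight for name in active}
        # Services whose demand fits inside their weighted share are fully served.
        satisfied = {name for name in active if demand[name] <= share[name]}
        if not satisfied:
            # Everyone wants more than their share: weights decide the split.
            for name in active:
                alloc[name] = share[name]
            break
        for name in satisfied:
            alloc[name] = demand[name]
            remaining -= demand[name]
        active -= satisfied
    return alloc


equal = {"svcA": 1.0, "svcB": 1.0, "svcC": 1.0}

# Only svcA is busy: it is free to use the whole CPU.
print(allocate_cpu(equal, {"svcA": 100.0, "svcB": 0.0, "svcC": 0.0}))

# All three are busy: the weights now limit each service to a third.
print(allocate_cpu(equal, {"svcA": 100.0, "svcB": 100.0, "svcC": 100.0}))
```

In this model a lone busy service is never throttled. The weights only take effect once several services governed by the same policy scope want CPU at the same time.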