intel / kubernetes-power-manager

Apache License 2.0

Power profiles have allocatable limits set by default #67

Closed sahanirn closed 5 months ago

sahanirn commented 11 months ago

@adorney99 Would like to understand something here: when the base power profiles have been deployed, describing the node's allocatable resources shows that each profile has a certain limit beyond which we cannot allocate. If we try to exceed that limit, a pod requesting the profile remains in the Pending state. How does this work, and is there a way to change or increase the allocatable value for each profile? For example, with a total of 64 cores, the allocatable values after describing the node are:

"power.intel.com/balance-performance": "38",
"power.intel.com/balance-power": "51",
"power.intel.com/performance": "25",
"power.intel.com/shared": "64"
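For context on the Pending behaviour described above: these profile resources act like any Kubernetes extended resource, so once the requests for a profile would exceed the node's allocatable count, the scheduler has nowhere to place the pod and it stays Pending. A minimal sketch of that accounting (plain Go illustration, not the scheduler's actual code; the counts are the ones reported above):

```go
package main

import "fmt"

func main() {
	// Allocatable counts as reported by describing the node above.
	allocatable := map[string]int{
		"power.intel.com/performance": 25,
		"power.intel.com/shared":      64,
	}
	// A hypothetical pod asking for one core more than the cap.
	requested := map[string]int{
		"power.intel.com/performance": 26,
	}
	for res, req := range requested {
		if req > allocatable[res] {
			// No node can satisfy the request, so the pod stays Pending.
			fmt.Printf("%s: requested %d > allocatable %d -> pod stays Pending\n",
				res, req, allocatable[res])
		}
	}
}
```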

Also, when a custom profile was created with an epp value of power, the allocatable value for the custom profile was the total number of cores (64); but when a custom profile was created with an epp value of performance, the allocatable value was the same as the base performance profile, i.e. 25.

adorney99 commented 11 months ago

Hi @sahanirn, the reason we set these resource limits is that having too many cores use a profile lowers the frequency ranges they can meet. For example, putting too many cores in performance means they all try to reach their maximum frequency, which drags down the frequency they actually achieve. Custom profiles with no epp, or an epp of power, get the full core count because epp is what we use to decide how many resources to create, and those two cases allocate the full number of cores: an empty epp could mean anything, and an epp of power is typically used for low frequencies, like the one used for a shared pool. In reality, even with these limits, if you were to allocate most of the resources or use a high shared pool frequency you'd still run into cores not being able to reach their frequency targets, but the limits prevent things from getting too far out of hand.
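For reference, the allocatable counts reported above are consistent with each epp tier capping a fixed fraction of the node's cores. A minimal sketch of that calculation (the ratios and the floor rounding are assumptions inferred from the reported numbers 25/38/51 of 64, not taken from the operator's source):

```go
package main

import "fmt"

func main() {
	totalCores := 64
	// Hypothetical per-epp ratios inferred from the reported allocatables;
	// the authoritative values live in the operator's source code.
	ratios := map[string]float64{
		"performance":         0.40, // floor(64 * 0.40) = 25
		"balance-performance": 0.60, // floor(64 * 0.60) = 38
		"balance-power":       0.80, // floor(64 * 0.80) = 51
		"shared":              1.00, // full node: 64
	}
	for profile, r := range ratios {
		alloc := int(float64(totalCores) * r) // truncation acts as floor here
		fmt.Printf("power.intel.com/%s: %d\n", profile, alloc)
	}
}
```

Under this reading, a blank epp (or an epp of power) simply maps to the 1.00 ratio, which is why those custom profiles advertise all 64 cores.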

I do see your point about pods being stuck in Pending, which is probably less desirable than having pods run at much lower frequencies than requested. A potential future feature would be to let users decide what should happen here instead of deciding for them based on epp values. Thanks for the feedback.

engineerbharath12 commented 11 months ago

@adorney99 Thanks for the response to this query. A couple of follow-up questions based on it:

1) What impact is expected if the majority of the cores in the CPU ask for higher frequencies? If I have a high-demand workload that occupies 70% of the node's capacity and all of its cores need to be tuned for higher frequency, what challenges should we expect for such a workload on the system? Excessive power consumption and heat generation, or power-supply limitations?

2) While you consider the above as a feature request, is it possible to override these default values for now, i.e. change 25 to 40 for performance, or is there any workaround?


adorney99 commented 11 months ago

Hi @engineerbharath12, having a high number of cores tuned for a higher frequency will lead to higher power consumption and, more than likely, increased temperatures as well. Another drawback is that those cores won't be able to reach the frequency range they're aiming for: the higher the target frequency and the larger the number of cores, the lower the actual frequency those cores will be able to achieve. This is much more likely to happen if you have features enabled that allow your CPU to exceed its base frequency.

As a workaround you could use a custom profile with a blank epp. Depending on the system, epp shouldn't be too much of a concern as long as you have the right governor and frequency range.

adorney99 commented 11 months ago

Alternatively, if you're alright with building the images locally, you could change the percentages here to 1.0 and rebuild the images, which should do the trick.

engineerbharath12 commented 10 months ago

@adorney99 Thanks for your response. Understood the reasons behind limiting the performance cores. However, if HPC workloads deployed on that particular server demand a higher frequency across all of their assigned cores, then we should be able to provide that many performance cores to those workloads, right? Say 70% of the node's capacity is needed for HPC workloads that require the performance profile. Do you see any issues with enabling Turbo Boost (Turbo Mode), or any downsides in allocating 70% of the node's capacity to the performance profile?

adorney99 commented 10 months ago

@engineerbharath12 Putting 70% of the node's cores in the performance profile will definitely lower the maximum frequency those cores can achieve (they almost certainly won't reach the actual frequency range set for them), but other than that I don't think there are any downsides if your goal is to have those cores operating at the highest frequency they can manage.

adorney99 commented 5 months ago

Closing this issue: a backlog story was created to either remove these limits or provide users a way to customize them.