y0zg opened this issue 5 months ago (state: Open)
Hi @y0zg, are you seeing descheduler remove pods with other status codes besides Error, ContainerStatusUnknown, and Completed? And are the pods with these statuses at least 3600 seconds old, as your minPodLifetimeSeconds requires?
@knelasevero @a7i @ingvagabund I don't remember exactly how we check pod eviction reasons/statuses. Did we have any updated docs on what exactly is covered?
The problem is that I don't see descheduler removing pods at all. I tried setting the lifetime to 60 seconds with a frequently running CronJob.
One more interesting fact: in the log I see this warning while running version 0.27 on Kubernetes 1.27:
W0118 16:51:01.043860 1 descheduler.go:127] Warning: Descheduler minor version 27 is not supported on your version of Kubernetes 1.27+. See compatibility docs for more info: https://github.com/kubernetes-sigs/descheduler#compatibility-matrix
I also tried descheduler version 0.28, and with it the compatibility warning above is not present.
Below is the rendered Helm template for the ConfigMap:
---
# Source: descheduler/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: release-name-descheduler
  namespace: default
  labels:
    app.kubernetes.io/name: descheduler
    helm.sh/chart: descheduler-0.27.1
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.27.1"
    app.kubernetes.io/managed-by: Helm
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    ignorePvcPods: true
    profiles:
    - name: ProfileName
      pluginConfig:
      - args:
          maxPodLifeTimeSeconds: 60
          states:
          - ContainerStatusUnknown
          - Completed
          - Error
          - Evicted
        name: PodLifeTime
      - args:
          minPodLifetimeSeconds: 60
          reasons:
          - CreateContainerConfigError
          - Error
          - Completed
          - Evicted
          - ContainerStatusUnknown
        name: RemoveFailedPods
      plugins:
        deschedule:
          enabled:
          - RemoveFailedPods
    strategies:
      HighNodeUtilization:
        enabled: true
        params:
          namespaces:
            exclude:
            - kube-system
            - jenkins
            - logs
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
      LowNodeUtilization:
        enabled: false
        params:
          nodeResourceUtilizationThresholds:
            targetThresholds:
              cpu: 50
              memory: 50
              pods: 50
            thresholds:
              cpu: 20
              memory: 20
              pods: 20
      RemoveDuplicates:
        enabled: true
      RemoveFailedPods:
        enabled: true
        params:
          failedPods:
            reasons:
            - Failed
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            includingInitContainers: true
            podRestartThreshold: 100
      RemovePodsViolatingInterPodAntiAffinity:
        enabled: true
      RemovePodsViolatingNodeAffinity:
        enabled: true
        params:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
      RemovePodsViolatingNodeTaints:
        enabled: true
      RemovePodsViolatingTopologySpreadConstraint:
        enabled: true
        params:
          includeSoftConstraints: false
The warning message about compatibility was addressed here; the fix landed in v0.28.1 and v0.29.0, but not in v0.27.x.
As for your second question, RemoveFailedPods only looks at the pod phase reason, not at container statuses. For example, CreateContainerConfigError is not a pod phase but a container status. Additionally, the reason Completed will not work with RemoveFailedPods, because this strategy only looks at pods in the Failed phase; the pod phase for Completed pods is always Succeeded. All of that to say that you may want to use PodLifeTime instead, since it checks both pod status reasons and container status reasons (ref).
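To make the distinction concrete, here is a hand-written pod status fragment (illustrative only, not taken from this cluster): the fields RemoveFailedPods filters on sit at the top level of status, while a string such as CreateContainerConfigError only ever appears under containerStatuses.

status:
  phase: Failed            # RemoveFailedPods only considers pods in this phase
  reason: Evicted          # pod-level reason, which the reasons filter matches per the above
  containerStatuses:
  - name: app
    state:
      waiting:
        reason: CreateContainerConfigError   # container-level reason; not a pod phase or pod reason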
Question for maintainers: we should either merge the two strategies and formally retire RemoveFailedPods, or be consistent in the reasons that we check (check pod status and container status in both strategies).
Question for maintainers - we should either merge the two strategies and formally retire RemoveFailedPods
A similar request for merging strategies is in https://github.com/kubernetes-sigs/descheduler/issues/1169. It is worth extending it with RemoveFailedPods for further analysis and discussion.
Since I shared the config (it may well contain incorrect settings, as I was testing various options), could you check what should be corrected in order to delete "failed" pods, i.e. pods with status Error?
Much appreciated!
I stumbled upon this because I want to clean up some "Completed" pods which are not owned by Jobs. A first try with "RemoveFailedPods" ignored the "Completed" pods, as you mentioned; maybe a hint in the plugin documentation could make this clear. A second try using "PodLifeTime" gives me the following errors:
- name: "PodLifeTime"
  args:
    maxPodLifeTimeSeconds: 604800 # 7 days
    states:
    - "Completed"
E0209 11:23:10.180425 1 server.go:96] "descheduler server" err="in profile ProfileName: states must be one of [Running Pending PodInitializing ContainerCreating ImagePullBackOff]"
E0209 11:23:10.180523 1 run.go:74] "command failed" err="in profile ProfileName: states must be one of [Running Pending PodInitializing ContainerCreating ImagePullBackOff]"
- name: "PodLifeTime"
  args:
    maxPodLifeTimeSeconds: 604800 # 7 days
    podStatusPhases:
    - "Completed"
E0209 11:26:43.971091 1 server.go:96] "descheduler server" err="failed decoding descheduler's policy config \"/policy-dir/policy.yaml\": strict decoding error: unknown field \"podStatusPhases\""
E0209 11:26:43.971142 1 run.go:74] "command failed" err="failed decoding descheduler's policy config \"/policy-dir/policy.yaml\": strict decoding error: unknown field \"podStatusPhases\""
descheduler: v0.29.0, k8s: v1.28.6
Anything else I can try?
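For what it is worth, here is a minimal sketch of a PodLifeTime profile that should pass the v0.29 validation quoted above (my assumption: the v1alpha2 policy API; ProfileName and the 7-day value are placeholders). It leaves out states entirely, since that validator only accepts Running, Pending, PodInitializing, ContainerCreating and ImagePullBackOff there, so the plugin would consider any pod older than maxPodLifeTimeSeconds:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: ProfileName
  pluginConfig:
  - name: "PodLifeTime"
    args:
      maxPodLifeTimeSeconds: 604800 # 7 days; any pod older than this becomes an eviction candidate
      # no states list here: the v0.29 validator quoted above rejects Completed as a state
  plugins:
    deschedule:
      enabled:
      - "PodLifeTime"

Note this is only a starting point: without a states filter it also targets long-running healthy pods, so it may be too broad for cleaning up just the Completed ones.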
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
descheduler version: v0.27.1, helm chart version: 0.27.1
Does this issue reproduce with the latest release? Haven't tested.
Which descheduler CLI options are you using?
Please provide a copy of your descheduler policy config file: helm values file
What k8s version are you using (kubectl version)? k8s version: v1.27.8-eks-8cb36c9
I don't see that descheduler removes pods with status Error, ContainerStatusUnknown, or Completed.