Closed jia2 closed 4 years ago
/kind bug
Issues go stale after 90d of inactivity. Mark the issue as fresh with `/remove-lifecycle stale`. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with `/close`.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale
@jia2 can you try the latest release, v0.18.0, and see if you see the same behavior?
@seanmalloy I tried with v0.18.0; it still doesn't work. Here are the worker nodes in my k8s cluster:

```
[jia@10-105-21-115 (⎈ |abn:kube-system)] ~ $ k top node
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-107-208-158.eu-central-1.compute.internal   1392m        35%    12238Mi         43%
ip-10-107-208-54.eu-central-1.compute.internal    1138m        29%    18999Mi         67%
ip-10-107-208-9.eu-central-1.compute.internal     1809m        46%    17296Mi         61%
ip-10-107-209-10.eu-central-1.compute.internal    1275m        32%    19591Mi         69%
ip-10-107-209-14.eu-central-1.compute.internal    1159m        29%    20616Mi         73%
ip-10-107-209-65.eu-central-1.compute.internal    1815m        46%    18039Mi         64%
ip-10-107-210-10.eu-central-1.compute.internal    1425m        36%    20906Mi         74%
ip-10-107-210-110.eu-central-1.compute.internal   1565m        39%    17132Mi         60%
```
For example, on worker node ip-10-107-208-54.eu-central-1.compute.internal I can see 3 deployments with duplicate pods running. After the descheduler job ran, those pods were still running on the same worker node.
I checked the job's log; the worker node was processed by the descheduler:
```
11:51:57.466555       1 duplicates.go:49] Processing node: "ip-10-107-208-54.eu-central-1.compute.internal"
```
The request below also got a PodList response; however, it is truncated in the log because the text is too long:
```
11:51:57.498850       1 round_trippers.go:443] GET https://172.20.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-107-208-54.eu-central-1.compute.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 32 milliseconds
```
I'm not sure what I did wrong that the "RemoveDuplicates" strategy doesn't work for me.
/remove-lifecycle stale
@jia2 please provide the below information, so that we can continue to assist with troubleshooting:

- `kubectl describe pod` output for all of the pods in question
- descheduler CLI options
- descheduler policy ConfigMap
- namespace that the pods in question are running in

descheduler CLI options:
```yaml
command:
  - "/bin/descheduler"
args:
  - "--policy-config-file"
  - "/policy-dir/policy.yaml"
  - "--v"
  - "9"
```
descheduler policy ConfigMap:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 70
          "memory": 70
          "pods": 50
```
namespace that the pods in question are running in: a custom namespace, cloudia-abn
`kubectl describe pod` output:
k describe pod ppr-timetableout-camadapter-v01-deploy-7589b4858c-fkgmm
Name: ppr-timetableout-camadapter-v01-deploy-7589b4858c-fkgmm
Namespace: cloudia-abn
Priority: 1000600
Priority Class Name: ppr-priorityclass
Node: ip-10-107-209-28.eu-central-1.compute.internal/10.107.209.28
Start Time: Thu, 11 Jun 2020 01:20:18 +0200
Labels: app=cloudia
creation_date=08.06.20_1535
docker_image_tag=OW_AMQ_WMQ_LOOP-v0.1.1624-SADe_06_2020
environment=cloudia-abn
name=ppr-timetableout-camadapter-v01-pod
pod-template-hash=7589b4858c
serviceversion=ppr-timetableout-camadapter-v01_1.3.3_3573
svc=ppr-timetableout-camadapter-v01
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.107.209.15
Controlled By: ReplicaSet/ppr-timetableout-camadapter-v01-deploy-7589b4858c
Init Containers:
ppr-timetableout-camadapter-v01-init:
Container ID: docker://cc56c328a4b637cdbfc9fc3e1eeaa395bff63d6a8377b980d6092fef9c403bff
Image: 368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/aws-cli:1.18
Image ID: docker-pullable://368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/aws-cli@sha256:37ff2cda184f684c87732cd4d6fc5dd263bac99a149a00e789f7244f99c09b81
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
aws s3 cp --quiet s3://$(S3_BUCKET)/$(S3_SERVICE_FOLDER) . --recursive;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 11 Jun 2020 01:20:38 +0200
Finished: Thu, 11 Jun 2020 01:20:45 +0200
Ready: True
Restart Count: 0
Environment:
S3_SERVICE_FOLDER: cloudia-abn/services/ppr-timetableout-camadapter-v01
S3_BUCKET: 368971480733-cloudia-deploy
Mounts:
/project from configdir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-8rts8 (ro)
Containers:
ppr-timetableout-camadapter-v01:
Container ID: docker://517b0e94943d12e575369e88e7a3f6f74c0808d5882b33c87bf5ec0a01fc4b5c
Image: 368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/ow_amq_wmq_loop:OW_AMQ_WMQ_LOOP-v0.1.1624-SADe_06_2020
Image ID: docker-pullable://368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/ow_amq_wmq_loop@sha256:a27e221ef8d6a4840a060eec390727f88e0927e33ebbe03766e7ff45eab607fc
Ports: 8443/TCP, 8084/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/bin/sh
-c
cd /opt/cloudia/ppr/ppr-timetableout-camadapter-v01; java -DSERVICE_CONTAINER=ppr-timetableout-camadapter-v01 -Dspring.profiles.active=cloudia-abn -Dcom.ibm.mq.cfg.useIBMCipherMappings=false -Dspring.config.location=file:/opt/config/app-config.yaml,file:/opt/config/env.yaml,file:/opt/cloudia-config-secret/cloudia-config-secret.yaml,file:/opt/mq-config-secret/mq-connection.yaml,classpath:config/application-generic-aws.yml,classpath:config/application-utils-aws.yml -Xms1000m -Xmx1000m -jar /opt/ow-amq-wmq-loop.jar
State: Running
Started: Thu, 11 Jun 2020 01:21:32 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 400m
memory: 1000Mi
Requests:
cpu: 100m
memory: 500Mi
Liveness: http-get http://:8084/PPR/Timetableout/CAMAdapter/V01/health delay=200s timeout=10s period=80s #success=1 #failure=5
Readiness: http-get http://:8084/PPR/Timetableout/CAMAdapter/V01/health delay=220s timeout=5s period=50s #success=1 #failure=3
Environment:
STAKATER_PPR_TIMETABLEOUT_CAMADAPTER_V01_CM_CONFIGMAP: 0f49f7a571f36e29f7482eff42e568097fd377e5
STAKATER_MQ_CONFIG_SECRET_SECRET: a715da5e437d474638e45bafc8ba7d755da423a5
STAKATER_CLOUDIA_TRUSTSTORE_CM_CONFIGMAP: da39a3ee5e6b4b0d3255bfef95601890afd80709
STAKATER_CLOUDIA_CONFIG_SECRET_SECRET: 52370aed941e0d62c296351cfe5d258b8128949f
Mounts:
/etc/certs from certs-volume (rw)
/opt/cloudia-config-secret from cloudia-config-volume (ro)
/opt/cloudia/ppr/ppr-timetableout-camadapter-v01 from configdir (rw)
/opt/config from appconfig-volume (rw)
/opt/mq-config-secret from mq-config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-8rts8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
configdir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
appconfig-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: ppr-timetableout-camadapter-v01-cm
Optional: false
certs-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: cloudia-truststore-cm
Optional: false
cloudia-config-volume:
Type: Secret (a volume populated by a Secret)
SecretName: cloudia-config-secret
Optional: false
mq-config-volume:
Type: Secret (a volume populated by a Secret)
SecretName: mq-config-secret
Optional: false
default-token-8rts8:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-8rts8
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
k describe pod ppr-timetableout-camadapter-v01-deploy-7589b4858c-f4gmg
Name: ppr-timetableout-camadapter-v01-deploy-7589b4858c-f4gmg
Namespace: cloudia-abn
Priority: 1000600
Priority Class Name: ppr-priorityclass
Node: ip-10-107-209-28.eu-central-1.compute.internal/10.107.209.28
Start Time: Thu, 11 Jun 2020 01:20:21 +0200
Labels: app=cloudia
creation_date=08.06.20_1535
docker_image_tag=OW_AMQ_WMQ_LOOP-v0.1.1624-SADe_06_2020
environment=cloudia-abn
name=ppr-timetableout-camadapter-v01-pod
pod-template-hash=7589b4858c
serviceversion=ppr-timetableout-camadapter-v01_1.3.3_3573
svc=ppr-timetableout-camadapter-v01
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.107.209.247
Controlled By: ReplicaSet/ppr-timetableout-camadapter-v01-deploy-7589b4858c
Init Containers:
ppr-timetableout-camadapter-v01-init:
Container ID: docker://085cc93a79301ef1a8cca273db5d177802a3856c2b645497dece008c78d16c40
Image: 368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/aws-cli:1.18
Image ID: docker-pullable://368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/aws-cli@sha256:37ff2cda184f684c87732cd4d6fc5dd263bac99a149a00e789f7244f99c09b81
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
aws s3 cp --quiet s3://$(S3_BUCKET)/$(S3_SERVICE_FOLDER) . --recursive;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 11 Jun 2020 01:20:33 +0200
Finished: Thu, 11 Jun 2020 01:20:43 +0200
Ready: True
Restart Count: 0
Environment:
S3_SERVICE_FOLDER: cloudia-abn/services/ppr-timetableout-camadapter-v01
S3_BUCKET: 368971480733-cloudia-deploy
Mounts:
/project from configdir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-8rts8 (ro)
Containers:
ppr-timetableout-camadapter-v01:
Container ID: docker://d5164225f71a6b50c4380556979bc7c00fe38613ad16ecd1faeace017bfd1bf9
Image: 368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/ow_amq_wmq_loop:OW_AMQ_WMQ_LOOP-v0.1.1624-SADe_06_2020
Image ID: docker-pullable://368971480733.dkr.ecr.eu-central-1.amazonaws.com/cloudia/ow_amq_wmq_loop@sha256:a27e221ef8d6a4840a060eec390727f88e0927e33ebbe03766e7ff45eab607fc
Ports: 8443/TCP, 8084/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/bin/sh
-c
cd /opt/cloudia/ppr/ppr-timetableout-camadapter-v01; java -DSERVICE_CONTAINER=ppr-timetableout-camadapter-v01 -Dspring.profiles.active=cloudia-abn -Dcom.ibm.mq.cfg.useIBMCipherMappings=false -Dspring.config.location=file:/opt/config/app-config.yaml,file:/opt/config/env.yaml,file:/opt/cloudia-config-secret/cloudia-config-secret.yaml,file:/opt/mq-config-secret/mq-connection.yaml,classpath:config/application-generic-aws.yml,classpath:config/application-utils-aws.yml -Xms1000m -Xmx1000m -jar /opt/ow-amq-wmq-loop.jar
State: Running
Started: Thu, 11 Jun 2020 01:21:42 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 400m
memory: 1000Mi
Requests:
cpu: 100m
memory: 500Mi
Liveness: http-get http://:8084/PPR/Timetableout/CAMAdapter/V01/health delay=200s timeout=10s period=80s #success=1 #failure=5
Readiness: http-get http://:8084/PPR/Timetableout/CAMAdapter/V01/health delay=220s timeout=5s period=50s #success=1 #failure=3
Environment:
STAKATER_PPR_TIMETABLEOUT_CAMADAPTER_V01_CM_CONFIGMAP: 0f49f7a571f36e29f7482eff42e568097fd377e5
STAKATER_MQ_CONFIG_SECRET_SECRET: a715da5e437d474638e45bafc8ba7d755da423a5
STAKATER_CLOUDIA_TRUSTSTORE_CM_CONFIGMAP: da39a3ee5e6b4b0d3255bfef95601890afd80709
STAKATER_CLOUDIA_CONFIG_SECRET_SECRET: 52370aed941e0d62c296351cfe5d258b8128949f
Mounts:
/etc/certs from certs-volume (rw)
/opt/cloudia-config-secret from cloudia-config-volume (ro)
/opt/cloudia/ppr/ppr-timetableout-camadapter-v01 from configdir (rw)
/opt/config from appconfig-volume (rw)
/opt/mq-config-secret from mq-config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-8rts8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
configdir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
appconfig-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: ppr-timetableout-camadapter-v01-cm
Optional: false
certs-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: cloudia-truststore-cm
Optional: false
cloudia-config-volume:
Type: Secret (a volume populated by a Secret)
SecretName: cloudia-config-secret
Optional: false
mq-config-volume:
Type: Secret (a volume populated by a Secret)
SecretName: mq-config-secret
Optional: false
default-token-8rts8:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-8rts8
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
```
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
```
@jia2 I noticed that your pods have EmptyDir volumes:

```
Volumes:
  configdir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
```
The descheduler does some checks before evicting pods, and pods with EmptyDir (local storage) won't pass these checks, so they won't be evicted. If you want the descheduler to evict them, you can add the annotation `descheduler.alpha.kubernetes.io/evict`. You can refer to this section for more information.

The other option is to run the descheduler with the `--evict-local-storage-pods` CLI option. This enables evicting pods that have local storage.
Thanks to @lixiang233 and @seanmalloy for your hints.

/kind documentation
/remove-kind bug
/close
@jia2 I'm closing this issue because this is expected behavior. By default, pods with local storage will not be evicted by the descheduler. Feel free to reopen this issue or post in sig-scheduling on the Kubernetes Slack if you need further assistance. Thanks!
@seanmalloy: Closing this issue.
Here are my Kubernetes version and the version of descheduler.
I have duplicated pods on one worker node and ran descheduler v0.10.0 as a job; however, it didn't work as expected. I set the log level to 9, checked the logs, and found that the response doesn't contain the complete list of pods on the queried node. For example, for the request

```
https://172.20.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-107-193-192.eu-central-1.compute.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
```

there are only two pods in the response, but there are actually 39 pods running on this node. In the logs I can find the response with an HTTP 200 status code:

```
I0218 14:55:44.684433       1 duplicates.go:50] Processing node: "ip-10-107-193-192.eu-central-1.compute.internal"
I0218 14:55:44.707379       1 round_trippers.go:443] GET https://172.20.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dip-10-107-193-192.eu-central-1.compute.internal%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded 200 OK in 22 milliseconds
```

I'm not sure whether the pod list is written to the log completely or not.
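As an aside, the fieldSelector in the logged request is just the percent-encoded form of the plain selector; a quick standard-library check (illustrative only):

```python
from urllib.parse import quote, unquote

# The raw field selector the descheduler builds for one node
selector = ("spec.nodeName=ip-10-107-193-192.eu-central-1.compute.internal"
            ",status.phase!=Failed,status.phase!=Succeeded")

# Percent-encoding with no extra "safe" characters reproduces the form seen
# in the request log ('=' -> %3D, ',' -> %2C, '!' -> %21)
encoded = quote(selector, safe="")
print(encoded)

# Decoding the logged value recovers the original selector
assert unquote(encoded) == selector
```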
When I run `kubectl proxy` on my local machine and open the URL

```
http://localhost:8001/api/v1/pods?fieldSelector=spec.nodeName=ip-10-107-193-192.eu-central-1.compute.internal,status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded
```

in a browser, it definitely takes much longer than 22 milliseconds, but I get the complete list with 39 pods. Can I increase the timeout value for this request so it waits a little longer? Thanks