pusher / k8s-spot-rescheduler

Tries to move K8s Pods from on-demand to spot instances
Apache License 2.0

Failed to update metrics on spot nodes #46

Closed komljen closed 6 years ago

komljen commented 6 years ago

The rescheduler fails to update metrics on spot nodes because two non-replicated pods are running on them. The logs show the following, and it keeps repeating:

I0524 07:01:08.275301       1 rescheduler.go:207] Starting node processing.
E0524 07:01:08.356870       1 rescheduler.go:383] Failed to update metrics on spot node ip-10-4-50-15.eu-west-1.compute.internal: pr-333/es-master-st-cluster-eu-west-1a-0 is not replicated
E0524 07:01:08.356904       1 rescheduler.go:383] Failed to update metrics on spot node ip-10-4-52-234.eu-west-1.compute.internal: monitoring/mon-exporter-node-bczlg is not replicated
E0524 07:01:08.356919       1 rescheduler.go:256] Failed to get pods for consideration: monitoring/mon-exporter-node-gxxph is not replicated
E0524 07:01:08.356932       1 rescheduler.go:256] Failed to get pods for consideration: pr-333/es-data-st-cluster-eu-west-1a-0 is not replicated
I0524 07:01:08.356942       1 rescheduler.go:289] Finished processing nodes.

At this point the rescheduler is pretty much useless.

JoelSpeed commented 6 years ago

Hi @komljen, thanks for raising this issue.

I think you're right about the metrics being wrong. I believe changing the first boolean to true on this line will fix the metrics updating: https://github.com/pusher/k8s-spot-rescheduler/blob/d93c65cb68d2a2c9e2f0d8f374bc402140c3dabe/rescheduler.go#L381

Based on https://github.com/kubernetes/autoscaler/blob/513b7d672b6d76dd8bf075041d8353e275fdc7a2/cluster-autoscaler/utils/drain/drain.go#L182

Since we aren't actually using this group of pods for deletion we can just retrieve all of the pods here for the metrics.
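To illustrate the idea, here is a minimal, self-contained sketch of how a "retrieve everything" boolean might change the outcome. It does not use the real autoscaler code: the `Pod` type, the `filterPods` function, and the `deleteAll` parameter name are hypothetical stand-ins, and only the error text is taken from the logs above.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Namespace string
	Name      string
	OwnerKind string // "" means no controller owns the pod (assumption for this sketch)
}

// filterPods mimics the behaviour discussed above. With deleteAll=false, the
// first pod without a controller aborts the whole listing with an error.
// With deleteAll=true, every pod is returned and no error is raised.
func filterPods(pods []Pod, deleteAll bool) ([]Pod, error) {
	out := make([]Pod, 0, len(pods))
	for _, p := range pods {
		if !deleteAll && p.OwnerKind == "" {
			return nil, fmt.Errorf("%s/%s is not replicated", p.Namespace, p.Name)
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	pods := []Pod{
		{Namespace: "default", Name: "web-1", OwnerKind: "ReplicaSet"},
		{Namespace: "default", Name: "standalone"},
	}

	// Strict mode: suitable when the result is used to pick pods for eviction.
	if _, err := filterPods(pods, false); err != nil {
		fmt.Println("strict:", err)
	}

	// Permissive mode: suitable when the pods are only counted for metrics.
	all, _ := filterPods(pods, true)
	fmt.Println("permissive:", len(all), "pods")
}
```

In this sketch the strict call fails on the unowned pod, while the permissive call returns both pods, which is all a metrics update needs.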

E0524 07:01:08.356919       1 rescheduler.go:256] Failed to get pods for consideration: monitoring/mon-exporter-node-gxxph is not replicated
E0524 07:01:08.356932       1 rescheduler.go:256] Failed to get pods for consideration: pr-333/es-data-st-cluster-eu-west-1a-0 is not replicated

Are you also suggesting that the Spot rescheduler should be able to delete non-replicated pods? Is this a behaviour you would want?

komljen commented 6 years ago

Thanks for the quick reply! I will try making this change and let you know.

Are you also suggesting that the Spot rescheduler should be able to delete non-replicated pods? Is this a behaviour you would want?

I think it is OK to leave it like this or, if possible, to make it configurable.

komljen commented 6 years ago

OK, the metrics are good now, but the mon-exporter-node-gxxph pod actually belongs to a DaemonSet, so surely it shouldn't be reported as not replicated?
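For context, a pod created by a DaemonSet carries an owner reference whose kind is DaemonSet. The sketch below shows one way a "replicated" check could be based on the controlling owner's kind. The `OwnerRef` type, both function names, and the list of accepted kinds are assumptions made for illustration; this is not the rescheduler's or the autoscaler's actual logic.

```go
package main

import "fmt"

// OwnerRef is a simplified, hypothetical version of a Kubernetes owner reference.
type OwnerRef struct {
	Kind       string
	Controller bool // true if this owner is the managing controller
}

// controllerKind returns the kind of the controlling owner, or "" if there is none.
func controllerKind(refs []OwnerRef) string {
	for _, r := range refs {
		if r.Controller {
			return r.Kind
		}
	}
	return ""
}

// isReplicated reports whether the controlling owner is one of the kinds that
// this sketch assumes will recreate the pod after it is deleted.
func isReplicated(refs []OwnerRef) bool {
	switch controllerKind(refs) {
	case "ReplicaSet", "ReplicationController", "StatefulSet", "DaemonSet", "Job":
		return true
	}
	return false
}

func main() {
	daemonPod := []OwnerRef{{Kind: "DaemonSet", Controller: true}}
	barePod := []OwnerRef{}
	fmt.Println(isReplicated(daemonPod), isReplicated(barePod))
}
```

Under these assumptions a DaemonSet pod counts as replicated and a pod with no controlling owner does not.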

komljen commented 6 years ago

Another case: I have one spot instance and one on-demand instance, and kube-dns is deployed on both of them, but I get this in the logs:

E0601 14:15:02.256462       1 rescheduler.go:383] Failed to update metrics on spot node ip-10-2-3-219.eu-west-1.compute.internal: kube-system/kube-dns-7785f4d7dc-qhn7p is not replicated
E0601 14:15:02.256487       1 rescheduler.go:256] Failed to get pods for consideration: kube-system/kube-dns-7785f4d7dc-vmjqw is not replicated

kimxogus commented 6 years ago

I'm having the same issue. The log says that kube-dns and my other pods managed by Deployments (from Helm charts) are not replicated. I would be happy if you made it configurable to delete all pods.

komljen commented 6 years ago

We could probably close this issue, since we now have the ability to move non-replicated pods?

JoelSpeed commented 6 years ago

Agreed, thanks @komljen