keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0
94 stars 28 forks source link

drain-interval does not wait between drain failures #185

Open omgrr opened 8 months ago

omgrr commented 8 months ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened: The retry-interval does not work. I believe it got removed accidentally but it appears not to be used anywhere. https://github.com/keikoproj/lifecycle-manager/blob/master/pkg/service/nodes.go#L72

When testing with a blocking PDB (just setting maxUnavailable: 0) to simulate a drain timeout, I would expect the total time to equal drain_timeout * drain_retries * drain-interval. However it becomes really obvious with a high drain-interval that its not being used.

What you expected to happen:

I would have expected the total time to drain a node before abandoning, to be drain_timeout * drain_retries * drain-interval.

How to reproduce it (as minimally and precisely as possible):

The easiest way is probably just to alter this test, set to the retry interval to something very large, and run make test. Since it should fail, it should sleep for the drain interval but it doesn't. And actually even the way the test is written the whole suite should fail because the Makefile is running the tests with -timeout 30s.

Alternatively you can make a blocking PDB with maxUnavailable: 0, and set the drain-interval to something very large, the retries large, and the drain timeout small.

Anything else we need to know?: