preflightsiren opened this issue 3 years ago
I manually kubectl apply'd a RollingUpgrade, this time setting drainTimeout: 600, and it succeeded. Doing the same thing with drainTimeout: -1 also errored.
apiVersion: upgrademgr.keikoproj.io/v1alpha1
kind: RollingUpgrade
metadata:
  annotations:
    app.kubernetes.io/managed-by: instance-manager
    instancemgr.keikoproj.io/upgrade-scope: ew1p-fcp-k2-xxxxxxxx-sh-r5-2xlarge-eu-west-1c
  name: rollup-nodes-0098ee98d43054385-seb
  namespace: instance-manager
spec:
  asgName: ew1p-fcp-k2-xxxxxxxx-sh-r5-2xlarge-eu-west-1c
  nodeIntervalSeconds: 10
  postDrain:
    waitSeconds: 90
  postDrainDelaySeconds: 45
  region: eu-west-1
  strategy:
    drainTimeout: 600
    mode: eager
ok, I think I've figured out the chain of events to replicate the failure:

1. strategy.drainTimeout has a default value of -1 (I'm assuming to represent infinity): https://github.com/keikoproj/upgrade-manager/blob/master/README.md
2. The drain helper's timeout is set from strategy.drainTimeout * time.Second, so by default it ends up as -1s.
3. d.Timeout is checked for == 0 (https://github.com/kubernetes/kubectl/blob/master/pkg/drain/drain.go#L277); otherwise globalTimeout is set to the value passed in (-1s by default).
4. context.WithTimeout(d.getContext(), globalTimeout) is called (https://github.com/kubernetes/kubectl/blob/master/pkg/drain/drain.go#L282), which exits immediately for timeouts <= 0 (see the paraphrased sketch below).
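For reference, here is a small self-contained paraphrase of the drain.go behaviour linked above, not a verbatim quote; the function and parameter names are illustrative, not the upstream identifiers. The key point is that only an exact 0 is swapped for an effectively infinite timeout, while a negative value passes through and yields an already-expired context.

package drainref

import (
    "context"
    "math"
    "time"
)

// drainContext paraphrases the linked drain.go logic: a zero timeout is
// mapped to an effectively infinite one, but a negative timeout is passed
// through unchanged, so context.WithTimeout returns a context whose
// deadline has already passed.
func drainContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
    globalTimeout := timeout
    if timeout == 0 {
        globalTimeout = time.Duration(math.MaxInt64)
    }
    return context.WithTimeout(parent, globalTimeout)
}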
I've created a playground https://play.golang.org/p/K9KmZ0vkX4f to demonstrate
package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    fmt.Println("Hello world.")
    errChan := make(chan error)
    globalTimeout := 0

    // A timeout of 0 (or any value <= 0) produces a context that is already
    // expired, so the ctx.Done() branch below fires immediately instead of
    // waiting for errFunc.
    ctx, cancel := context.WithTimeout(context.TODO(), time.Duration(globalTimeout)*time.Second)
    defer cancel()

    go errFunc(errChan)

    select {
    case <-ctx.Done():
        fmt.Println("done")
    case err := <-errChan:
        fmt.Println(err)
    }

    time.Sleep(100 * time.Second)
}

func errFunc(c chan error) {
    time.Sleep(5 * time.Second)
    c <- fmt.Errorf("timer called")
}
some potential solutions:

1. set the timeout to time.Duration(math.MaxInt64) when the configured value is <= 0
2. only accept drainTimeout values > 0
3. match the convention in the kubectl drain docs: "--timeout=0s: The length of time to wait before giving up, zero means infinite"
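A minimal sketch of the first option, assuming the caller converts the CRD's drainTimeout before handing it to the drain helper; the package and function names here are hypothetical, not upgrade-manager's actual code.

package drainopts

import (
    "math"
    "time"
)

// toDrainTimeout maps the RollingUpgrade drainTimeout semantics (-1, or
// anything <= 0, meaning "never time out") onto an effectively infinite
// duration, instead of passing a negative timeout through to the drain
// helper.
func toDrainTimeout(drainTimeoutSeconds int) time.Duration {
    if drainTimeoutSeconds <= 0 {
        return time.Duration(math.MaxInt64)
    }
    return time.Duration(drainTimeoutSeconds) * time.Second
}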
Looks like in the latest code this has been partially addressed by https://github.com/keikoproj/upgrade-manager/blob/controller-v2/controllers/upgrade.go#L121, which works in conjunction with https://github.com/keikoproj/upgrade-manager/blame/controller-v2/main.go#L93 to set a default drainTimeout of 900s.
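Roughly, wiring a default like that through a controller flag could look as follows; the flag name, variable, and default shown here are illustrative only and may not match the project's exact identifiers.

package main

import (
    "flag"
    "fmt"
)

var drainTimeoutSeconds int

func main() {
    // Illustrative flag wiring: an unset drainTimeout falls back to a sane
    // default (900s) rather than 0 or -1.
    flag.IntVar(&drainTimeoutSeconds, "drain-timeout", 900,
        "number of seconds to wait for a node to drain before giving up")
    flag.Parse()
    fmt.Println("drain timeout (seconds):", drainTimeoutSeconds)
}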
Setting drainTimeout to 0 also fails with the same "global timeout: -1s" error - no idea why yet. I worked around the issue by setting drainTimeout to maxInt32 (2147483647).
// validating the DrainTimeout value
if strategy.DrainTimeout == 0 {
    r.Spec.Strategy.DrainTimeout = -1
} else if strategy.DrainTimeout < -1 {
    err := fmt.Errorf("%s: Invalid value for startegy DrainTimeout - %d", r.Name, strategy.MaxUnavailable.IntVal)
    return false, err
}
that'll do it....
@shreyas-badiger
We should handle the case of a 0 drain timeout; drainTimeout should never be 0. It should either be greater than 0 (explicitly set to a number of seconds) or -1, which means never time out.
It seems we are not setting a default value when drainTimeout is not set (so it defaults to 0).
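A sketch of validation that would keep those semantics is below. The function, constant, and default value are illustrative only, not the project's actual code; the 900 simply matches the 900s default mentioned earlier in the thread.

package validation

import "fmt"

// assumed default, matching the 900s mentioned earlier in this thread
const defaultDrainTimeoutSeconds = 900

// normalizeDrainTimeout: an unset drainTimeout (0) falls back to a default,
// -1 is kept as the explicit "never time out" value (and still needs
// translating before it reaches the drain helper, as noted in the reply
// below), and anything below -1 is rejected.
func normalizeDrainTimeout(name string, drainTimeout int) (int, error) {
    switch {
    case drainTimeout == 0:
        return defaultDrainTimeoutSeconds, nil
    case drainTimeout < -1:
        return 0, fmt.Errorf("%s: invalid strategy drainTimeout %d", name, drainTimeout)
    default:
        return drainTimeout, nil
    }
}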
I think it's more that the kubectl drain package you're using treats 0 as infinity instead of -1. Negative values essentially mean "time out immediately".
Is this a BUG REPORT or FEATURE REQUEST?: Bug report
What happened: Something very similar to #225. During a change to an instanceGroup, a RollingUpgrade is created. The RollingUpgrade correctly identifies the affected EC2 instances, attempts to drain the node, and fails with an error. Deleting the RollingUpgrade and restarting the process consistently fails in the same way.
Executing the same drain manually, e.g.
kubectl drain <node> --delete-local-data --ignore-daemonsets
works as expected from a local shell.
What you expected to happen: kubectl drain evicts the pods as normal.
How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce it outside of my environment, but seems to affect certain instance groups consistently.
Anything else we need to know?: The error appears to be produced immediately (about 1 second after the drain messages). There are no PDBs in use on this cluster, and all pods have
terminationGracePeriodSeconds=30
Environment:
Other debugging information (if applicable):