giantswarm/inago

Inago orchestrates groups of units on fleet clusters
https://giantswarm.io/
Apache License 2.0

updating fails when only one slice exists #216

Closed: Nesurion closed this issue 8 years ago

Nesurion commented 8 years ago

I deployed a group with one slice and then tried updating it with inagoctl update mygroup, i.e. using the default parameters --max-growth=1, --min-alive=1 and --ready-secs=30.
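
For reference, this should be equivalent to spelling the defaults out explicitly (assuming the flags can be passed alongside the group name):

```
inagoctl update mygroup --max-growth=1 --min-alive=1 --ready-secs=30
```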

Inago starts a second slice and waits until it's up. Destroying the old slice, however, never happens; it repeats the following step forever:

2016-04-08 09:23:22.868 | DEBUG    | context.Background: task: fetching state for task: 71d36a5c-e0cf-499e-8adf-405b530c22fd
2016-04-08 09:23:22.868 | DEBUG    | context.Background: task: found task: &task.Task{ActiveStatus:"started", Error:error(nil), FinalStatus:"", ID:"71d36a5c-e0cf-499e-8adf-405b530c22fd"}
2016-04-08 09:23:22.868 | DEBUG    | context.Background: task: does not have final status: &task.Task{ActiveStatus:"started", Error:error(nil), FinalStatus:"", ID:"71d36a5c-e0cf-499e-8adf-405b530c22fd"}

fleetctl list-units:

UNIT                    MACHINE                         ACTIVE  SUB
mygroup-bar@cb0.service c4f2e5ee.../172.17.8.101        active  running
mygroup-bar@e51.service c4f2e5ee.../172.17.8.101        active  running
mygroup-foo@cb0.service c4f2e5ee.../172.17.8.101        active  running
mygroup-foo@e51.service c4f2e5ee.../172.17.8.101        active  running

Nesurion commented 8 years ago

The update task waits forever for the old slice to be deleted, which never happens because the delete task is never scheduled. We found that the break in the loop at https://github.com/giantswarm/inago/blob/master/controller/update.go#L353 causes this error: the outer loop iterates over the request slices (in our case just one) and breaks right after scheduling the add, so the removal of the old slice never gets scheduled.

This is only part of the story, though: simply removing the break causes the update to run forever in all cases.
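
For illustration, here is a minimal Go sketch of the loop shape described above. The names and structure are hypothetical simplifications, not the actual code in controller/update.go; it only shows why a single request slice plus the break leaves the remove task unscheduled, matching the endless polling in the logs.

```go
package main

import "fmt"

// scheduleUpdateTasks is a hypothetical sketch of the loop shape described
// in the comment above; it is NOT the actual code in controller/update.go.
func scheduleUpdateTasks(requestSlices, runningSlices []string) []string {
	var tasks []string

	for i, s := range requestSlices {
		// grow: schedule the add task for the replacement slice
		tasks = append(tasks, "add "+s)

		// the break in question: with a single request slice the outer
		// loop ends here, right after the add has been scheduled
		break

		// shrink: because of the break above, this remove task for the
		// old slice is never scheduled, so the update keeps polling for
		// the old slice to disappear (see the task logs above)
		tasks = append(tasks, "remove "+runningSlices[i])
	}

	return tasks
}

func main() {
	// One running slice (cb0) being replaced by one new slice (e51),
	// matching the fleetctl output in the report.
	fmt.Println(scheduleUpdateTasks([]string{"e51"}, []string{"cb0"}))
	// Prints: [add e51] -- no "remove cb0" task is ever created.
}
```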

Nesurion commented 8 years ago

After testing different numbers of slices again and syncing with @xh3b4sd, we discovered that the update failure was due to a broken fleet and not inago.