Automate the restart of the node robot when new flist is promoted to the hub

rkhamis commented 6 years ago

Issue migrated from https://api.github.com/repos/zero-os/0-robot/issues/204, opened by @zaibon

We need to find a mechanism that allows the node robot to restart its container automatically when a new version of the flist is published.

Currently a user needs to connect to the node and manually stop the node robot container so that 0-OS respawns a new container, which brings in the new code for the node robot.

This needs to be automated because we cannot do this on all nodes of the grid (too many nodes, no access to the 0-core interface, ...).

rkhamis commented 6 years ago

commented by @grimpy When we detect that a new flist version is available, we should put the robot in some kind of maintenance mode. In this mode the schedulers stop scheduling new tasks and the API stops accepting new connections (just like in shutdown mode). We wait X amount of time for the remaining tasks to complete, and at that point zrobot just stops. In production this means core-0 will respawn the z-robot in a new container with the latest version.
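The maintenance mode described above could be sketched roughly as follows. This is only an illustration of the idea, not 0-robot code: `MaintenanceGate` and its method names are hypothetical, and the grace period stands in for the "X amount of time" mentioned in the comment.

```python
import threading
import time


class MaintenanceGate:
    """Hypothetical helper: counts running tasks and refuses new ones
    once maintenance mode has been entered."""

    def __init__(self):
        self._cond = threading.Condition()
        self._running = 0
        self._maintenance = False

    def try_start_task(self):
        """Return True if a new task may start, False in maintenance mode."""
        with self._cond:
            if self._maintenance:
                return False
            self._running += 1
            return True

    def task_done(self):
        """Mark one task as finished and wake up anyone waiting to drain."""
        with self._cond:
            self._running -= 1
            self._cond.notify_all()

    def enter_maintenance(self, grace_seconds):
        """Stop admitting tasks, then wait up to grace_seconds for the rest.

        Returns True if every task finished in time, False on timeout.
        """
        deadline = time.monotonic() + grace_seconds
        with self._cond:
            self._maintenance = True
            while self._running > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True
```

Once `enter_maintenance` returns, the process would simply exit and leave the respawn to core-0, as described in the comment. The scheduler and the API would both call `try_start_task` before accepting work.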

rkhamis commented 6 years ago

commented by @zaibon This sounds great. One question though: how would the robot know that it is being run from an flist and that it should monitor it on the hub? Also, how does it know which flist to monitor?

rkhamis commented 6 years ago

commented by @grimpy @zaibon I was assuming that node mode would imply this. I was thinking it should always check the latest. I'm not sure whether it's possible to read the symlink information from the hub API to know that the link has moved?

rkhamis commented 6 years ago

commented by @zaibon I think it is fine: the symlink that points to the latest build has the same modification date as its target, so we can see when something changes.
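A minimal sketch of that check, assuming the modification date of the "latest" link can be fetched somehow. The hub API call itself is not shown, since the thread does not specify it: `fetch_mtime` is a hypothetical callable supplied by the caller, and `FlistWatcher` is an illustrative name.

```python
class FlistWatcher:
    """Hypothetical helper: reports when the flist modification date moves."""

    def __init__(self, fetch_mtime):
        # fetch_mtime is an assumption: any callable that returns the
        # modification date of the "latest" flist link on the hub.
        self._fetch_mtime = fetch_mtime
        self._last_seen = None

    def has_changed(self):
        """Return True when the date differs from the previous poll."""
        current = self._fetch_mtime()
        if self._last_seen is None:
            # The first poll only records a baseline to compare against.
            self._last_seen = current
            return False
        if current != self._last_seen:
            self._last_seen = current
            return True
        return False
```

Comparing for inequality rather than "newer than" also catches the case where the link is moved back to an older build.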

rkhamis commented 6 years ago

*commented by @zaibon To continue this discussion, I was thinking about a way for the robot to also update itself when new templates are published.

In order for this to work, we need to solve #161 first so we can start the robot container with some mounted flists that contain the required templates. Then the robot would monitor both its own flist and all the mounted flists that provide templates.

Then when any of those changes, we need to somehow stop the robot container, so that it is automatically restarted and the new template flist is downloaded and mounted. We need to fully stop the container because updating or changing a mounted flist is not supported by 0-OS yet.

Once we have these features in place, we can have a fully automated update procedure for the full grid, at least for the management layer.

Now technically, @muhamadazmy, is there a way for the robot to directly stop the container it's running in? Or should I connect to the local robot using the client and stop the container from there? I'm just asking because it would be nice to avoid putting too much knowledge of where the robot runs into the code itself. But if not, then well, I can live with it I guess.*

rkhamis commented 6 years ago

*commented by @muhamadazmy @zaibon There is no way for a container to stop itself (yet), and I am not sure this is needed. Technically, the only way to stop the container is by killing the coreX process, which I think is possible from inside the container but may be disabled later by ignoring the kill signals.

To update a container we have 2 approaches:

1. Full stop/start of a container when a new flist is available. While this is a very easy solution, it implies stopping all the running processes inside the container (downtime).
2. Another, more sophisticated approach (which we need to experiment with first) is a hot update of the 0-fs metadata database on the fly. We will probably have to pause fs access for a few milliseconds until the metadata backend is replaced, but that shouldn't be a problem. The processes will just continue from where they left off.

This approach, of course, will not force the processes that need a restart to use the new code. A process restart will still be needed to use the new libraries. We still need to think about this part.

Note: one solution for this problem is that all processes that need a restart check the flist version (a specific file we keep on the flist), kill themselves once a new version is available, and leave the respawn to the container's startup.toml.*

rkhamis commented 6 years ago

*commented by @zaibon

> one solution for this problem, all processes that needs a restart can check the flist version (a specific file we keep on the flist) and they kill themselves once a new version is available, and leave the respawn to the container startup.toml.

This is exactly how I currently implemented it in the robot, but I'm afraid it won't be enough. How can I know that the fs metadata has already been updated when I check the flist? Maybe the robot will see the updated flist before 0-fs does, in which case I would restart while the filesystem is still not updated. Maybe we could send a signal to the process group of a container every time the fs is updated, so any process that needs a hot update can handle that signal and we are sure this is done in the right order.*
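On the process side, handling such a signal could look roughly like this. It is a sketch under assumptions: no such signal exists in 0-OS today, `SIGUSR1` is picked arbitrarily as a stand-in, and the function names are hypothetical.

```python
import os
import signal
import threading

# Set once the (hypothetical) "filesystem updated" signal has arrived.
fs_updated = threading.Event()


def _on_fs_updated(signum, frame):
    # Keep the handler tiny: only record that the signal was received.
    fs_updated.set()


def install_fs_update_handler(signum=signal.SIGUSR1):
    """Register the handler. SIGUSR1 is an assumption, not a 0-OS convention."""
    signal.signal(signum, _on_fs_updated)


def should_restart(timeout=None):
    """Block up to `timeout` seconds and return True once the signal arrived."""
    return fs_updated.wait(timeout)
```

Because the signal would only be sent after 0-fs has swapped its metadata, a process that restarts on `should_restart()` returning True never restarts against the old filesystem, which is the ordering problem raised above.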

muhamadazmy commented 5 years ago

The robot can watch the hub for a new flist release. If one is found, the robot can use the node client to terminate its container. Since the container is protected, a new robot container will be created that uses the latest flist. I think this is the simplest solution possible.

This can be implemented as part of the node service monitoring.
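As a rough sketch of that monitoring step: the node client's real method names are not given in this thread, so both the release check and the container termination are passed in as hypothetical callables, and `SelfUpdateMonitor` is an illustrative name only.

```python
import threading


class SelfUpdateMonitor:
    """Hypothetical monitoring step: terminate our own container once a
    new flist release is seen, and let 0-OS respawn it."""

    def __init__(self, release_changed, terminate_container, container_id):
        # All three arguments are assumptions: release_changed() returns
        # True when the hub has a new flist, terminate_container(id)
        # wraps whatever the node client offers to stop a container.
        self._release_changed = release_changed
        self._terminate_container = terminate_container
        self._container_id = container_id

    def check_once(self):
        """Run one monitoring pass. Return True if termination was requested."""
        if not self._release_changed():
            return False
        self._terminate_container(self._container_id)
        return True

    def run(self, interval_seconds, stop_event):
        """Poll until a new release is handled or stop_event is set."""
        while not stop_event.is_set():
            if self.check_once():
                return
            stop_event.wait(interval_seconds)
```

Keeping the two callables injected means the robot code itself does not need to know where it runs, which was one of the concerns raised earlier in the thread.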

threefoldtecharchive / 0-robot

Automate the restart of the node robot when new flist is promoted to the hub #20