serokell / deploy-rs

A simple multi-profile Nix-flake deploy tool.

Magic rollback is not working if previous version was not deployed with deploy-rs #86

Open petrkozorezov opened 3 years ago

petrkozorezov commented 3 years ago

I'm trying to migrate my deployments to deploy-rs. On one of my nodes (previously deployed with nixops), something went wrong after the first attempt to deploy with deploy-rs: the network was lost, and magic rollback failed with ERROR [activate] Error de-activating due to another error waiting for confirmation, oh no...: Failed to run command for re-activating the last generation: No such file or directory (os error 2)

As I understand it, this happens because deploy-rs-activate is called from the old profile, which was not deployed with deploy-rs: https://github.com/serokell/deploy-rs/blob/9e405fbc5ab5bacbd271fd78c6b6b6877c4d9f8d/src/bin/activate.rs#L162

I know that it's a corner case, but it's inconvenient.

λ cat -p /var/log/deploy/deploy_2021-05-09_09-06-38.log
DEBUG [deploy] Checking for flake support
INFO [deploy] Evaluating flake in .
INFO [deploy] The following profiles are going to be deployed:
[router.system]
user = "root"
ssh_user = "root"
path = "/nix/store/mblwh0vfb61lsrgfskaa4vplilnky0a2-activatable-nixos-system-nixos-21.05.20210506.6358647"
hostname = "router"
ssh_opts = []

INFO [deploy::push] Building profile `system` for node `router`
DEBUG [deploy::push] Copying profile `system` to node `router`
INFO [deploy::deploy] Activating profile `system` for node `router`
DEBUG [deploy::deploy] Constructed activation command: /nix/store/mblwh0vfb61lsrgfskaa4vplilnky0a2-activatable-nixos-system-nixos-21.05.20210506.6358647/activate-rs --debug-logs --log-dir /var/log/deploy --temp-path '/tmp' activate '/nix/store/mblwh0vfb61lsrgfskaa4vplilnky0a2-activatable-nixos-system-nixos-21.05.20210506.6358647' '/nix/var/nix/profiles/system' --confirm-timeout 30 --magic-rollback --auto-rollback
DEBUG [deploy::deploy] Constructed wait command: /nix/store/mblwh0vfb61lsrgfskaa4vplilnky0a2-activatable-nixos-system-nixos-21.05.20210506.6358647/activate-rs --debug-logs --log-dir /var/log/deploy --temp-path '/tmp' wait '/nix/store/mblwh0vfb61lsrgfskaa4vplilnky0a2-activatable-nixos-system-nixos-21.05.20210506.6358647'
INFO [deploy::deploy] Creating activation waiter
DEBUG [deploy::deploy] Wait command ended
ERROR [deploy] Failed to deploy profile: Waiting over SSH resulted in a bad exit code: Some(255)
# cat /var/log/deploy/activate_activate_2021-05-09_06-06-48.log
INFO [activate] Activating profile
DEBUG [activate] Running activation script
INFO [activate] Activation succeeded!
INFO [activate] Magic rollback is enabled, setting up confirmation hook...
DEBUG [activate] Ensuring parent directory exists for canary file
DEBUG [activate] Creating canary file
DEBUG [activate] Creating notify watcher
INFO [activate] Waiting for confirmation event...
ERROR [activate] Error waiting for confirmation event: Timeout elapsed for confirmation
WARN [activate] De-activating due to error
DEBUG [activate] Listing generations
DEBUG [activate] Removing generation entry   88   2021-05-09 06:06:48
WARN [activate] Removing generation by ID 88
INFO [activate] Attempting to re-activate the last generation
ERROR [activate] Error de-activating due to another error waiting for confirmation, oh no...: Failed to run command for re-activating the last generation: No such file or directory (os error 2)
# cat /var/log/deploy/activate_wait_2021-05-09_06-06-47.log
INFO [activate] Waiting for confirmation event...
INFO [activate] Found canary file, done waiting!

notgne2 commented 3 years ago

I've gotten annoyed a lot by this too, but I think basically this is semi-intentional.

This issue didn't use to exist: on rollback, we would run the same activation command on the previous generation as would be used on the newer generation. We replaced that behaviour with the current one because the activation command may sometimes change, and assuming that it doesn't is potentially dangerous. I don't remember whether the previous functionality is anywhere in this Git history, and I can't find the conversations about it right now, but deploy (the predecessor of deploy-rs) made that assumption: https://github.com/serokell/deploy/blob/master/deploy.sh#L148-L149.

What I think might be acceptable in this situation is to prefer /deploy-rs-activate in the previous generation, but if it is unavailable, fall back to using the one from the current generation.
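
A minimal sketch of that fallback, for illustration only. The helper name `pick_rollback_activator` is hypothetical and not part of deploy-rs, and the script file name is taken from the wording above, so treat it as an assumption too. The function checks only whether the file exists; it does not run anything.

```rust
use std::path::{Path, PathBuf};

/// File name of the activation script inside a profile generation.
/// Assumption: taken from the discussion above, not verified against deploy-rs.
const ACTIVATE_SCRIPT: &str = "deploy-rs-activate";

/// Hypothetical helper: choose which activation script to use for a rollback.
///
/// Prefers the script shipped in the previous generation. If that generation
/// has none (for example, it was not deployed with deploy-rs), falls back to
/// the script from the current generation. Returns `None` if neither exists.
fn pick_rollback_activator(previous_gen: &Path, current_gen: &Path) -> Option<PathBuf> {
    [previous_gen, current_gen]
        .iter()
        .map(|generation| generation.join(ACTIVATE_SCRIPT))
        .find(|candidate| candidate.is_file())
}
```

With this shape, a `None` result is the case from the original report where no script can be found at all, so the caller can report a clear error instead of failing with `No such file or directory`.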

Shados commented 1 year ago

> What I think might be acceptable in this situation is to prefer /deploy-rs-activate in the previous generation, but if it is unavailable, fall back to using the one from the current generation.

As someone who just got bitten migrating a system over to using deploy, this would definitely be nicer.