habitat-sh / habitat

Modern applications with built-in automation
https://www.habitat.sh
Apache License 2.0

v0.56.0 [Err: 4] secret key mismatch on sup restart studio #5169

Open eeyun opened 6 years ago

eeyun commented 6 years ago

I can't replicate this 100% of the time; however, I have run into it on a couple of occasions.

Ubuntu 16.04 Linux 4.4.0-124 Chroot Studio

I experienced this in the builder dev environment. I entered a fresh chroot-based studio. On loading in, I ran hab pkg install results/<origin>-<packagename>.hart, which installed successfully. However, for some reason this caused the supervisor to exit. I restarted the supervisor and attempted to get its status before continuing, but was met with the referenced error.

* Install of habitat/builder-minio/2018-05-11T00-2924Z/20180605190642 complete with 2 new packages installed.
[1]+ Done                 hab sup run "$@" > /hab/sup/default/sup.log 2>&1  (wd: /)
(wd now: /src)
[2][default:/src:0]# sup-run
[3][default:/src:0]# hab sup status
XXX
XXX  [Err: 4] secret key mismatch
XXX
[4][default:/src:0]# hab svc status
XXX
XXX  [Err: 4] secret key mismatch
XXX

At this point I attempted to term the supervisor and was met with another error:

[5][default:/src:0]# hab sup term
hab-sup(MR)[components/sup/src/manager/mod.rs:314:64]: Failed to send a signal to the child process

I haven't had a chance to dig into this yet but when I experienced this yesterday there were definitely child processes that stayed running and had to be kill -9'd from the host.

In attempting to exit the studio I see further errors.

[6][default:/src:0]# exit
logout
kill: can't kill pid 6171: No such process
Warning: '/hab/pkgs/core/hab-studio/0.56.0/20180530235913/libexec/busybox kill 6171' failed with status 1

Checking ps auxf from the host afterwards shows the launcher, the sup, and a process I had attempted to start inside the studio still running on the host. These also had to be kill -9'd.

baumanj commented 6 years ago

It looks like the issue here is that communication with the supervisor now requires a shared secret. By default, this lives in /hab/sup/default/CTL_SECRET (see https://www.habitat.sh/blog/2018/05/changes-in-the-0.56.0-supervisor/).

However, what can be confusing is that when run inside a studio, there can be a separate CTL_SECRET for each studio instance. For example, in the repro I was looking at there was an instance here: /hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/CTL_SECRET.

I was able to determine which supervisor was running by looking for the lock file:

hab$ sudo find /hab/ -path "*/sup/default/LOCK"
/hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/LOCK

We can confirm it's still running:

hab$ ps -p $(cat /hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/LOCK)
  PID TTY          TIME CMD
20284 pts/0    00:00:02 hab-launch

This tells me that the running supervisor was started from a studio with a root at

~/habitat_jumpstart/national-parks/habitat

Which is probably an error, as the studio root should've been

~/habitat_jumpstart/national-parks

If we run any supervisor commands anywhere other than inside the ~/habitat_jumpstart/national-parks/habitat studio, we'll get the error:

~ 
hab$ hab sup status
✗✗✗
✗✗✗ [Err: 4] secret key mismatch
✗✗✗

But if we supply the supervisor secret from the ~/habitat_jumpstart/national-parks/habitat studio, it works:

hab$ HAB_CTL_SECRET=$(sudo cat /hab/studios/home--hab--habitat_jumpstart--national-parks--habitat/hab/sup/default/CTL_SECRET) hab sup status
No services loaded.

I'll look at improving the error message and documentation.
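
The manual steps above can be collected into a small helper. This is a minimal sketch, not part of Habitat: the function name find_sup_secrets and its root-directory argument are hypothetical, and it assumes, as in the repro above, that a running supervisor leaves a LOCK file containing a PID next to its CTL_SECRET.

```shell
# Hypothetical helper (not part of hab): print the CTL_SECRET path for every
# supervisor under the given root whose LOCK file names a live process.
# Assumption: LOCK holds a PID and CTL_SECRET sits in the same directory,
# as observed in the repro above.
find_sup_secrets() {
  local root="${1:-/hab}" lock pid
  find "$root" -path '*/sup/default/LOCK' 2>/dev/null | while IFS= read -r lock; do
    pid="$(cat "$lock" 2>/dev/null)" || continue
    # Skip anything that is not a plain positive integer.
    case "$pid" in
      ''|*[!0-9]*) continue ;;
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # It fails for processes owned by another user, so run this as root.
    if kill -0 "$pid" 2>/dev/null; then
      printf '%s\n' "$(dirname "$lock")/CTL_SECRET"
    fi
  done
}
```

Run from a root shell, find_sup_secrets prints one path per live supervisor; the contents of that file can then be passed through HAB_CTL_SECRET as in the command above.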

eeyun commented 6 years ago

Unless I'm missing something, which is totally possible, I don't think this is a documentation bug. Studios on Linux are guaranteed to support running multiple instances from any directory, including nested directory structures, even ones only a single level apart. For example, this worked before:

  1. cd /foo/bar
  2. hab studio enter
  3. hab svc status
  4. exit
  5. cd /foo
  6. hab studio enter
  7. hab svc status

No hab studio rm-ing, nothing; you should be able to enter and manipulate those studios. In fact, you should be able to exit one and jump back up to the other. Is what you're seeing other studios getting mounted nested inside each other? Or is it that when you exit one of those studios the supervisor process doesn't get terminated correctly?

raskchanky commented 6 years ago

I can confirm @eeyun's comment that the steps he listed definitely worked before 0.56. From a user's point of view, no one should care about shared secrets or anything else. It should Just Work™.

reset commented 6 years ago

We probably want to have some sort of escape mechanism for the Supervisor while in the Studio which disables the secret comparison in the connection handshake. It's really not necessary to authenticate you since it's a contained dev environment. I think that'd resolve this one.

baumanj commented 6 years ago

This works for me on 0.56.0:

08:08:13 AM jbauman@ubuntu:~
➤ cd foo/bar/

08:08:16 AM jbauman@ubuntu:~/f/bar
➤ hab studio enter
…
[1][default:/src:0]# hab svc status
No services loaded.
[2][default:/src:0]# exit
logout

08:08:31 AM jbauman@ubuntu:~/f/bar
hab studio enter ran for 6723 ms
➤ cd ~/foo

08:08:45 AM jbauman@ubuntu:~/foo
…
[1][default:/src:0]# hab svc status
No services loaded.
[3][default:/src:0]# exit
logout

08:10:36 AM jbauman@ubuntu:~/foo
➤ hab -V
hab 0.56.0/20180530234036

baumanj commented 6 years ago

@eeyun: is there anything more to do with this issue? Please re-open if so.

eeyun commented 6 years ago

If the original bug of studios getting confused about having the appropriate CTL_SECRET still exists, then this is still an issue. There shouldn't be a circumstance where our users can end up in a state where the studio is unusable. If the supervisor in the studio dies and can't be restarted due to a different studio secret existing in the filesystem then that is a bug.

baumanj commented 6 years ago

Ah, ok. I think I've got it now. I had the repro wrong. Here's how I can get the behavior that I think you're referring to:

➤ mkdir /tmp/{foo,bar}

➤ hab studio -r /tmp/foo/ enter
…
[1][default:/src:0]# hab sup status
No services loaded.

Then, in another shell (without exiting the studio rooted at /tmp/foo/):

➤ hab studio -r /tmp/bar/ enter
…
[1][default:/src:0]# hab sup status
✗✗✗
✗✗✗ [Err: 4] secret key mismatch
✗✗✗
[1]+  Done                    hab sup run "$@" > /hab/sup/default/sup.log 2>&1  (wd: /)
(wd now: /src)

Is that right? Thanks for following up, @eeyun.

baumanj commented 6 years ago

I think I've gotten to the bottom of this one, finally.

If you enter a studio when there's already a supervisor running outside that studio but on the same host, the second supervisor crashes (note the [1]+ Done hab sup run "$@" > /hab/sup/default/sup.log 2>&1 (wd: /) output above) with a bind failure:

hab-sup(ER)[components/sup/src/error.rs:450:9]: Butterfly error: Cannot bind to port: Os { code: 98, kind: AddrInUse, message: "Address already in use" }

Then, with no supervisor running inside the studio, any supervisor commands such as hab sup status will attempt to connect to whatever supervisor is running: hence the key mismatch.

We need to do two things here:

  1. Communicate to the user when the supervisor fails to start
  2. Find a way to allow multiple supervisors to run on the same host conveniently or at least make it clear why the supervisor can't run and what to do about it

(1) would just be a matter of checking the exit code of hab sup run (called by the sup-run script helper which the studio executes by default on enter), but that exit code is determined by the launcher, which exits 0 even if the supervisor exits ERR_NO_RETRY_EXCODE (86). I have a fix for that working and should have a PR up shortly.

(2) Requires a bit of discussion about how we want the user experience to work. I'll file a separate issue for that and link back here.
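
Because the launcher can exit 0 even when the supervisor has failed, a studio helper cannot rely on the exit code alone; it could instead check whether the background process is still alive shortly after starting it. The following is only a sketch of that idea: start_and_check, its grace period, and its messages are hypothetical and do not reflect how sup-run is actually implemented.

```shell
# Hypothetical sketch (not the real sup-run): start a command in the
# background, wait a grace period, and report if it has already exited.
# Usage: start_and_check GRACE_SECONDS LOGFILE COMMAND [ARGS...]
start_and_check() {
  local grace="$1" log="$2" pid status
  shift 2
  "$@" > "$log" 2>&1 &
  pid=$!
  sleep "$grace"
  # kill -0 sends no signal; it only tests whether the PID still exists.
  if kill -0 "$pid" 2>/dev/null; then
    echo "started (pid $pid)"
    return 0
  fi
  # The process is gone: collect its exit status and point at the log.
  status=0
  wait "$pid" || status=$?
  echo "exited early with status $status; see $log" >&2
  return 1
}
```

A call such as start_and_check 2 /hab/sup/default/sup.log hab sup run would then fail visibly on enter when the supervisor cannot bind its port, regardless of the exit code the launcher reports.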

christophermaier commented 6 years ago

Yeah, we need to do a bit of UX thinking about this. @ryankeairns and @fnichol will likely have some good input.

ryankeairns commented 6 years ago

Just let me know the time and place :)

sbrar7 commented 5 years ago

Hello, I am getting stuck because of the above issue and even logged a ticket here: https://forums.habitat.sh/t/err-4-secret-key-mismatch/1210

Can someone please help me get rid of this? I am using hab 0.83.0/20190712231714.

bixu commented 4 years ago

I'm also seeing this. Could the use of the --reuse flag when building be a possible culprit?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.