pygmystack / pygmy

the pygmy stack is a container stack for local development
MIT License

[bug] Sometimes a rogue ssh-agent container is created which causes ssh-agent to stop working #355

Open tallytarik opened 2 years ago

tallytarik commented 2 years ago

Describe the bug

If your SSH key is invalid, running pygmy up will report no errors, but SSH will not work.

But actually, the ssh-agent becomes permanently broken, so you can't get SSH working again until you pygmy clean.

To Reproduce

  1. Do something silly to make your SSH key appear invalid to ssh-add, like chmod +x ~/.ssh/id_rsa
  2. pygmy up
  3. In a Lagoon project CLI container, test SSH - for example, drush @env ssh or ssh-add -l
  4. Fix your SSH key (for example, chmod -x ~/.ssh/id_rsa)
  5. pygmy up, or pygmy addkey --key ~/.ssh/id_rsa
  6. Repeat step 3 to test SSH
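The steps above can be scripted; a rough sketch (the helper function is mine and hypothetical, not part of pygmy — step 3 still has to be run manually inside your project's CLI container):

```shell
# Sketch of the repro steps above (hypothetical helper, not part of pygmy).
reproduce_rogue_agent() {
  local key="${1:-$HOME/.ssh/id_rsa}"

  chmod +x "$key"             # 1. make the key look invalid to ssh-add
  pygmy up                    # 2. start the stack with the broken key
  # 3. test SSH inside a Lagoon CLI container, e.g. `ssh-add -l`
  chmod -x "$key"             # 4. restore the key
  pygmy addkey --key "$key"   # 5. try to re-add the fixed key
  # 6. repeat the SSH test; the agent stays broken
}
```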

Expected behavior

pygmy up should report an error when the SSH key can't be added, and fixing the key should let it be added again without recreating containers.

Output

Apologies for the wall of output here...

When the SSH key is broken, pygmy up doesn't show the identity line, which is easy to miss. It doesn't show any errors.

Successfully started amazeeio-dnsmasq
Successfully started amazeeio-haproxy
Successfully started amazeeio-mailhog
Successfully started amazeeio-ssh-agent
Already connected amazeeio-haproxy to amazeeio-network
Already connected amazeeio-mailhog to amazeeio-network
Already connected amazeeio-ssh-agent to amazeeio-network
[... service tests ...]

pygmy status:

[*] amazeeio-mailhog: Running as container amazeeio-mailhog
[*] amazeeio-haproxy: Running as container amazeeio-haproxy
[*] amazeeio-dnsmasq: Running as container amazeeio-dnsmasq
[ ] amazeeio-ssh-agent-add-key is not running
[*] Resolv Linux Resolver is properly connected
[... service tests ...]

Note a couple of things here:

  1. There is no line that says amazeeio-ssh-agent is running, but it actually is.
  2. There is an error amazeeio-ssh-agent-add-key is not running. This is an odd one because AFAIK the -add-key container isn't usually running permanently, and it doesn't usually show up in pygmy status output.

docker ps -a:

CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS                      PORTS                                                                          NAMES
7130a0938ca4   pygmystack/ssh-agent   "/run.sh ssh-add /ro…"   3 minutes ago    Exited (1) 3 minutes ago                                                                                   amazeeio-ssh-agent-add-key
67c8335e5189   pygmystack/ssh-agent   "/run.sh ssh-agent"      38 minutes ago   Up 3 minutes                                                                                               amazeeio-ssh-agent
c958602c6dc6   pygmystack/mailhog     "MailHog"                38 minutes ago   Up 3 minutes                80/tcp, 8025/tcp, 0.0.0.0:1025->1025/tcp, :::1025->1025/tcp                    amazeeio-mailhog
f5d2bb7b7cc2   pygmystack/haproxy     "/app/docker-entrypo…"   38 minutes ago   Up 3 minutes                0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp       amazeeio-haproxy
142418c95523   pygmystack/dnsmasq     "dnsmasq -k --log-fa…"   38 minutes ago   Up 3 minutes                0.0.0.0:6053->53/tcp, 0.0.0.0:6053->53/udp, :::6053->53/tcp, :::6053->53/udp   amazeeio-dnsmasq

You can see that there's an amazeeio-ssh-agent-add-key container which has exited. Normally this container doesn't exist (I assume pygmy deletes it after successfully adding a key).
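One way to spot the leftover container without eyeballing the whole docker ps table is to grep for its name; a small sketch (the helper name is mine):

```shell
# Hypothetical helper: report whether `docker ps -a` output piped in on
# stdin mentions the leftover add-key container.
has_rogue_addkey() {
  grep -q 'amazeeio-ssh-agent-add-key'
}

# Usage:
#   docker ps -a --format '{{.Names}}' | has_rogue_addkey && echo "rogue container found"
```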

When I fix the SSH key and run pygmy up, the output is the same as before - no identity line, but also no error.

If I run pygmy addkey with no parameters, there is no output. (Is this intentional? I assumed it would act like pygmy up with the default key location, but maybe it doesn't?)

If I run pygmy addkey --key ~/.ssh/id_rsa:

container already created, or namespace is already taken

pygmy down:

Successfully stopped amazeeio-ssh-agent-add-key
Successfully stopped amazeeio-dnsmasq
Successfully stopped amazeeio-haproxy
Successfully stopped amazeeio-mailhog

Note that it doesn't say that it has stopped the amazeeio-ssh-agent container. It hasn't! If I check with docker ps -a, I can see all of the other containers are stopped, but amazeeio-ssh-agent is still running.

pygmy up, again, reports the same output as before. The pygmy addkey command also fails with the same error.

It seems that at this point, there is no way to recover from this state without deleting containers. The ssh-agent is permanently broken.

To fix it, you can either:

  1. docker rm the exited amazeeio-ssh-agent-add-key container (and, since pygmy down leaves it behind, the amazeeio-ssh-agent container), or
  2. run pygmy clean.

Then run pygmy up and it should all work: ssh-agent starts properly, and your key is added properly.
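In shell form, the manual recovery amounts to something like this (a sketch; the helper name is mine):

```shell
# Hypothetical recovery helper: remove the leftover containers, then
# bring the stack back up. Removing the exited add-key container alone
# was enough in later tests; removing the agent too mirrors `pygmy clean`.
recover_ssh_agent() {
  docker rm -f amazeeio-ssh-agent-add-key 2>/dev/null || true
  docker rm -f amazeeio-ssh-agent 2>/dev/null || true
  pygmy up
}
```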

tallytarik commented 2 years ago

Ok, this is weird. I think there's a race condition somewhere...

I just ran through those steps again, to verify the problem happens in that sequence, and I didn't get the rogue amazeeio-ssh-agent-add-key container.

There's still no error that the SSH key is invalid, which would be nice to fix. But unlike what I described above, I could fix my key permissions, then run pygmy addkey --key ~/.ssh/id_rsa, and it started working without having to recreate containers.

So it definitely seems like the rogue amazeeio-ssh-agent-add-key container sends pygmy into that unrecoverable state. But you're not guaranteed to get that rogue container with my above steps.

To test the race condition theory a bit more, I just ran through the steps again, and ran pygmy up a few times in a row. Most of the time, this is the output:

Already Running amazeeio-dnsmasq
Already Running amazeeio-haproxy
Already Running amazeeio-mailhog
Already Running amazeeio-ssh-agent
Already connected amazeeio-haproxy to amazeeio-network
Already connected amazeeio-mailhog to amazeeio-network
Already connected amazeeio-ssh-agent to amazeeio-network

Still no identity line (again, expected, because the SSH key is broken). Nothing about -add-key. But then, on a subsequent pygmy up, with no changes in between...

Already Running amazeeio-dnsmasq
Already Running amazeeio-haproxy
Already Running amazeeio-mailhog
Already connected amazeeio-haproxy to amazeeio-network
Already connected amazeeio-mailhog to amazeeio-network
Successfully connected amazeeio-ssh-agent-add-key to amazeeio-network
Already connected amazeeio-ssh-agent-add-key to amazeeio-network

A surprise amazeeio-ssh-agent-add-key appears! At this point, docker ps -a shows that rogue container which has exited. pygmy status now shows [ ] amazeeio-ssh-agent-add-key is not running. It's now in the unrecoverable state I described above.
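Since the only visible symptom in the pygmy up output is that add-key connect line, here's a sketch that flags a bad run from the output alone (the helper name is mine):

```shell
# Hypothetical helper: scan `pygmy up` output (stdin) and flag the run
# if the add-key container shows up in the network-connect lines.
flag_bad_run() {
  if grep -q 'connected amazeeio-ssh-agent-add-key'; then
    echo "rogue add-key container detected"
    return 1
  fi
  echo "looks clean"
}

# Usage:
#   pygmy up | tee /dev/tty | flag_bad_run
```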

fubarhouse commented 2 years ago

Going to try to address this one... it might be causing me a bit of grief too. Open to ideas - thank you both for the analysis.

fubarhouse commented 2 years ago

#366 should fix this - all good signs right now.

Feel free to test, will revisit at some point tomorrow.

fubarhouse commented 2 years ago

SSH key validation is active on the master branch - if you're happy to compile. Will make it to the next release. However, the underlying problem remains.

A passphrase-protected SSH key is not yet supported, but at least an invalid key won't pass this validation now.

Leaving this open.

tallytarik commented 2 years ago

Nice! I'll test that out soon.

Although, I think the rogue amazeeio-ssh-agent-add-key container may actually be a different issue, caused by a separate race condition.

I've been using pygmy-go for a couple of months now, so have probably run pygmy up about a hundred times. Of those, maybe once or twice I've seen the rogue amazeeio-ssh-agent-add-key container that I describe above...

But that's with a totally valid SSH key! So, I said earlier that this happened with an invalid SSH key, but I'm not sure that's right. It seems like a race condition that's actually entirely random.

christopher-hopper commented 2 years ago

I have run into this issue when specifying a key to use at the command-line with the --key flag. I am using macOS and Docker Desktop.

I have multiple available keys and need to use the "ed25519" key specifically with Lagoon and pygmy.

# ls -1 ~/.ssh/id_*
/Users/chopper/.ssh/id_ed25519
/Users/chopper/.ssh/id_ed25519.pub
/Users/chopper/.ssh/id_rsa
/Users/chopper/.ssh/id_rsa.pub

I cannot reliably reproduce the issue. The issue happened for me after running these commands, on two occasions.

pygmy clean
pygmy up --key ~/.ssh/id_ed25519

pygmy status showed the ssh-agent not working, and docker ps -a showed an amazeeio-ssh-agent-add-key container that I had to remove with docker rm amazeeio-ssh-agent-add-key to fix the issue.

If I come across steps to reliably reproduce this, I will come back.

tallytarik commented 2 years ago

@fubarhouse I've renamed this issue to capture the actual problem I was seeing. There may well have been a problem with key validation, but I'm fairly sure it's not related. (as an aside, I've tested the latest version and I can see the validation working well)

To recap, the problem is that every so often, pygmy up will create a rogue/duplicate amazeeio-ssh-agent-add-key container.

There is no error thrown, but it causes ssh-agent to effectively stop working. For example, in a GovCMS PaaS scaffold project (where the CLI container has volumes_from: [container:amazeeio-ssh-agent]), the CLI container no longer has an SSH key.
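For context, compose's volumes_from mount maps to docker's --volumes-from flag, which is why a broken agent container empties the key out of every CLI container. Roughly (the helper and the myproject/cli image name are hypothetical):

```shell
# Rough docker-CLI equivalent of the compose `volumes_from` mount
# (the CLI image name is hypothetical): the CLI container borrows the
# agent container's volumes, including the agent socket.
run_cli_with_agent() {
  docker run --rm --volumes-from amazeeio-ssh-agent myproject/cli ssh-add -l
}
```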

This seemed like a race condition that I triggered quite rarely, so I built a script to test it. I'm running this on Ubuntu 20.04:

#!/bin/bash

# This script runs `pygmy up`, and checks whether there is a rogue addkey
# container by checking the output of `docker ps -a`. If there is, it will
# record it. Otherwise, run `pygmy down` and try again.

for i in $(seq 1 100)
do
  pygmy up >/dev/null 2>&1

  if docker ps -a | grep -q 'add-key'
  then
    echo "${i}: ROGUE ADDKEY CONTAINER FOUND"
    echo " "
    docker ps -a
    exit 1
  else
    echo "${i}: All good."
  fi

  pygmy down >/dev/null 2>&1
  sleep 1
done

I ran this a few times while testing and it usually failed within 50 runs:

1: All good.
2: All good.
3: ROGUE ADDKEY CONTAINER FOUND

CONTAINER ID   IMAGE                                                                      COMMAND                  CREATED          STATUS                              PORTS                                                                          NAMES
57a2138c0b31   pygmystack/ssh-agent                                                       "/run.sh ssh-add /ro…"   1 second ago     Exited (0) Less than a second ago                                                                                  amazeeio-ssh-agent-add-key
6280b2bb1220   pygmystack/mailhog                                                         "MailHog"                37 seconds ago   Up Less than a second               80/tcp, 8025/tcp, 0.0.0.0:1025->1025/tcp, :::1025->1025/tcp                    amazeeio-mailhog
1487f8e3d259   pygmystack/haproxy                                                         "/app/docker-entrypo…"   37 seconds ago   Up 1 second                         0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp       amazeeio-haproxy
081de7613670   pygmystack/dnsmasq                                                         "dnsmasq -k --log-fa…"   38 seconds ago   Up 1 second                         0.0.0.0:6053->53/tcp, 0.0.0.0:6053->53/udp, :::6053->53/tcp, :::6053->53/udp   amazeeio-dnsmasq
e5a97e302023   pygmystack/ssh-agent                                                       "/run.sh ssh-agent"      38 seconds ago   Up 2 seconds                                                                                                       amazeeio-ssh-agent
6e92e8d60545   project_chrome                                                             "/usr/local/bin/entr…"   4 days ago       Exited (137) 21 seconds ago                                                                                        project_chrome_1
84d0052d9735   project_nginx                                                              "/sbin/tini -- /lago…"   6 weeks ago      Exited (0) 31 seconds ago                                                                                          project_nginx_1
7501a22322cc   project_php                                                                "/sbin/tini -- /lago…"   7 weeks ago      Exited (0) 31 seconds ago                                                                                          project_php_1
4c5146e51e65   project_cli                                                                "/sbin/tini -- /lago…"   7 weeks ago      Exited (137) 21 seconds ago                                                                                        project_cli_1
85401fdba13a   project_mariadb                                                            "/sbin/tini -- /lago…"   7 weeks ago      Exited (0) 31 seconds ago                                                                                          project_mariadb_1
2299907b714d   project_solr                                                               "/sbin/tini -- /lago…"   7 weeks ago      Exited (143) 31 seconds ago                                                                                        project_solr_1
48b4e5b27f70   project_redis                                                              "/sbin/tini -- /lago…"   7 weeks ago      Exited (0) 20 seconds ago                                                                                          project_redis_1

The top container in the list is the extra one which causes the failure. At this point, I can docker rm the container, run pygmy up again, and it resumes working.

Hopefully this helps! 🥴