Open tallytarik opened 2 years ago
Ok, this is weird. I think there's a race condition somewhere...
I just ran through those steps again, to verify the problem happens in that sequence, and I didn't get the rogue amazeeio-ssh-agent-add-key container.
There's still no error that the SSH key is invalid, which would be nice to fix. But unlike what I described above, I could fix my key permissions, then run pygmy addkey --key ~/.ssh/id_rsa, and it started working without having to recreate containers.
So it definitely seems like the rogue amazeeio-ssh-agent-add-key container sends pygmy into that unrecoverable state. But you're not guaranteed to get that rogue container with my above steps.
To test the race condition theory a bit more, I just ran through the steps again, and ran pygmy up a few times in a row. Most of the time, this is the output:
Already Running amazeeio-dnsmasq
Already Running amazeeio-haproxy
Already Running amazeeio-mailhog
Already Running amazeeio-ssh-agent
Already connected amazeeio-haproxy to amazeeio-network
Already connected amazeeio-mailhog to amazeeio-network
Already connected amazeeio-ssh-agent to amazeeio-network
Still no identity line (again, expected, because the SSH key is broken). Nothing about -add-key. But then, on a subsequent pygmy up, with no changes in between...
Already Running amazeeio-dnsmasq
Already Running amazeeio-haproxy
Already Running amazeeio-mailhog
Already connected amazeeio-haproxy to amazeeio-network
Already connected amazeeio-mailhog to amazeeio-network
Successfully connected amazeeio-ssh-agent-add-key to amazeeio-network
Already connected amazeeio-ssh-agent-add-key to amazeeio-network
A surprise amazeeio-ssh-agent-add-key appears! At this point, docker ps -a shows that rogue container, which has exited. pygmy status now shows [ ] amazeeio-ssh-agent-add-key is not running. It's now in the unrecoverable state I described above.
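The symptom above can be checked from a script. This is only a minimal sketch: the function name is made up for illustration and is not part of pygmy, and it assumes pygmy status keeps printing the "is not running" line exactly as quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of pygmy): succeeds when `pygmy status`
# lists the add-key container as "not running", the symptom described above.
addkey_listed_in_status() {
    # -F matches the text as a fixed string, so hyphens and spaces are literal.
    pygmy status 2>&1 | grep -qF 'amazeeio-ssh-agent-add-key is not running'
}
```

It could be used as: if addkey_listed_in_status; then echo "rogue add-key container suspected"; fi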
Going to try to address this one... it might be causing me a bit of grief too. Open to ideas - thank you both for the analysis.
Feel free to test, will revisit at some point tomorrow.
SSH key validation is active on the master branch - if you're happy to compile. Will make it to the next release. However, the underlying problem remains.
A passphrase-protected SSH key is not yet supported, but at least an invalid key won't pass this validation now.
Leaving this open.
Nice! I'll test that out soon.
Although, I think the rogue amazeeio-ssh-agent-add-key container may actually be a different issue, caused by a separate race condition.
I've been using pygmy-go for a couple of months now, so have probably run pygmy up about a hundred times. Of those, maybe once or twice I've seen the rogue amazeeio-ssh-agent-add-key container that I describe above...
But that's with a totally valid SSH key! So, I said earlier that this happened with an invalid SSH key, but I'm not sure that's right. It seems like a race condition that's actually entirely random.
I have run into this issue when specifying a key to use at the command-line with the --key flag. I am using macOS and Docker Desktop.
I have multiple available keys and need to use the "ed25519" key specifically with Lagoon and pygmy.
# ls -1 ~/.ssh/id_*
/Users/chopper/.ssh/id_ed25519
/Users/chopper/.ssh/id_ed25519.pub
/Users/chopper/.ssh/id_rsa
/Users/chopper/.ssh/id_rsa.pub
I cannot reliably reproduce the issue. It happened for me on two occasions, after running these commands:
pygmy clean
pygmy up --key ~/.ssh/id_ed25519
pygmy status showed the ssh-agent not working, and docker ps -a showed an amazeeio-ssh-agent-add-key container that I had to remove with docker rm amazeeio-ssh-agent-add-key to fix the issue.
If I come across steps to reliably reproduce I will come back.
@fubarhouse I've renamed this issue to capture the actual problem I was seeing. There may well have been a problem with key validation, but I'm fairly sure it's not related. (As an aside, I've tested the latest version and I can see the validation working well.)
To recap, the problem is that every so often, pygmy up will create a rogue/duplicate amazeeio-ssh-agent-add-key container.
There is no error thrown, but it causes ssh-agent to effectively stop working. For example, in a GovCMS PaaS scaffold project (where the CLI container has volumes_from: [container:amazeeio-ssh-agent]), the CLI container no longer has an SSH key.
This seemed like a race condition that I triggered quite rarely, so I built a script to test it. I'm running this on Ubuntu 20.04:
#!/bin/bash
# This script runs `pygmy up`, and checks whether there is a rogue addkey
# container by checking the output of `docker ps -a`. If there is, it will
# record it. Otherwise, run `pygmy down` and try again.
for i in $(seq 1 100)
do
    pygmy up >/dev/null 2>&1
    if docker ps -a | grep -q 'add-key'
    then
        echo "${i}: ROGUE ADDKEY CONTAINER FOUND"
        echo " "
        docker ps -a
        exit 1
    else
        echo "${i}: All good."
    fi
    pygmy down >/dev/null 2>&1
    sleep 1
done
I ran this a few times while testing and it usually failed within 50 runs:
1: All good.
2: All good.
3: ROGUE ADDKEY CONTAINER FOUND
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
57a2138c0b31 pygmystack/ssh-agent "/run.sh ssh-add /ro…" 1 second ago Exited (0) Less than a second ago amazeeio-ssh-agent-add-key
6280b2bb1220 pygmystack/mailhog "MailHog" 37 seconds ago Up Less than a second 80/tcp, 8025/tcp, 0.0.0.0:1025->1025/tcp, :::1025->1025/tcp amazeeio-mailhog
1487f8e3d259 pygmystack/haproxy "/app/docker-entrypo…" 37 seconds ago Up 1 second 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp amazeeio-haproxy
081de7613670 pygmystack/dnsmasq "dnsmasq -k --log-fa…" 38 seconds ago Up 1 second 0.0.0.0:6053->53/tcp, 0.0.0.0:6053->53/udp, :::6053->53/tcp, :::6053->53/udp amazeeio-dnsmasq
e5a97e302023 pygmystack/ssh-agent "/run.sh ssh-agent" 38 seconds ago Up 2 seconds amazeeio-ssh-agent
6e92e8d60545 project_chrome "/usr/local/bin/entr…" 4 days ago Exited (137) 21 seconds ago project_chrome_1
84d0052d9735 project_nginx "/sbin/tini -- /lago…" 6 weeks ago Exited (0) 31 seconds ago project_nginx_1
7501a22322cc project_php "/sbin/tini -- /lago…" 7 weeks ago Exited (0) 31 seconds ago project_php_1
4c5146e51e65 project_cli "/sbin/tini -- /lago…" 7 weeks ago Exited (137) 21 seconds ago project_cli_1
85401fdba13a project_mariadb "/sbin/tini -- /lago…" 7 weeks ago Exited (0) 31 seconds ago project_mariadb_1
2299907b714d project_solr "/sbin/tini -- /lago…" 7 weeks ago Exited (143) 31 seconds ago project_solr_1
48b4e5b27f70 project_redis "/sbin/tini -- /lago…" 7 weeks ago Exited (0) 20 seconds ago project_redis_1
The top container in the list is the extra one which causes the failure. At this point, I can docker rm the container, run pygmy up again, and it resumes working.
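Until the underlying race is fixed, that manual workaround could be wrapped in a small function. This is a sketch only: the function and variable names are invented here and are not part of pygmy, and it assumes (as observed above) that the add-key container should not exist once pygmy up has finished.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of pygmy) around the manual workaround:
# run `pygmy up`, and if a leftover add-key container exists afterwards,
# remove it and run `pygmy up` once more.
ROGUE_NAME="amazeeio-ssh-agent-add-key"

rogue_addkey_exists() {
    # List every container name (running or exited), one per line, and
    # look for an exact, fixed-string match.
    docker ps -a --format '{{.Names}}' | grep -qxF "${ROGUE_NAME}"
}

pygmy_up_checked() {
    pygmy up
    if rogue_addkey_exists; then
        echo "Found leftover ${ROGUE_NAME}; removing it and retrying" >&2
        docker rm "${ROGUE_NAME}" >/dev/null
        pygmy up
    fi
}
```

Since the race appears to be random, the second pygmy up could in principle leave the container behind again, so this retries only once rather than looping.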
Hopefully this helps! 🥴
Describe the bug
If your SSH key is invalid, running pygmy up will report no errors, but SSH will not work.
But actually, the ssh-agent becomes permanently broken, so you can't get SSH working again until you pygmy clean.
To Reproduce
1. Make your SSH key invalid for ssh-add, like chmod +x ~/.ssh/id_rsa
2. pygmy up
3. drush @env ssh or ssh-add -l
4. Fix the key (chmod -x ~/.ssh/id_rsa)
5. pygmy up, or pygmy addkey --key ~/.ssh/id_rsa
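The steps above can be sketched as a function for repeated testing. The function name and the key-path argument are illustrative, not from the report, and step 3 uses pygmy status to inspect the agent's identities instead of drush @env ssh or ssh-add -l, which is a substitution made here for simplicity.

```shell
#!/bin/bash
# Sketch of the reproduction steps. The function name and key-path argument
# are hypothetical; the individual commands come from the report.
reproduce_invalid_key() {
    local key="${1:-$HOME/.ssh/id_rsa}"

    chmod +x "${key}"             # step 1: loosen permissions so ssh-add rejects the key
    pygmy up                      # step 2: reportedly prints no error
    pygmy status                  # step 3 (substituted): inspect the agent's identities
    chmod -x "${key}"             # step 4: undo step 1
    pygmy addkey --key "${key}"   # step 5: try to add the now-valid key
}
```

Calling reproduce_invalid_key with no argument targets ~/.ssh/id_rsa; it temporarily changes that key's permissions, so a throwaway key is the safer choice.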
Expected behavior
- pygmy up shows an error that your SSH key is invalid (ssh-add itself will show an error in my example case, so pygmy should pass through this error)
- pygmy status shows that there are no ssh-agent identities
- I can fix the key, run pygmy up (or addkey) and it shows that the key has been added
- pygmy status shows my SSH key (ssh-add -l shows the right identity)
Output
Apologies for the wall of output here...
When the SSH key is broken, pygmy up doesn't show the identity line, which is easy to miss. It doesn't show any errors.
pygmy status:
Note a couple of things here:
- It doesn't show that amazeeio-ssh-agent is running, but it actually is.
- It shows amazeeio-ssh-agent-add-key is not running. This is an odd one because AFAIK the -add-key container isn't usually running permanently, and it doesn't usually show up in pygmy status output.
docker ps -a:
You can see that there's an amazeeio-ssh-agent-add-key container which exited. Normally this container doesn't exist (I assume pygmy deletes it after successfully adding a key).
When I fix the SSH key and run pygmy up, the output is the same as before - no identity line, but also no error.
If I run pygmy addkey with no parameters, there is no output (is this intentional? I assumed it would act like pygmy up with the default key location, but maybe it doesn't?)
If I run pygmy addkey --key ~/.ssh/id_rsa:
pygmy down:
Note that it doesn't say that it has stopped the amazeeio-ssh-agent container. It hasn't! If I check with docker ps -a, I can see all of the other containers are stopped, but amazeeio-ssh-agent is still running.
pygmy up, again, reports the same output as before. The pygmy addkey command also fails with the same error.
It seems that at this point, there is no way to recover from this state without deleting containers. The ssh-agent is permanently broken.
To fix it, you can either:
- docker rm the exited amazeeio-ssh-agent-add-key container
- pygmy clean (which deletes all of the containers)
Then run pygmy up and it should all work - ssh-agent starts properly, your key is added properly.