Closed — MoLow closed this issue 1 year ago
I was able to ssh into the machine but fetching from git failed with `fatal: detected dubious ownership in repository`.
I was on my way out so I did not have the time to continue investigating
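For reference, the "dubious ownership" error comes from git's `safe.directory` check (introduced in git 2.35.2), which triggers when the repository directory is owned by a different user than the one running git. The usual workaround, run as the user Jenkins uses, looks like this (the workspace path below is illustrative, not taken from the machine in question):

```shell
# Trust a specific repository path despite mismatched ownership
# (path is illustrative; use the actual Jenkins workspace directory):
git config --global --add safe.directory /home/iojs/build/workspace/node-test-commit

# Or, more broadly, trust every directory:
git config --global --add safe.directory '*'
```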
I believe this was fixed by https://github.com/nodejs/build/pull/3255 . Please reopen if not.
Thanks @MoLow!
I did apply https://github.com/nodejs/build/pull/3255 to everywhere I have access, but I don't think I got all of the workers.
After that, I still see a lot of different but related failures in Jenkins:
```
12:57:36 stderr: Warning: the ECDSA host key for 'github.com' differs from the key for the IP address '140.82.121.4'
12:57:36 Offending key for IP in /Users/iojs/.ssh/known_hosts:1
12:57:36 Matching host key in /Users/iojs/.ssh/known_hosts:7
12:57:36 Exiting, you have requested strict checking.
12:57:36 Host key verification failed.
```
I also don't think it was applied to containers, and I don't know what needs to be done for those.
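For anyone fixing these by hand: `ssh-keygen -R` removes the entries for a given host or IP without hand-editing the file. A sketch, demonstrated here on a scratch file (point `-f` at the worker's `~/.ssh/known_hosts` for real; the ed25519 key below is GitHub's published host key, and the IP is the one from the log above):

```shell
# Build a scratch known_hosts with a hostname entry and a stale per-IP entry:
KNOWN_HOSTS="$(mktemp)"
cat > "$KNOWN_HOSTS" <<'EOF'
github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
140.82.121.4 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
EOF

# Remove both the hostname entry and the per-IP entry:
ssh-keygen -R github.com -f "$KNOWN_HOSTS"
ssh-keygen -R 140.82.121.4 -f "$KNOWN_HOSTS"
```

`ssh-keygen -R` keeps a backup in `known_hosts.old`, which is handy if a cleanup goes wrong.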
(This is a drive-by comment, in an unfortunate piece of timing I'm on PTO today and Monday and am not on my work computer which has the ssh keys needed to get into any of the build machines/infra.)
When I added https://github.com/nodejs/build/pull/3212 I was very conservative -- that PR added the keys for GitHub but didn't remove any existing entries. It's complicated by ssh writing new entries into `known_hosts` with the IP address (in this case 140.82.121.4), which I think makes it very hard to use `ansible.builtin.known_hosts` to remove the existing entries, as the IP address could be anything in the range(s) operated by GitHub: https://api.github.com/meta

Maybe it's possible to remove the entries with the deprecated key via `ansible.builtin.lineinfile`.

(A non-optimal solution would be to delete the `known_hosts` file and recreate it, but that would mean the playbook wouldn't be idempotent.)
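A minimal sketch of that `lineinfile` idea, assuming it would live in the jenkins-worker role (the task name and path are illustrative; the regexp is a prefix of GitHub's revoked RSA key, so it matches entries regardless of whether the first field is a hostname or an IP):

```yaml
# Hypothetical task; name and path are illustrative, not from the playbook.
- name: remove known_hosts entries carrying GitHub's revoked RSA key
  ansible.builtin.lineinfile:
    path: /home/iojs/.ssh/known_hosts
    regexp: 'AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6Tb'
    state: absent
```

With `state: absent`, `lineinfile` deletes every line matching the regexp, which sidesteps not knowing which IPs appear in the host field.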
@joaocgreis did https://github.com/nodejs/build/pull/3255 resolve the issue on the machines you ran it against? Based on @richardlau's comment I'm not sure whether it would fix the issue.

EDIT: I guess other than errors related to specific IPs.
@richardlau if you happen to check in: if #3255 would have addressed the issue, would running

```
ansible-playbook ansible/playbooks/jenkins/docker-host.yaml --limit "test-digitalocean-ubuntu1804_docker-x64-1" -vv
```

be expected to fix up the containers on test-digitalocean-ubuntu1804_docker-x64-1?
The other question to @richardlau and other @nodejs/build members: in terms of public test workers, are there any `known_hosts` entries that should be there for specific IPs? I.e. is there any reason we can't just delete all IP-specific entries (for example, if we need to manually clean up some machines) when we come across them? I don't think so, but wanted to see if anybody else knew of a reason.
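If the answer turns out to be that IP-keyed entries are never needed, a manual cleanup could be as simple as deleting every line whose host field is a bare IPv4 address. A sketch on a scratch file (point it at the worker's `~/.ssh/known_hosts` for real; the key material is GitHub's published ed25519 key):

```shell
f="$(mktemp)"
printf '%s\n' \
  'github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl' \
  '140.82.121.4 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl' \
  > "$f"

# Delete entries whose first field is a bare IPv4 address (note this does
# not catch hashed entries, which begin with "|1|"):
sed -i -E '/^[0-9]+(\.[0-9]+){3}[ ,]/d' "$f"
```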
This seems to explain why the update was not made on the test-digitalocean-freebsd12-x64-X machines. I updated the 2 manually.
```
TASK [jenkins-worker : write github.com entry in known_hosts] **************************************************************************************************************************************************************
fatal: [test-digitalocean-freebsd12-x64-1]: FAILED! => {"msg": "Failed to set permissions on the temporary files Ansible needs to create when becoming an unprivileged user (rc: 1, err: chmod: invalid file mode: A+user:iojs:rx:allow\n}). For information on working around this, see https://docs.ansible.com/ansible-core/2.13/user_guide/become.html#risks-of-becoming-an-unprivileged-user"}
```
In this job https://ci.nodejs.org/job/node-test-commit-linux/nodes=ubuntu1804-64/51177/console it seems to complain about the ECDSA host key even though that was not supposed to have changed.
```
13:46:46 stderr: Warning: the ECDSA host key for 'github.com' differs from the key for the IP address '140.82.114.3'
13:46:46 Offending key for IP in /home/iojs/.ssh/known_hosts:8
13:46:46 Matching host key in /home/iojs/.ssh/known_hosts:13
13:46:46 Exiting, you have requested strict checking.
13:46:46 Host key verification failed.
13:46:46 fatal: Could not read from remote repository.
```
Disabling test-equinix_mnx-ubuntu1804-x64-1, where it ran, to see if the job can run on other machines.
Same problem on [test-digitalocean-ubuntu1804-x64-1](https://ci.nodejs.org/computer/test-digitalocean-ubuntu1804-x64-1), but logging into the machine I don't see any entries that have an IP associated with them. I guess the warning may be a red herring: the match is on a different key which, although the text does not obviously show it as associated with an IP, must be.
I'm done until Monday. I guess the question is whether we should use @richardlau's suggestion to remove the `known_hosts` files and recreate them, at least temporarily, as it seems we still have a large number of machines with a broken config.
> Maybe it's possible to remove the entries with the deprecated key via `ansible.builtin.lineinfile`.
Untested PR for the above: https://github.com/nodejs/build/pull/3256
> ...in terms of public test workers, are there any `known_hosts` entries that should be there for specific IPs?
I think for the test machines we only need the keys for github.com in `known_hosts`. The release machines also need the key to upload to the dist server. I think the benchmark machines need the key for the benchmark data machine.
I ran https://github.com/nodejs/build/pull/3256 on all possible hosts.
I also manually updated:
test-softlayer-alpine311_container-x64-1
test-softlayer-alpine312_container-x64-1
test-digitalocean-alpine311_container-x64-1
test-digitalocean-alpine312_container-x64-1
test-digitalocean-alpine311_container-x64-2
test-digitalocean-alpine312_container-x64-2
test-equinix_mnx-ubuntu1804-x64-1
test-equinix-ubuntu2004_sharedlibs_container-arm64-1
test-equinix-ubuntu2004_sharedlibs_container-arm64-2
test-equinix-ubuntu2004_sharedlibs_container-arm64-3
test-equinix-ubuntu1804_sharedlibs_container-arm64-1
test-equinix-ubuntu1804_sharedlibs_container-arm64-2
test-equinix-ubuntu1804_sharedlibs_container-arm64-3
test-equinix-ubuntu2004_container-arm64-1
test-equinix-ubuntu1804_container-arm64-1
test-equinix-debian10_container-armv7l-1
test-equinix-ubuntu2004_container-armv7l-1
test-equinix-centos7_container-arm64-1
test-equinix-debian10_container-armv7l-2
test-osuosl-ubuntu2004_sharedlibs_container-arm64-1
test-osuosl-ubuntu1804_sharedlibs_container-arm64-1
test-osuosl-ubuntu1804_container-arm64-1
test-osuosl-debian10_container-armv7l-1
test-osuosl-centos7_container-arm64-1
test-osuosl-ubuntu2004_container-arm64-1
test-osuosl-rhel8_container-arm64-1
test-osuosl-ubuntu2004_container-armv7l-1
Another batch of manual updates:
test-digitalocean-ubuntu1804_sharedlibs_container-x64-4
test-digitalocean-ubuntu1804_sharedlibs_container-x64-6
test-digitalocean-ubuntu1804_sharedlibs_container-x64-8
test-digitalocean-ubuntu1804_sharedlibs_container-x64-2
test-digitalocean-rhel8_arm_cross_container-x64-2
test-digitalocean-ubi81_container-x64-2
test-digitalocean-ubuntu1804_sharedlibs_container-x64-
test-digitalocean-ubuntu1804_arm_cross_container-x64-2
test-digitalocean-ubuntu1804_sharedlibs_container-x64-9
test-digitalocean-ubuntu1804_sharedlibs_container-x64-3
test-digitalocean-ubi81_container-x64-1
test-digitalocean-rhel8_arm_cross_container-x64-1
test-digitalocean-ubuntu1804_sharedlibs_container-x64-5
test-digitalocean-ubuntu1804_arm_cross_container-x64-1
test-digitalocean-ubuntu1804_sharedlibs_container-x64-1
test-digitalocean-ubuntu1604_arm_cross_container-x64-1
test-digitalocean-ubuntu1804_sharedlibs_container-x64-7
Fixed up test-softlayer-ubuntu1804_sharedlibs_container-x64-4
Fixed up test-softlayer-ubuntu1804_sharedlibs_container-x64-1
Fixed up test-softlayer-ubuntu1804_sharedlibs_container-x64-3
and test-softlayer-ubuntu1804_sharedlibs_container-x64-5
Fixed up test-softlayer-ubuntu1804_sharedlibs_container-x64-2
> @richardlau if you happen to check in: if #3255 would have addressed the issue, would running `ansible-playbook ansible/playbooks/jenkins/docker-host.yaml --limit "test-digitalocean-ubuntu1804_docker-x64-1" -vv` be expected to fix up the containers on test-digitalocean-ubuntu1804_docker-x64-1?
I missed this yesterday -- no, https://github.com/nodejs/build/pull/3255 won't affect the docker hosts as the docker-host playbook doesn't run the jenkins-worker role. I overlooked the docker containers when I did https://github.com/nodejs/build/pull/3212. It may make sense to extract the github known hosts tasks into its own role which can be called from both playbooks, in a similar fashion to the release-builder role, which writes the key for our dist server into `known_hosts` for the release machines.

(e.g. here is how the `docker` role calls the `release-builder` role for each container: https://github.com/nodejs/build/blob/d428b08c38bae3f7fe093d6c1df529570c45268e/ansible/roles/docker/tasks/main.yml#L92-L101)
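A rough sketch of what that extraction might look like, following the release-builder pattern linked above -- the role name, loop variable, and path here are all hypothetical:

```yaml
# Hypothetical: a shared role writing GitHub's host keys, callable from both
# the jenkins-worker playbook and the docker role's per-container loop.
# Role name, variables, and path are illustrative.
- name: write github.com host keys for each container
  ansible.builtin.include_role:
    name: github-known-hosts
  vars:
    known_hosts_file: '/home/iojs/{{ item.name }}/.ssh/known_hosts'
  loop: '{{ containers }}'
```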
fixed up test-softlayer-ubi81_container-x64-1
Resumed again - https://ci.nodejs.org/job/node-test-commit/60931/. It looks like the only failures due to GitHub last time were on test-softlayer-ubi81_container-x64-1, so here's hoping.
https://ci.nodejs.org/job/node-test-commit/60931/ made it through :). I suspect there might still be a few machines that need fixing up since Jenkins seems to favor using the same machines.
> fixed up test-softlayer-ubi81_container-x64-1
@mhdawson what were the steps to fix it? I tried this with no success
I would ssh into the container host, then `docker exec -it containerID /bin/bash`, and then update the key in `.ssh/known_hosts`, where containerID was the id for test-softlayer-ubi81_container-x64-1. When updating I would also remove all other rsa-based keys.
Ah! so the answer I was looking for was "manually" :)
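To spell the manual container fix out a bit: one approach is to write a clean `known_hosts` containing only GitHub's three current host keys (the same entries shown elsewhere in this thread) and copy it into each container. The `docker cp` target path is an assumption based on the paths in the logs above:

```shell
# Seed file with GitHub's three published host keys:
cat > /tmp/github_known_hosts <<'EOF'
github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
EOF

# Then, for each container (containerID as in the steps above):
# docker cp /tmp/github_known_hosts containerID:/home/iojs/.ssh/known_hosts
```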
I am closing this as the issue seems to be fixed
I hit this on test-rackspace-centos7-x64-1 this morning (job ref). It seems a few more hosts need to be updated.
I have updated the `known_hosts` file on test-rackspace-centos7-x64-1. It had a key set by the IP and by hostname, so I removed the redundant key. Fetch worked after that.
hmm FWIW I logged in just now and there were still 17 entries with the old key:
I've run the playbook from https://github.com/nodejs/build/pull/3256 and this has removed the 17 entries with the old key:
```
TASK [jenkins-worker : remove old github.com ssh keys] **********************************************************************************************************************************************************
[WARNING]: sftp transfer mechanism failed on [119.9.27.82]. Use ANSIBLE_DEBUG=1 to see detailed information
[WARNING]: scp transfer mechanism failed on [119.9.27.82]. Use ANSIBLE_DEBUG=1 to see detailed information
changed: [test-rackspace-centos7-x64-1] => (item=ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==) => {"ansible_loop_var": "item", "backup": "", "changed": true, "found": 17, "item": "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==", "msg": "17 line(s) removed"}
```
```
[root@test-rackspace-centos7-x64-1 ~]# cat /home/iojs/.ssh/known_hosts
github.com ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl
github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
20.205.243.166 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
[root@test-rackspace-centos7-x64-1 ~]#
```
Sorry I wasn't able to keep working on this on Friday.
The jenkins-workspace hosts (at least) also need the key from where the binary_tmp in use is stored, which might be any of the other jenkins-workspace hosts (this is configured with a variable in Jenkins).
I've merged https://github.com/nodejs/build/pull/3256 -- it's good enough for the "normal" CI hosts. I'll open a follow up PR to address the docker hosts and the containers on them.
All of the containers on test-digitalocean-ubuntu1804-docker-x64-2 seem to have reverted. They show as only having been up 6 hours, so my guess is that after manual updates we need to do something so they will persist over a restart. @richardlau do you know if we need to be committing the container?
eek. @mhdawson this would have been because I ran https://github.com/nodejs/build/pull/3265 against the docker hosts. That did remove the old ssh key, but I think what has happened is that it has now put in the other keys (ecdsa-sha2-nistp256 and ssh-ed25519), and the key being negotiated isn't matching the existing entries in `known_hosts` that are for one of the other keys (e.g. the new rsa key -- I did check after running the playbook that the old rsa key was removed).
I did an experiment. Looking at test-softlayer-ubi81_container-x64-1, I looked at the last failing job that ran on it, https://ci.nodejs.org/job/node-test-commit-linux-containered/36856/nodes=ubi81_sharedlibs_openssl111fips_x64/console
```
23:02:26 stderr: Warning: the ECDSA host key for 'github.com' differs from the key for the IP address '140.82.114.4'
```

In the `known_hosts` file for that container (`/home/iojs/test-softlayer-ubi81_container-x64-1/.ssh/known_hosts` on the host), I can see an entry for that IP address:

```
140.82.114.4 ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
```

This is GitHub's new ssh key. I removed it from the `known_hosts` file and reran a build, https://ci.nodejs.org/job/node-test-commit-linux-containered/36857/nodes=ubi81_sharedlibs_openssl111fips_x64/consoleFull (I canceled it after it had finished the git checkout). Looking at the `known_hosts` file again, a new entry has been created for the IP address:

```
140.82.114.4 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
```

This is GitHub's ecdsa-sha2-nistp256 key, instead of the rsa key that was there before.
So what I think has happened is that, prior to running the playbook from https://github.com/nodejs/build/pull/3265, the `known_hosts` files for the containers only had one entry beginning `github.com`, and it was the ssh-rsa key (the new one). All entries after that were for the specific IP addresses that github.com was resolving to (on some hosts the IP address is hashed) and had the ssh-rsa key. So I think git operations over ssh were matching that `github.com` line and then using that key and writing entries for the specific IP address.

After https://github.com/nodejs/build/pull/3265 there are now three entries in `known_hosts` for `github.com` (one for each of the different key types). Now ssh is writing new entries for the specific IP address with the `ecdsa-sha2-nistp256` key. In the cases where github.com gets resolved to an IP address the host has previously connected to before https://github.com/nodejs/build/pull/3265, ssh appears to be negotiating the `ecdsa-sha2-nistp256` key but is then checking it against the `ssh-rsa` key, as that is what is in `known_hosts`.

The key type selection by ssh is deterministic, but it was changed by https://github.com/nodejs/build/pull/3265: instead of negotiating the only key type that was previously there (`ssh-rsa`), all three of GitHub's supported key types are now present and ssh is deterministically preferring `ecdsa-sha2-nistp256`.

For now I've manually wiped out the `known_hosts` file for the containers. I then reran the playbook from https://github.com/nodejs/build/pull/3265 to recreate the `known_hosts` file and then started a new CI run, https://ci.nodejs.org/job/node-test-commit-linux-containered/36859/. I can see while this is running that new entries have been written to `known_hosts`, all with the `ecdsa-sha2-nistp256` key.
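That deterministic preference can be inspected locally: `ssh -G` prints the resolved client configuration, including the ordered `hostkeyalgorithms` list the client will offer. (OpenSSH also moves algorithms for which the host already has a key in `known_hosts` to the front of that list, which would explain why a lone ssh-rsa entry used to win the negotiation.)

```shell
# Print the ordered host key algorithm preference the ssh client would use
# for github.com. No network access is needed; -G only resolves the
# client-side configuration.
ssh -G github.com | grep -i '^hostkeyalgorithms'
```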
We've hit 'Host key verification failed.' on test-rackspace-win2012r2_vs2019-x64-6 on a recent CITGM job:
@StefanStojanovic Could you take a look at test-rackspace-win2012r2_vs2019-x64-6? We have Ansible tasks for updating the `known_hosts` file (latest update https://github.com/nodejs/build/pull/3265) but these aren't run for Windows (as the playbook runs a different set of roles).

@richardlau, sorry for just checking this now. Anyway, test-rackspace-win2012r2_vs2019-x64-6 had an outdated `known_hosts` file, so I updated it manually and it should work properly now. Regards.
https://ci.nodejs.org/job/node-test-pull-request/50585/console https://ci.nodejs.org/job/node-test-pull-request/50584/ https://ci.nodejs.org/job/node-test-pull-request/50583 https://ci.nodejs.org/job/node-test-pull-request/50582 etc. on [test-ibm-ubuntu1804-x64-1](https://ci.nodejs.org/computer/test-ibm-ubuntu1804-x64-1)