oracle / vagrant-projects

Vagrant projects for Oracle products and other examples
Universal Permissive License v1.0

Host key verification failed #414

Closed pguerin3 closed 2 years ago

pguerin3 commented 2 years ago

I have Oracle VirtualBox installed (not libvirt). To start an OLCNE cluster with Vagrant, I go to the ~/vagrant-projects/OLCNE directory and run 'vagrant up --provider=virtualbox'. The worker VM is created, but during the creation of the master VM this error appears:

    master1: Host key verification failed.

Sounds like something to do with the SSH configuration, so I have relaxed host key checking in the SSH config file:

StrictHostKeyChecking no

Unfortunately this doesn't help, and I still get the same build error.

Environment

Host OS: Fedora 35
Kernel version (for Linux host): Linux 5.15.11-200.fc35.x86_64
Vagrant version: 2.2.16
Vagrant provider: VirtualBox 6.1.32 r149290
Vagrant project: ~/vagrant-projects/OLCNE

Additional information

> vagrant ssh-config
Host worker1
  HostName 127.0.0.1
  User vagrant
  Port 2222
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile /home/me/vagrant-projects/OLCNE/.vagrant/machines/worker1/virtualbox/private_key
  IdentitiesOnly yes
  LogLevel FATAL

Host master1
  HostName 127.0.0.1
  User vagrant
  Port 2200
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile /home/me/vagrant-projects/OLCNE/.vagrant/machines/master1/virtualbox/private_key
  IdentitiesOnly yes
  LogLevel FATAL

The tail of the build log is here:

    master1: ===== Bootstrap the Oracle Linux Cloud Native Environment Platform Agent on all nodes =====
    master1:     /etc/olcne/bootstrap-olcne.sh --secret-manager-type file --olcne-node-cert-path /etc/olcne/pki/production/node.cert --olcne-ca-path /etc/olcne/pki/production/ca.cert --olcne-node-key-path /etc/olcne/pki/production/node.key --olcne-component api-server
    master1:     ssh 192.168.56.101 /etc/olcne/bootstrap-olcne.sh --secret-manager-type file --olcne-node-cert-path /etc/olcne/pki/production/node.cert --olcne-ca-path /etc/olcne/pki/production/ca.cert --olcne-node-key-path /etc/olcne/pki/production/node.key --olcne-component agent
    master1: Returned a non-zero code: 255
    master1: Last output lines:
    master1: Host key verification failed.
    master1: See /var/tmp/cmd_lUL0V.log for details
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.
pguerin3 commented 2 years ago

I don't know if this is relevant to the problem, but the link below says this:

Fedora uses the libvirt family of tools as its virtualization solution.

https://docs.fedoraproject.org/en-US/quick-docs/getting-started-with-virtualization/index.html

And I'm not using libvirt.....

scoter-oracle commented 2 years ago

This doesn't seem to be related to that. As already mentioned in the other thread (issue #413), you can configure which one is the default provider. That said, you could try to reproduce the same issue with a simpler VM (just Oracle Linux, for example) and see if it works. In any case you should check the log details, as indicated in the output above:

master1: See /var/tmp/cmd_lUL0V.log for details

pguerin3 commented 2 years ago

When I try to inspect /var/tmp/, there are no cmd* files present, so I can't read the log. It seems they are cleared out immediately.

Trying the other projects: This works for ~/vagrant-projects/OracleLinux/8

vagrant up --provider=virtualbox

This also works for the same project

EXTEND=container-tools vagrant up --provider=virtualbox

So this proves that the Vagrant/VirtualBox combination works.

Back to ~/vagrant-projects/OLCNE, there is no obvious problem with the key management during the build of worker1:

==> worker1: Waiting for machine to boot. This may take a few minutes...
    worker1: SSH address: 127.0.0.1:2222
    worker1: SSH username: vagrant
    worker1: SSH auth method: private key
    worker1: 
    worker1: Vagrant insecure key detected. Vagrant will automatically replace
    worker1: this with a newly generated keypair for better security.
    worker1: 
    worker1: Inserting generated public key within guest...
    worker1: Removing insecure key from the guest if it's present...
    worker1: Key inserted! Disconnecting and reconnecting using new SSH key...
==> worker1: Machine booted and ready!

Also no problem with the build of master1:

==> master1: Waiting for machine to boot. This may take a few minutes...
    master1: SSH address: 127.0.0.1:2200
    master1: SSH username: vagrant
    master1: SSH auth method: private key
    master1: 
    master1: Vagrant insecure key detected. Vagrant will automatically replace
    master1: this with a newly generated keypair for better security.
    master1: 
    master1: Inserting generated public key within guest...
    master1: Removing insecure key from the guest if it's present...
    master1: Key inserted! Disconnecting and reconnecting using new SSH key...
==> master1: Machine booted and ready!

Even after a reboot, the following error is still appearing:

    master1:     ssh 192.168.56.101 /etc/olcne/bootstrap-olcne.sh --secret-manager-type file --olcne-node-cert-path /etc/olcne/pki/production/node.cert --olcne-ca-path /etc/olcne/pki/production/ca.cert --olcne-node-key-path /etc/olcne/pki/production/node.key --olcne-component agent
    master1: Returned a non-zero code: 255
    master1: Last output lines:
    master1: Host key verification failed.
    master1: See /var/tmp/cmd_R0VLC.log for details
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.
AmedeeBulle commented 2 years ago

The IPs appearing in the logs above are not the ones defined in this repository.

To rule out any issue with the changes you made, can you revert to the original Vagrantfile / scripts of this project, create a /etc/vbox/networks.conf file with:

    * 192.168.0.0/16
    * fe80::/64

and retry?
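For reference, the suggested setup could be sketched on the host like this (assuming sudo access; the CIDR ranges are the ones quoted above):

```shell
# Allow VirtualBox host-only networks in the 192.168.0.0/16 and fe80::/64 ranges
# (requires root; /etc/vbox/networks.conf is honoured by VirtualBox >= 6.1.28)
sudo mkdir -p /etc/vbox
sudo tee /etc/vbox/networks.conf > /dev/null <<'EOF'
* 192.168.0.0/16
* fe80::/64
EOF
```

After creating the file, re-run `vagrant up --provider=virtualbox` from the project directory.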

The Host key verification failed error happens in guest-to-guest communication; this should not interact with the host (other than using the vboxnet bridge defined in VirtualBox). The deployment should be exactly the same, whatever the host OS is.

Note that:

SSH keys for node-to-node communication are generated by the first node coming online and removed by the operator/master node; a re-run will generate a new pair and the keys won't match...

The restriction on the IP range for host-only networks was introduced in VirtualBox 6.1.28; we still need to update this project (either by documenting the use of /etc/vbox/networks.conf or by changing the IP range used).
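Given the note above about re-runs generating a new key pair that won't match, the safest retry is a clean rebuild rather than a `vagrant up` against half-provisioned machines. A sketch (note: this discards the existing VMs and their state):

```shell
# Discard the partially-provisioned cluster and rebuild from scratch,
# so the node-to-node SSH key pair is generated fresh on all nodes
cd ~/vagrant-projects/OLCNE
vagrant destroy -f
vagrant up --provider=virtualbox
```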

pguerin3 commented 2 years ago

I'm using the information specified here: https://www.virtualbox.org/manual/UserManual.html#network_hostonly

So to allow everything, I'm specifying 0.0.0.0/0.

> cat /etc/vbox/networks.conf
* 0.0.0.0/0 ::/0

However I will try your suggestions soon.

AmedeeBulle commented 2 years ago

Allowing everything is fine as well. My point is that you aren't using an exact copy of this project and I can't reproduce your issue.

pguerin3 commented 2 years ago

After doing a git fetch to remove all the changes to the Vagrantfile and rerunning, I've concluded that the root cause is that I don't have enough physical memory on my laptop for Vagrant to finish without error.

AmedeeBulle commented 2 years ago

The bare minimum is one master and one worker node with 3GB of memory each, so you need to be able to run two VirtualBox VMs with a total of 6GB of memory free for the VMs (no over-commit!)...

If you don't run ISTIO, you could decrease the memory allocation, but the VMs tend to be less responsive and you might experience install failures due to timeouts.

The OLCNE Vagrant project should run on a laptop with 16GB of RAM (provided you don't have other major workloads). Anything less than that might be difficult.
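As a quick sanity check before `vagrant up`, you can verify the host has roughly 6 GiB available. A sketch using procps `free` (the 6144 MiB threshold is an assumption derived from the two 3GB nodes above):

```shell
# Print whether the 'available' column of free(1) meets the ~6 GiB
# needed for the two OLCNE VMs (no over-commit)
free -m | awk '/^Mem:/ { print ($7 >= 6144) ? "enough memory for 2 VMs" : "not enough free memory" }'
```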

Worth noting: you should configure an Oracle Container Registry mirror in your region; using the default one will likely cause timeout issues at install time.