vagrant-libvirt/vagrant-libvirt

Vagrant provider for libvirt.
https://vagrant-libvirt.github.io/vagrant-libvirt/
MIT License

vagrant + libvirt network problems #1737

Open StPanning opened 1 year ago

StPanning commented 1 year ago

Hello, I have network problems with vagrant/libvirt.

After vagrant up I can vagrant ssh into my boxes. However, after a few minutes the session drops:

vagrant@vagrant:/vagrant/ansible$ client_loop: send disconnect: Broken pipe

After that the network becomes unresponsive for a while. Then vagrant ssh sort of works again, but now it asks for a password instead of using the usual key authentication.

vagrant ssh hangs
(base) murphy@tron:~/git/kubernetes_cluster/kubernetes_nodes$ vagrant ssh ansible
^C==> ansible: Waiting for cleanup before exiting...
Vagrant exited after cleanup due to external interrupt.
[...] after several tries
(base) murphy@tron:~/git/kubernetes_cluster/kubernetes_nodes$ vagrant ssh ansible
vagrant@192.168.121.38's password: 

This happens periodically, but the boxes are still running and responsive when I check via VNC. The NATted interface has the correct IP address assigned. I'm using the libvirt provider, and these are the messages I'm seeing there:

    tail -f /var/log/libvirtd
    2023-04-29 18:27:48.697+0000: 10254: error : virNetSocketReadWire:1793 : End of file while reading data: Input/output error
    2023-04-29 18:40:14.198+0000: 10254: error : virNetSocketReadWire:1793 : End of file while reading data: Input/output error
    2023-04-29 18:51:27.835+0000: 10254: error : virNetSocketReadWire:1793 : End of file while reading data: Input/output error

I'm not sure whether the problem is libvirt/KVM or Vagrant, since vagrant ssh also behaves strangely.

here is my Vagrantfile:

  def private_kubenet(vm, ip_address)
    vm.network "private_network",
               type: "dhcp",
               ip: ip_address,
               libvirt__netmask: "255.255.255.0",
               libvirt__domain_name: "kubenet.local",
               libvirt__network_address: "192.168.200.0",
               libvirt__dhcp_start: "192.168.200.128",
               libvirt__network_name: "kubenet"
  end

  Vagrant.configure("2") do |config|
    # custom image
    config.vm.box = "ubuntu22.04" 

    config.vm.provider :libvirt do |libvirt|
      libvirt.driver = "kvm"
      libvirt.memory = 4096
      libvirt.cpus = 2
      libvirt.cpu_mode = "host-passthrough"
    end
    config.vm.provision "shell", path: "ansible_client_prep.sh"

    # ansible-host
    config.vm.define "ansible" do |ansible|
      private_kubenet(ansible.vm,"192.168.200.10")

      ansible.vm.provision "shell", inline: <<-SHELL
        [...]
      SHELL

    end

    config.vm.define "kube-master" do |kube_master|
      private_kubenet(kube_master.vm,"192.168.200.11")
    end
  end


Here is the config xml of the ansible vm:

<domain type='kvm' id='4'>
  <name>kubernetes_nodes_ansible</name>
  <uuid>ca593343-06e2-4016-9952-e74141910da6</uuid>
  <description>Source: /home/murphy/git/kubernetes_cluster/kubernetes_nodes/Vagrantfile</description>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-jammy'>hvm</type>
    <boot dev='hd'/>
    <bootmenu enable='no'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='on'/>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/kubernetes_nodes_ansible.img' index='1'/>
      <backingStore type='file' index='2'>
        <format type='qcow2'/>
        <source file='/var/lib/libvirt/images/ubuntu22.04_vagrant_box_image_0_1682757118_box.img'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='ua-box-volume-0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='piix3-uhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:1d:0c:dc'/>
      <source network='vagrant-libvirt' portid='93e264c8-0612-45b6-80ee-d6425f20f683' bridge='virbr4'/>
      <target dev='vnet6'/>
      <model type='virtio'/>
      <driver iommu='off'/>
      <alias name='ua-net-0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </interface>
    <interface type='network'>
      <mac address='52:54:00:ae:3b:69'/>
      <source network='kubenet' portid='b18d6869-d244-46b6-ac33-845c4371756b' bridge='virbr5'/>
      <target dev='vnet7'/>
      <model type='virtio'/>
      <driver iommu='off'/>
      <alias name='ua-net-1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/4'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/4'>
      <source path='/dev/pts/4'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <graphics type='vnc' port='5901' autoport='yes' websocket='5701' listen='127.0.0.1' keymap='en-us'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <audio id='1' type='none'/>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='5'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-ca593343-06e2-4016-9952-e74141910da6</label>
    <imagelabel>libvirt-ca593343-06e2-4016-9952-e74141910da6</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+64055:+129</label>
    <imagelabel>+64055:+129</imagelabel>
  </seclabel>
</domain>

Any ideas?

electrofelix commented 1 year ago

https://bugzilla.redhat.com/show_bug.cgi?id=1821277 suggests that the message from libvirtd is not necessarily an issue, it depends on whether there was an operation that was outstanding and the client closed early.

I don't have an obvious fix to suggest. One possibility would be to try adding SSH keep-alive options.
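A minimal sketch of that in the Vagrantfile, assuming Vagrant's stock config.ssh.keep_alive setting (the keep-alive interval itself is controlled by the SSH client, not Vagrant):

```ruby
Vagrant.configure("2") do |config|
  # Send SSH keep-alive packets so idle sessions are less likely
  # to be dropped; this is Vagrant's built-in ssh setting.
  config.ssh.keep_alive = true
end
```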

Experimenting with setting nic_model_type might be a good option here. I found some time ago that some network emulations worked better than others, and it varied across distros and kernels. Usually virtio is good, but it might be no harm to try some of the others. https://vagrant-libvirt.github.io/vagrant-libvirt/configuration.html#domain-specific-options
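For example, a sketch that swaps the NIC model (e1000 here is just one alternative to try, not a recommendation):

```ruby
Vagrant.configure("2") do |config|
  config.vm.provider :libvirt do |libvirt|
    # "virtio" is the usual default; "e1000" or "rtl8139" are
    # alternatives that sometimes behave better on specific kernels.
    libvirt.nic_model_type = "e1000"
  end
end
```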

You could also try setting management_network_mtu for the management network and libvirt__mtu for the private network to values less than 1500, in case you are bumping into an issue where sshd occasionally has trouble with messages coming through. I've seen ssh hangs with the MTU set at 1500, but usually only on physical networks, when you discover some device in the network path is not being careful about fragmenting packets. I wouldn't expect it to be an issue for a connection to a VM unless there is a driver bug in there as well.
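A sketch of lowering both MTUs (1400 is an arbitrary test value, and the private-network values mirror the Vagrantfile above):

```ruby
Vagrant.configure("2") do |config|
  config.vm.provider :libvirt do |libvirt|
    # MTU for the vagrant-libvirt management network
    libvirt.management_network_mtu = 1400
  end

  # MTU for the private network defined in the Vagrantfile above
  config.vm.network "private_network",
                    ip: "192.168.200.10",
                    libvirt__network_name: "kubenet",
                    libvirt__mtu: 1400
end
```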

The fact that the sshd instance in the guest stops accepting the ssh key suggests it might be a guest-side issue. Any chance you have something running in the guest that keeps consuming memory and after a little while exhausts it? That would make sshd hang until the kernel's OOM killer causes sshd to be restarted, which would allow connecting again after a little while, but it may not have enough memory left to then read the authorized keys for the user.

I'd try the following; at the very least, if you can capture a log from the guest side of what is happening, there is a better chance of debugging.

It might be worth connecting to the guest console (virsh console <vm>), depending on whether kernel messages are sent to the console or already appear via the VNC terminal.

StPanning commented 1 year ago

I was able to resolve this.

The error was in my custom built ubuntu box.

I packed the box image without removing /etc/machine-id. Because of this, every box derived from this image gets the same IP address assigned on the management network, even though the MAC addresses of the management interfaces are different. Why is this the case? It was really hard to debug.

Deleting /etc/machine-id is only the first step, because then the management network interface doesn't get any IP address assigned at all.

My solution to this problem:

Before you pack your custom image, log in to the VM that you want to pack, remove /etc/machine-id, and create a script that creates the machine-id if it doesn't already exist:

cat <<EOF>/usr/local/bin/init_machine_id.sh 
#!/bin/bash

if [ -e /etc/machine-id ]
then
    exit 0
fi

/usr/bin/systemd-machine-id-setup
EOF

chmod +x /usr/local/bin/init_machine_id.sh 
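The guard in the script can be exercised locally against a scratch path (the /tmp file below is a hypothetical stand-in for /etc/machine-id, and the urandom pipeline stands in for systemd-machine-id-setup):

```shell
# Simulate the guard: only generate an id when the file is absent.
id_file=/tmp/fake-machine-id
rm -f "$id_file"

if [ -e "$id_file" ]; then
    echo "machine-id already present, leaving it alone"
else
    # Stand-in for /usr/bin/systemd-machine-id-setup: 32 hex chars
    head -c16 /dev/urandom | od -An -tx1 | tr -d ' \n' > "$id_file"
fi

cat "$id_file"
```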

Then create a service that calls the script before the network target is started:

cat <<EOF>/etc/systemd/system/init_machine_id.service
[Unit]
Description=Initialize Machine-ID

Before=network-pre.target
Wants=network-pre.target

[Service]
Type=simple
ExecStart=/usr/local/bin/init_machine_id.sh
Restart=on-failure
RestartSec=10
KillMode=process

[Install]
WantedBy=network.target
EOF

Unit files don't need the execute bit; instead, enable the service so it runs at boot:

systemctl enable init_machine_id.service

Is there a more straightforward solution?

rgl commented 1 year ago

@StPanning, for Ubuntu, I'm using a simpler solution at https://github.com/rgl/ubuntu-vagrant/blob/2e72a6de546b2056df330f5f441c546aaced1ab2/provision.sh#L90-L97
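For context, the usual shape of that kind of approach (a hedged sketch, not a copy of the linked script): truncate /etc/machine-id to an empty file so systemd regenerates it on first boot, and keep the D-Bus copy as a symlink to the canonical file. Shown here against a scratch directory so it can be run safely outside a guest:

```shell
# $root stands in for the guest's filesystem root; on a real guest
# you would operate on /etc/machine-id directly (as root).
root=$(mktemp -d)
mkdir -p "$root/etc" "$root/var/lib/dbus"
echo "stale-id" > "$root/etc/machine-id"
echo "stale-id" > "$root/var/lib/dbus/machine-id"

# Truncating (not deleting) /etc/machine-id makes systemd treat the
# next boot as first boot and generate a fresh id automatically.
truncate -s0 "$root/etc/machine-id"

# Point the D-Bus machine id at the canonical file.
rm -f "$root/var/lib/dbus/machine-id"
ln -s /etc/machine-id "$root/var/lib/dbus/machine-id"
```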