teemtee / tmt

Test Management Tool
MIT License

Possible race condition in the `testcloud` plugin #2687

Closed (psss closed 6 months ago)

psss commented 7 months ago

It seems that /tests/prepare/multihost sometimes fails to connect to the guest.

[guest-3]         multihost name: guest-3
[guest-3]         arch: x86_64
[guest-3]         distro: Fedora Linux 39 (Cloud Edition)
[guest-3]         kernel: 6.5.6-300.fc39.x86_64
[guest-3]         package manager: dnf
[guest-3]         selinux: yes
[guest-3]         is superuser: yes
[guest-1]         finished
[guest-1]         fail: Failed to connect in 300s.
    finish

[guest-2]         guest: stopped
[guest-2]         guest: removed
[guest-3]         guest: stopped
[guest-3]         guest: removed

Here's an example job and one more. As @happz mentioned in #2677, this smells like a race condition. @frantisekz, could you please have a look?

frantisekz commented 6 months ago

So, I dug into this a bit:

I was able to reproduce the issue with the following tmt plan (on some attempts):

/test:
    test: echo

/plan:
    execute:
        how: tmt
    discover:
        how: fmf

    provision:
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual

The thing is, in cases where it fails, as in the mentioned jobs, the ssh connection succeeds on another try. The cloud-init data are generated and appended properly in the affected VMs, so my best guess is that the ssh connection is attempted before cloud-init finishes its job in the VM.

I've checked whether something like disabling ssh during early boot would help (it would), letting cloud-init start it only after it finishes what it needs to do. The problem is that it seems to be impossible to pass grub arguments via libvirt (we would have to restructure it to use a direct kernel boot, which is a can of worms of its own).

Another possible way to handle it would be to append (tmt-side) "-o PasswordAuthentication=no" to the ssh connections that should be using an ssh key. This way, the connection would fail instead of showing a password prompt, and that failure should be handled just fine by tmt's existing retry mechanism.
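To illustrate the idea, here is a minimal sketch, not tmt's actual code. The names `build_ssh_command`, `run_command` and `connect_with_retry` are hypothetical, as are the attempt count and delay. It only shows how disabling password authentication turns a would-be password prompt into a plain non-zero exit code that a simple retry loop can handle.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def build_ssh_command(
        host: str, user: str, key_path: str, remote_command: str) -> List[str]:
    """Build an ssh command line that never falls back to a password prompt."""
    return [
        "ssh",
        "-i", key_path,
        # If the key is not accepted yet (e.g. cloud-init has not installed
        # it), ssh exits with an error instead of asking for a password.
        "-o", "PasswordAuthentication=no",
        f"{user}@{host}",
        remote_command,
    ]


def run_command(command: Sequence[str]) -> int:
    """Run the command without a terminal on stdin and return its exit code."""
    return subprocess.run(command, stdin=subprocess.DEVNULL).returncode


def connect_with_retry(
        command: Sequence[str],
        attempts: int = 5,
        delay: float = 1.0,
        runner: Callable[[Sequence[str]], int] = run_command,
        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry the command until it exits with 0 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if runner(command) == 0:
            return True
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts:
            sleep(delay)
    return False
```

The `runner` and `sleep` parameters exist only so the loop can be exercised without a real guest; in tmt the existing retry mechanism would play this role.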

I'll try to come up with a PR for this.

psss commented 5 months ago

@frantisekz, hmmm, seems the issue is still there. Here's a recent job where the multihost test failed. Now we have a detailed log as well.