teemtee / tmt

Test Management Tool
MIT License
85 stars 126 forks source link

rsync fails while copying data from an already executed test #3359

Open tuliom opened 3 days ago

tuliom commented 3 days ago

Recently we started to see rsync failures on the Testing Farm. The error happens from time to time, so I believe there is some parallelism taking place. This is happening with tmt version 1.38.0

I believe the following fragment helps to understand the order of the events:

        execute task #1: clang-tests on default-0
        how: tmt
            00:00:00 pass /clang-tests/tests/clang-format-diff (on default-0) [1/15]
            00:00:03 pass /clang-tests/tests/fedora-flags (on default-0) [2/15]
            00:00:01 pass /clang-tests/tests/gcc-clang-compatibility (on default-0) [3/15]
            00:00:00 pass /clang-tests/tests/libomp (on default-0) [4/15]
            00:32:56 pass /clang-tests/tests/llvm-test-suite (on default-0) [5/15]
            00:00:00 pass /clang-tests/tests/long-double (on default-0) [6/15]
            00:00:00 pass /clang-tests/tests/openmp-rpm (on default-0) [7/15]
            00:00:10 pass /clang-tests/tests/rhbz_1647130 (on default-0) [8/15]
            00:00:01 pass /clang-tests/tests/rhbz_1657544 (on default-0) [9/15]
            00:00:00 pass /clang-tests/tests/rhbz_2239619 (on default-0) [10/15]
            00:00:00 pass /clang-tests/tests/rhbz_482491 (on default-0) [11/15]
            00:00:00 pass /clang-tests/tests/rpmmacros (on default-0) [12/15]
            00:00:08 pass /clang-tests/tests/toolchains (on default-0) [13/15]
            00:00:01 pass /clang-tests/tests/ucrt64-toolchain (on default-0) [14/15]
            00:00:00 pass /clang-tests/tests/use-correct-dwarf-default (on default-0) [15/15]

        execute task #2: compiler-rt-tests on default-0
        how: tmt
            00:00:06 pass /compiler-rt-tests/tests/broken-symlinks (on default-0) [1/3]
            00:00:03 pass /compiler-rt-tests/tests/fp16-abi (on default-0) [2/3]
            00:00:00 pass /compiler-rt-tests/tests/sanity (on default-0) [3/3]

        execute task #3: libomp-tests on default-0
        how: tmt
            00:00:00 pass /libomp-tests/tests/sanity (on default-0) [1/1]

        execute task #4: llvm-tests on default-0
        how: tmt
        fail: Failed to push workdir to the guest. This usually means that login as 'root' to the guest does not work.
        fail: Command 'rsync -s -R -r -z --links --safe-links --delete -e 'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p22 -i /etc/citool.d/id_rsa_artemis -oPasswordAuthentication=no' /var/ARTIFACTS/work-snapshot-gatingx591m28o/tests/snapshot-gating root@18.117.234.166:/' returned 24.

        summary: 19 tests executed
    finish

        summary: 0 tasks completed

plan failed

The exception was caused by 1 earlier exceptions

Cause number 1:

    execute step failed

    The exception was caused by 1 earlier exceptions

    Cause number 1:

        Command 'rsync -s -R -r -z --links --safe-links --delete -e 'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p22 -i /etc/citool.d/id_rsa_artemis -oPasswordAuthentication=no' /var/ARTIFACTS/work-snapshot-gatingx591m28o/tests/snapshot-gating root@18.117.234.166:/' returned 24.

        stderr (3 lines)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        Warning: Permanently added '18.117.234.166' (ED25519) to the list of known hosts.
        file has vanished: "/var/ARTIFACTS/work-snapshot-gatingx591m28o/tests/snapshot-gating/execute/data/guest/default-0/clang-tests/tests/rhbz_1647130-8/output.txt.copy"
        rsync warning: some files vanished before they could be transferred (code 24) at main.c(1338) [sender=3.3.0]
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So, while copying the files for llvm-tests, the files from the other tests end up being copied too, while something else is still playing with output from clang-tests.

The workdir for this execution is available at: https://artifacts.dev.testing-farm.io/26f8be4d-e57f-46af-8098-9900b7bb8d6d/work-snapshot-gatingx591m28o/

happz commented 3 days ago

I don't think there's any *.copy created by tmt, I do suspect TF might be a culprit here.

@ukulekek since you were involved in the archiving part of TF recently, WDYT? Could it be that TF creates various *.copy files in tmt's workdir, removing them at some point? Because if that's the case, I think we see a race condition in action: tmt starting an rsync command, with output.txt.copy in the workdir, but the file then gets removed by TF while rsync is still running, which results in rsync reporting it as a vanishing file.

ukulekek commented 2 days ago

Yes, we caused the problem because we do create copies of files in the working directory and delete them after syncing to the artifacts storage server. I created TFT-2973 issue and will work on it, so it should land next release

tuliom commented 2 days ago

In that case, it's fair to close this issue, unless the community believes this report should trigger changes to tmt too. Thank you all!