mitogen-hq / mitogen

Distributed self-replicating programs in Python
https://mitogen.networkgenomics.com/
BSD 3-Clause "New" or "Revised" License

Play fails if temp directory is deleted mid-play #1061

Open markafarrell opened 7 months ago

markafarrell commented 7 months ago

If the Ansible temp directory is removed mid-play, Mitogen does not recreate it and the play fails.

An exception occurred during task execution. To see the full traceback, use -vvv. The error was:     _os.mkdir(file, 0o700)
fatal: [172.17.0.9]: FAILED! => {"msg": "Unexpected failure during module execution: builtins.FileNotFoundError: [Errno 2] No such file or directory: '/tmp/.ansible-test/tmp/ansible_mitogen_runner_y_absj50'
  File \"<stdin>\", line 3876, in _dispatch_one
  File \"master:/home/xxxxxxxx/work/mitogen-repro/.venv/lib/python3.10/site-packages/ansible_mitogen/target.py\", line 415, in run_module
    return impl.run()
           ^^^^^^^^^^
  File \"master:/home/d384492/work/mitogen-repro/.venv/lib/python3.10/site-packages/ansible_mitogen/runner.py\", line 445, in run
    self.setup()
  File \"master:/home/d384492/work/mitogen-repro/.venv/lib/python3.10/site-packages/ansible_mitogen/runner.py\", line 934, in setup
    self._stdio = NewStyleStdio(self.args, self.get_temp_dir())
                                           ^^^^^^^^^^^^^^^^^^^
  File \"master:/home/d384492/work/mitogen-repro/.venv/lib/python3.10/site-packages/ansible_mitogen/runner.py\", line 361, in get_temp_dir
    self._temp_dir = tempfile.mkdtemp(
                     ^^^^^^^^^^^^^^^^^
  File \"/usr/lib/python3.11/tempfile.py\", line 507, in mkdtemp
    _os.mkdir(file, 0o700)
", "stdout": ""}

With the normal Ansible strategy, the temp directory is recreated and the play succeeds.

Ansible version: 2.14.15

Host OS: Ubuntu (WSL2) Target OS: Debian12 (docker)

Host Python: Python 3.10.12 Target Python: Python 3.11.2

See https://github.com/markafarrell/mitogen-repro-issue-1061 for reproduction instructions

moreati commented 7 months ago

I'm attempting to reproduce this. Step 4 of https://github.com/markafarrell/mitogen-repro-issue-1061 doesn't leave a running container. Instead it immediately exits.

alex@ubuntu2004:~/mitogen-repro-issue-1061$ docker run -dt --name target-server \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    --privileged \
    --rm \
    geerlingguy/docker-debian12-ansible:latest;

964532f2b017d53a6292b476e5e463e5157f8520db7e0a6ca6e4d3d3176885ee
alex@ubuntu2004:~/mitogen-repro-issue-1061$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
alex@ubuntu2004:~/mitogen-repro-issue-1061$ docker --version
Docker version 24.0.5, build 24.0.5-0ubuntu1~22.04.1
alex@ubuntu2004:~/mitogen-repro-issue-1061$ uname -a
Linux ubuntu2004 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

markafarrell commented 7 months ago

I'm guessing the fact that you are using aarch64 is probably the issue.

There is an arm64 version of that image so it should work.

Do you get anything from:

docker logs target-server

moreati commented 7 months ago

alex@ubuntu2004:~/mitogen-repro-issue-1061$ docker rm target-server
target-server
alex@ubuntu2004:~/mitogen-repro-issue-1061$ docker run -dt --name target-server -v /sys/fs/cgroup:/sys/fs/cgroup:ro --privileged geerlingguy/docker-debian12-ansible:latest;
dea854a953ce1386fcf0ca7b5a28065b5749c982dab711e98fb7210f5968ba39
alex@ubuntu2004:~/mitogen-repro-issue-1061$ docker logs target-server
systemd 252.22-1~deb12u1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization docker.
Detected architecture arm64.

Welcome to Debian GNU/Linux 12 (bookworm)!

Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

markafarrell commented 7 months ago

Can you try adding --cgroupns=host and changing the /sys/fs/cgroup mount from ro to rw?

https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva

moreati commented 7 months ago

That did it, and I now see the _os.mkdir(file, 0o700) error. That leads to the next questions:

  1. Why don't the unit and integration tests see this? Which extra ingredient(s) matter - Debian 12? systemd? Something Jeff Geerling added?
  2. Can we reproduce it with the existing Mitogen CI images and/or the localhost test?

markafarrell commented 7 months ago

  1. Why don't the unit and integration tests see this? Which extra ingredient(s) matter - Debian 12? systemd? Something Jeff Geerling added?

I think this will happen regardless of OS, systemd, etc. The issue is that at https://github.com/mitogen-hq/mitogen/blob/master/ansible_mitogen/runner.py#L361 we are essentially doing

mkdir {{ ansible_remote_tmp }}/ansible_mitogen_runner_{{ random stuff }}/

If ansible_remote_tmp doesn't exist this fails.

The existence of this (ansible_remote_tmp) is only checked once, just after we connect to the target, so if it is removed after the connection happens then we see this failure.
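
The failure mode described above can be sketched in a few lines of Python 3. This is only an illustration, not Mitogen's code: make_runner_temp_dir is a hypothetical helper, the directory names are stand-ins for ansible_remote_tmp, and the retry-once approach is an assumption about one possible fix rather than the project's chosen one.

```python
import os
import shutil
import tempfile


def make_runner_temp_dir(base_dir, prefix="ansible_mitogen_runner_"):
    """Create a unique temp dir under base_dir, recreating base_dir if needed.

    Hypothetical helper for illustration only; not part of ansible_mitogen.
    """
    try:
        return tempfile.mkdtemp(prefix=prefix, dir=base_dir)
    except FileNotFoundError:
        # The parent vanished after the one-time check; recreate and retry once.
        os.makedirs(base_dir, mode=0o700, exist_ok=True)
        return tempfile.mkdtemp(prefix=prefix, dir=base_dir)


# Stand-in for ansible_remote_tmp, created once "at connection time".
root = tempfile.mkdtemp()
remote_tmp = os.path.join(root, "tmp")
os.makedirs(remote_tmp)

# Simulate the directory being deleted mid-play.
shutil.rmtree(remote_tmp)

# Plain mkdtemp() only creates the leaf directory, so it fails here.
try:
    tempfile.mkdtemp(prefix="ansible_mitogen_runner_", dir=remote_tmp)
    plain_failed = False
except FileNotFoundError:
    plain_failed = True

# The retrying helper recreates the parent and succeeds.
runner_dir = make_runner_temp_dir(remote_tmp)
runner_dir_exists = os.path.isdir(runner_dir)
print(plain_failed, runner_dir_exists)

# Clean up the scratch directories.
shutil.rmtree(root)
```

Whether recreating the directory silently is the right behaviour, or whether the one-time check should simply be repeated before each module run, is a design question this sketch does not settle.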

2. Can we reproduce it with the existing Mitogen CI images and/or the localhost test?

It should be very easy to reproduce for both localhost and any other image by using a playbook similar to what I have in my reproduction repo. If you can point me to where the test should live, I can quickly create one.

moreati commented 7 months ago

There are unit tests that mention is_good_temp() in https://github.com/mitogen-hq/mitogen/blob/bb9c51b3e9cc39fceddd55578bb89680fa4e1acc/tests/ansible/tests/target_test.py#L31. Integration tests should probably be added amongst https://github.com/mitogen-hq/mitogen/blob/bb9c51b3e9cc39fceddd55578bb89680fa4e1acc/tests/ansible/integration/runner/all.yml.

For running tests I'm relying on the Azure CI, and (force) pushing changes. We can squash any interim/WIP commits afterwards.

moreati commented 6 months ago

  1. Why don't the unit and integration tests see this? Which extra ingredient(s) matter - Debian 12? systemd? Something Jeff Geerling added?

A factor I previously missed: the repro playbook in https://github.com/markafarrell/mitogen-repro-issue-1061/blob/262591aecadb3ae255c904de17617519f8389673/playbook.yml explicitly deletes $ANSIBLE_REMOTE_TMP; it's not systemd or similar doing it behind the scenes. There's much less mystery here than I thought, if any.