prometheus-community / ansible

Ansible Collection for Prometheus
https://prometheus-community.github.io/ansible/
Apache License 2.0

Concurrent local extraction of binaries fails sometimes #457

Open travisdowns opened 2 hours ago

travisdowns commented 2 hours ago

We see occasional failures of binary extraction after download like so:

TASK [prometheus.prometheus._common : Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz] ***
changed: [52.27.162.6 -> localhost]
changed: [34.212.107.78 -> localhost]
changed: [35.160.16.87 -> localhost]
changed: [34.211.30.255 -> localhost]
fatal: [35.91.204.184 -> localhost]: FAILED! => {"changed": true, "dest": "/tmp/node_exporter-linux-arm64/1.8.2", "extract_results": {"cmd": ["/usr/bin/tar", "--extract", "-C", "/tmp/node_exporter-linux-arm64/1.8.2", "-z", "--show-transformed-names", "--strip-components=1", "-f", "/root/.ansible/tmp/ansible-tmp-1730343705.1365232-1179-87748719038915/source"], "err": "", "out": "", "rc": 0}, "gid": 0, "group": "root", "handler": "TgzArchive", "mode": "0755", "msg": "Unexpected error when accessing exploded file: [Errno 2] No such file or directory: b'/tmp/node_exporter-linux-arm64/1.8.2/node_exporter'", "owner": "root", "size": 81, "src": "/root/.ansible/tmp/ansible-tmp-1730343705.1365232-1179-87748719038915/source", "state": "directory", "uid": 0}
changed: [54.212.243.52 -> localhost]
changed: [35.91.104.132 -> localhost]

It's probably less than 1% of extraction executions that fail, but with many hosts the chance of at least one failure per playbook run is quite high.

We delegate the download & extraction to localhost, so this task runs in parallel N times for N hosts. The paths are identical in every case, so N tasks are concurrently extracting into the same destination path, which I suspect is the cause. For example, from the tar manual:

When extracting files, if tar discovers that the extracted file already exists, it normally replaces the file by removing it before extracting it, to prevent confusion in the presence of hard or symbolic links.

So there will be brief windows in which an already-written file (in this case, the top-level directory) is deleted and replaced by a concurrent extraction, and if an earlier writer tries to list it during such a window it will see it as missing.
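For illustration, the failing shape is roughly the following (a simplified sketch, not the collection's actual task; the module call, paths and options here are assumptions based on the log above): every host delegates the same unarchive to localhost with an identical dest.

```yaml
# Simplified sketch of the racing pattern (illustrative, not the role's real task):
# each of the N hosts delegates this to localhost with the same dest, so N tar
# processes unpack into one directory concurrently.
- name: Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz
  ansible.builtin.unarchive:
    src: "/tmp/node_exporter-1.8.2.linux-arm64.tar.gz"
    dest: "/tmp/node_exporter-linux-arm64/1.8.2"
    remote_src: true
    extra_opts:
      - --strip-components=1
  delegate_to: localhost
```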

Besides the race, it's also inefficient to extract the same file N times.

I guess run_once would fix this if it weren't for the complication of multi-arch. Maybe serial execution + skip if already exists would work with little refactoring of the existing approach?
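A rough sketch of that idea, using only keywords that already exist in core Ansible (throttle to serialize the task across hosts, and the unarchive module's creates guard to skip once the binary is present); the paths and task name are illustrative:

```yaml
# Rough sketch of "serial execution + skip if already exists" (illustrative):
# throttle: 1 runs the delegated task for one host at a time, and creates:
# turns later invocations into a no-op once the binary has been extracted.
- name: Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz
  ansible.builtin.unarchive:
    src: "/tmp/node_exporter-1.8.2.linux-arm64.tar.gz"
    dest: "/tmp/node_exporter-linux-arm64/1.8.2"
    remote_src: true
    creates: "/tmp/node_exporter-linux-arm64/1.8.2/node_exporter"
    extra_opts:
      - --strip-components=1
  delegate_to: localhost
  throttle: 1
```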

I think this race existed before the big refactor in 0.20.0, but the refactor added list_files, which may be what triggers the specific failure above.

travisdowns commented 2 hours ago

Actually, maybe run_once would work? The TASK text above indicates it is invoked for a single binary (arm64 here), so perhaps all the per-host invocations would produce the same result?
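If that holds, the minimal change might just be run_once on the delegated task (again only a sketch, with the same illustrative paths as above):

```yaml
# Sketch of the run_once idea (illustrative): the delegated extraction runs a
# single time and its result is applied to every host in the play.
- name: Unpack binary archive node_exporter-1.8.2.linux-arm64.tar.gz
  ansible.builtin.unarchive:
    src: "/tmp/node_exporter-1.8.2.linux-arm64.tar.gz"
    dest: "/tmp/node_exporter-linux-arm64/1.8.2"
    remote_src: true
    extra_opts:
      - --strip-components=1
  delegate_to: localhost
  run_once: true
```

The multi-arch caveat from the first comment would still apply, though: run_once executes with the first host's variables, so a play mixing architectures would need something like one such task (or loop item) per distinct architecture.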