We see occasional failures of binary extraction after download. It's probably less than 1% of total extraction executions, but with many hosts the overall failure rate is quite high.
We delegate the download and extraction to localhost, so for N hosts this runs in parallel N times. The paths are the same in every case, so N tasks concurrently extract to the same destination path, which I suspect is the cause. For example, from the GNU tar manual:
> When extracting files, if tar discovers that the extracted file already exists, it normally replaces the file by removing it before extracting it, to prevent confusion in the presence of hard or symbolic links.
So there will be brief windows where an already-written file (in this case, the top-level directory) is deleted and replaced by a concurrent extraction, and if an earlier writer tries to list it during such a window, it will see the file as missing.
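For context, here is a minimal sketch of the pattern as I understand it (the module, URL, variable names, and paths are illustrative assumptions, not the role's actual task):

```yaml
# Illustrative sketch only: every host delegates the same work to the
# controller, so this extraction runs concurrently N times for N hosts,
# all targeting the same destination path.
- name: Download and unpack the binary on the controller
  ansible.builtin.unarchive:
    src: "https://example.com/tool-{{ go_arch }}.tar.gz"
    dest: "/tmp/tool"
    remote_src: true
    list_files: true  # listing can land in the window where a sibling run has removed the dir
  delegate_to: localhost
```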
Besides the race, it's also inefficient to extract the same file N times.
I guess `run_once` would fix this if it weren't for the complication of multi-arch. Maybe serial execution plus skip-if-already-exists would work with little refactoring of the existing approach?
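That could look roughly like this (a sketch assuming the task uses `ansible.builtin.unarchive`; the `creates` path is a placeholder):

```yaml
# Sketch: serialize the task across hosts and skip it once the first
# run has produced the destination.
- name: Download and unpack the binary on the controller (serialized)
  ansible.builtin.unarchive:
    src: "https://example.com/tool-{{ go_arch }}.tar.gz"
    dest: "/tmp/tool"
    remote_src: true
    creates: "/tmp/tool/tool-{{ go_arch }}"  # no-op if an earlier host already extracted it
  delegate_to: localhost
  throttle: 1  # at most one host runs this task at a time
```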
I think this race existed before the big refactor in 0.20.0, but the refactor added `list_files`, which may be what triggers the specific failure above.
Actually, maybe `run_once` would work after all? The TASK text above indicates that it is called for a single binary (arm64 here), so perhaps all the per-host invocations would produce the same result?
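If that holds, something like this might be enough (again just a sketch, assuming every host in the play resolves to the same architecture):

```yaml
- name: Download and unpack the binary on the controller (once per play)
  ansible.builtin.unarchive:
    src: "https://example.com/tool-arm64.tar.gz"
    dest: "/tmp/tool"
    remote_src: true
  delegate_to: localhost
  run_once: true  # one invocation per play instead of one per host
```

If mixed architectures do occur, `run_once` combined with a loop over the distinct architectures in the play could cover that case, at the cost of a bit more refactoring.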