imperialCHEPI / healthgps

Global Health Policy Simulation model (Health-GPS)
https://imperialchepi.github.io/healthgps/
BSD 3-Clause "New" or "Revised" License
5 stars 1 forks source link

Fix race conditions when extracting zip files #456

Closed alexdewar closed 5 days ago

alexdewar commented 1 week ago

Before extracting a zip file, we check whether we already have the contents of that file extracted to a cache folder and, if so, skip the extraction.

While this is safe to do if there is only one HealthGPS process running, if there are multiple processes attempting to extract the zip file at the same time, there is a potential race condition (#454):

  1. Process A observes that the cache folder doesn't exist, creates it and begins extraction
  2. Process B observes that the cache folder exists, tries to use it before extraction is complete and crashes
  3. Process A completes extraction

Note that while this sounds unlikely, it is entirely possible that this could happen with multiple jobs running on the HPC (particularly given that filesystem access is often slow on the Imperial HPC).

Another possible bug is that a HealthGPS could terminate before extraction is complete, in which case, again, the cache folder will exist, but only contain a subset of the data files. In this case, the only remedy would be to delete the cache folder -- not particularly user friendly.

This PR solves both of these problems by performing extraction to a temporary folder (with a unique name) before renaming the folder. While moving a folder is not guaranteed to be atomic because it may involve copying data between devices, it seems that renaming them should be (or at least it is on POSIX, which will fix the HPC case).

Fixes #454.

alexdewar commented 5 days ago

@TinyMarsh Ordinarily that would make sense, but the weird thing about what we're doing here is that we want to be able to move the temporary folder to its final destination atomically (i.e. so that another HPC job doesn't try to read the cache folder when it's only got half the files in) and moving a folder will only be atomic if it's in the same partition, which is why we do it in the same cache folder (e.g. on Linux /tmp is usually a ramdisk, which means the files would have to be copied).