impy-project / chromo

Hadronic Interaction Model interface in PYthon
Other
30 stars 7 forks source link

Fix brittle lazy download of tables #144

Open HDembinski opened 1 year ago

HDembinski commented 1 year ago

I am still in favor of delaying downloading data tables for a model until that model is actually initialized for the first time, so that a user does not have to wait for the download of large tables for a model they never use. It takes some time to download the QGSJet tables and it is not unlikely that we add other models in the future with large tables.

However, we already know that there are issues when code is run in parallel and several instances try to download the same files concurrently, which is causing race conditions. We have that issue in our unit tests and I just got feedback from an early adopter who tried to run our notebook compare_models.ipynb on a fresh chromo installation and ran into that problem as well.

I see two options.

1) We make the lazy download less lazy. If we download all tables already when chromo is imported for the first time, it is likely that the tables are already there when the first concurrent code runs. This is not a guarantee, though, so it is not a clean solution. It also downloads all tables for all models, even if a user never needs one of those tables.

2) We properly handle the race condition. When a download is started by one process, other processes wait until the download is complete instead of starting their own. Typically, the approach is the following. The first process which attempts to download something creates a lock file until the download is completed, then the lock file is deleted. Other processes wait if the lock file is presented, until the download is complete.

This algorithm works, but only if the combined act of checking for the existance of a lock file and creating it otherwise is effectively an atomic operation. Otherwise, we still have potential a race condition, when two processes check for the existence of the lock file at the same time and find both it does not exist, and then both try to create the lock file and start the download. Fortunately, file creation is atomic on all platforms, so an algorithm would have to do the following. 1) If lock file exists, wait until download is complete by periodically checking whether the lock file still exists. When the lock file is not there anymore, check if the requested file exists. This could fail, because the other process that ran the download may have been aborted by an exception. Continue normally if the requested file exists or raise an exception at this point (trying again to download the file seems futile at this point, either there is an error which prevents the download (internet down, ...) or the user does not want to download the file. If the lock file does not exist, go to step 2. 2) Attempt to create lock file. If this succeeds, start download. Remove lock file when download is aborted (any exception occurs, including KeyboardInterrupt) or when the download is complete. If lock file creation fails, check again if the lock file exists. If it does, another process has created the lock file in the meantime; go back to step 1. If creation fails and the lock file does not exist, raise an exception. Something weird is going on, e.g. disk quota is exceeded or the user tries to write to a read-only file system.

@jncots Do you want to give this a try?