Closed taldcroft closed 2 years ago
About creating the lock file, you just answered your question:
The question is if there is an actual case in which one gets an undesired result. Did you get a failure in some case?
If you do not have write access, the file is not created, but you are not going to write to the file anyway, so there is no lock.
The historical point of contention is an aca
cron job writing at the same time a user application is reading. But if a user is in the middle of reading from the file and cannot make a lock and then the cron job comes along and creates a lock file and starts writing, there is a chance of an exception.
Since the current locking implementation does not robustly prevent the problem, the suggestion is to just to apply a mitigation (run the cron update process when users are very unlikely to be reading this database) and then simplify the code by removing all the locking entirely and use a simple retry strategy. We have a lot of experience with contention issues like this and I don't recall the HDF5 file ever actually getting corrupted.
I am already removing the file lock, but for the record: that specific failure you list is a case we normally do not get.
The code first tries to set the lock. If it fails, then no lock is created. This is ok if the file permission restriction is on the directory. If the code does not have write access to the directory, it can't modify the file, so there is no need to lock it. If some other code is modifying the file, then it has write access to the directory and would have created the lock, so everything is fine.
In your example, the user has write access to the directory but not the actual file. In this case the code fails as you showed.
The way I see it, it is a matter of taste. Both approaches can fail in principle. What I implemented would not fail in our standard setup.
The scenario I was suggesting is below. Here user
is a normal user with no write permissions, while aca
is the aca user that can write.
user
initiates a read and opens the file for reading
aca
initiates an update process and opens the file for writing while the file is still open by user
.
aca
but user
is already has the file open for reading so the locking has not had any effect.aca
process is writing to the file at the same time user
is reading the file, with potential contention issues.I agree that if aca
starts writing first, then the lock is effective in preventing contention, so perhaps locking here solves half the problem.
There is also some historical knowledge that file locking on NFS systems using flock
may not work anyway, but this was something that was discussed 10 years ago and I don't know if it was solved since then.
In your example, the user has write access to the directory but not the actual file. In this case the code fails as you showed.
I made a mistake by opening an issue with two issues (1) file locking and (2) opening astromon.h5 with r+ access. I think your comment above is referring to the second one? That is now #24.
While evaluating the cross-matches PR, I found myself trying to understand the code that reads the table file.
The file locking mechanism requires writing a lock file into the same directory as the
astromon.h5
file, which will always fail on HEAD or GRETA. I spent a while understanding why it was working at all until I noticed that it is catching a permissions error and just ignoring locking anyway. But basically HEAD is the only situation where locking might actually be helpful during the (very rare) cases of trying to read at the same time as the database is being updated.The standard mitigation for conflicts like this is
ska_helpers.retry
to re-try a read if it fails. While this is far from elegant, it works well enough in situations like this (very-low traffic application).Finally, even when things are working fine, I noticed that
filelock
creates and then leaves a.astromon.h5.lock
file around. Don't know what is going on but this seems untidy.