sot / astromon

Astrometric accuracy (celestial location) monitor
0 stars 0 forks source link

File locking is not helping and opening with r+ permissions can fail #21

Closed taldcroft closed 2 years ago

taldcroft commented 2 years ago

While evaluating the cross-matches PR, I found myself trying to understand the code that reads the table file.

The file locking mechanism requires writing a lock file into the same directory as the astromon.h5 file, which will always fail on HEAD or GRETA. I spent a while understanding why it was working at all until I noticed that it is catching a permissions error and just ignoring locking anyway. But basically HEAD is the only situation where locking might actually be helpful during the (very rare) cases of trying to read at the same time as the database is being updated.

The standard mitigation for conflicts like this is ska_helpers.retry to re-try a read if it fails. While this is far from elegant, it works well enough in situations like this (very-low traffic application).

Finally, even when things are working fine, I noticed that filelock creates and then leaves a .astromon.h5.lock file around. Don't know what is going on but this seems untidy.

javierggt commented 2 years ago

About creating the lock file, you just answered your question:

The question is if there is an actual case in which one gets an undesired result. Did you get a failure in some case?

taldcroft commented 2 years ago

If you do not have write access, the file is not created, but you are not going to write to the file anyway, so there is no lock.

The historical point of contention is an aca cron job writing at the same time a user application is reading. But if a user is in the middle of reading from the file and cannot make a lock and then the cron job comes along and creates a lock file and starts writing, there is a chance of an exception.

Since the current locking implementation does not robustly prevent the problem, the suggestion is to just to apply a mitigation (run the cron update process when users are very unlikely to be reading this database) and then simplify the code by removing all the locking entirely and use a simple retry strategy. We have a lot of experience with contention issues like this and I don't recall the HDF5 file ever actually getting corrupted.

javierggt commented 2 years ago

I am already removing the file lock, but for the record: that specific failure you list is a case we normally do not get.

The code first tries to set the lock. If it fails, then no lock is created. This is ok if the file permission restriction is on the directory. If the code does not have write access to the directory, it can't modify the file, so there is no need to lock it. If some other code is modifying the file, then it has write access to the directory and would have created the lock, so everything is fine.

In your example, the user has write access to the directory but not the actual file. In this case the code fails as you showed.

The way I see it, it is a matter of taste. Both approaches can fail in principle. What I implemented would not fail in our standard setup.

taldcroft commented 2 years ago

The scenario I was suggesting is below. Here user is a normal user with no write permissions, while aca is the aca user that can write.

I agree that if aca starts writing first, then the lock is effective in preventing contention, so perhaps locking here solves half the problem.

There is also some historical knowledge that file locking on NFS systems using flock may not work anyway, but this was something that was discussed 10 years ago and I don't know if it was solved since then.

taldcroft commented 2 years ago

In your example, the user has write access to the directory but not the actual file. In this case the code fails as you showed.

I made a mistake by opening an issue with two issues (1) file locking and (2) opening astromon.h5 with r+ access. I think your comment above is referring to the second one? That is now #24.