usnistgov / oar-pdr

The NIST Open Access to Research (OAR) Public Data Repository (PDR) system software

Fix NERDm file locking bug #131

Closed RayPlante closed 4 years ago

RayPlante commented 4 years ago

In the python-based publishing code, reading and writing of NERDm metadata files are supposed to be protected in a multi-threaded/multi-processing application. In particular, some of the processing triggered by mdserver/preserver web service calls that result in updating metadata files is done asynchronously (i.e. in a separate thread) so that the web service calls can return quickly. Despite the file locking that's in place, we would still on occasion see file write collisions: the same metadata would get written twice sequentially into a file. This PR fixes that bug.

The file locking that was in place relied on python's lockf() function. It turns out that this function only provides protection across multiple processes. This is important for keeping the metadata and preservation services (which run in separate processes) from interfering with each other; however, it does not protect against colliding accesses across threads in the same python process.
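The per-process nature of lockf() can be seen in a small demonstration. This is a sketch, not code from this PR: the function name `same_process_relock_succeeds` is made up for illustration, and it assumes a POSIX platform where python's `fcntl` module is available. Two independent handles on the same file, opened in the same process, can both take a non-blocking exclusive lock without error, which is why threads sharing a process are not kept apart.

```python
import fcntl
import os
import tempfile


def same_process_relock_succeeds():
    """Return True if one process can take LOCK_EX twice on the same file.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as first, open(path, "w") as second:
            # First exclusive lock; LOCK_NB makes a conflict raise
            # instead of blocking.
            fcntl.lockf(first, fcntl.LOCK_EX | fcntl.LOCK_NB)
            try:
                # The same process asks again through a different handle.
                # POSIX record locks are owned by the process, so this
                # request is granted rather than refused.
                fcntl.lockf(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                return False
            return True
    finally:
        os.remove(path)
```

A second process attempting the same non-blocking lock while the first still held it would get an OSError instead.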

The pdr.utils module was updated to add a new LockedFile class that provides both multiprocess locking (provided by lockf()) and multithread locking (provided by the lock classes available in the python threading module). It supports both shared locks, which allow unlimited simultaneous reads, and exclusive locks, which ensure only one process/thread can access a file during a write. This new locking mechanism was incorporated into read_json() and write_json().
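The general idea of pairing the two kinds of lock can be sketched as follows. This is not the LockedFile implementation from pdr.utils: the names `SketchLockedFile`, `read_json_sketch` and `write_json_sketch` are hypothetical, and the sketch is simplified in that its thread lock is always exclusive, so reads in different threads are serialized rather than shared. It assumes a POSIX platform with the `fcntl` module.

```python
import fcntl
import json
import os
import threading
from collections import defaultdict

# Hypothetical registry holding one thread lock per absolute file path.
_registry_guard = threading.Lock()
_thread_locks = defaultdict(threading.Lock)


def _thread_lock_for(path):
    with _registry_guard:
        return _thread_locks[os.path.abspath(path)]


class SketchLockedFile:
    """Context manager that holds a thread lock and a lockf() lock."""

    def __init__(self, path, write=False):
        self.path = path
        self.write = write
        self._tlock = _thread_lock_for(path)
        self._fd = None

    def __enter__(self):
        self._tlock.acquire()          # keeps out threads in this process
        try:
            if self.write:
                # Open without truncating; truncate only once the
                # exclusive process-level lock is held.
                self._fd = open(self.path, "a+")
                fcntl.lockf(self._fd, fcntl.LOCK_EX)
                self._fd.truncate(0)
            else:
                self._fd = open(self.path, "r")
                fcntl.lockf(self._fd, fcntl.LOCK_SH)
        except Exception:
            if self._fd is not None:
                self._fd.close()
            self._tlock.release()
            raise
        return self._fd

    def __exit__(self, exc_type, exc, tb):
        try:
            self._fd.flush()
            fcntl.lockf(self._fd, fcntl.LOCK_UN)
            self._fd.close()
        finally:
            self._tlock.release()
        return False


def write_json_sketch(data, path):
    with SketchLockedFile(path, write=True) as fd:
        json.dump(data, fd, indent=4)


def read_json_sketch(path):
    with SketchLockedFile(path) as fd:
        return json.load(fd)
```

The thread lock is taken first and released last, so that at most one thread per process ever holds the lockf() lock on a given file. Supporting truly shared reads across threads would need a reader/writer lock built from threading primitives, which the sketch leaves out.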

RayPlante commented 4 years ago

The bug is difficult to replicate reliably, which makes testing the change tricky, but I was able to replicate it on my own platform under oar-docker. I am bypassing normal review to allow for testing with MIDAS on testdata.

RayPlante commented 4 years ago

Replicated bug on datapubtest; could not replicate after applying this fix.