pcdshub / happi

Heuristic Access to Positions of Photon Instruments
https://pcdshub.github.io/happi/master
Other
13 stars 29 forks source link

BUG: Interrupt during save operation can corrupt database #301

Closed tangkong closed 1 year ago

tangkong commented 1 year ago

Expected Behavior

Saving should always at least return the database to a readable state, but preferably complete any atomic operation before terminating.

Current Behavior

If a save operation is interrupted, the current save operation will not be completed. If the interruption is caught and save() is called again, the database can be restored. (this is the solution used in #300, but should be replaced when a more general solution is found)

The below seems to work:

try:
    result.item.save()
except KeyboardInterrupt:
    result.item.save()
    break

Possible Solution

Context managers that catch all exceptions and properly close out the files? In any case we need to make the save operation safer, and more tolerant of interruption. It's possible that something like the above try-except inside the backend .save() method could work, but that's speculation on my part

Steps to Reproduce (for bugs)

Start a sequence of save actions using the happi client, even without edits to an item (though last edit date will change) Ctrl+C / KeyboardInterrupt / SigInt Depending on timing the database may be in various states of completeness.

image

Here happi repair is run on all the devices with debug output. The keyboard interrupt occurs after the backend .update() method appears to resolve, but it's clear the file writing is not complete. The diff in the json database is shown on the left. Changes at the top are all valid last_edited modifications, while the bottom of the database is arbitrarily truncated.

Context

First uncovered when working on happi repair #300, but has existed for a while.

Your Environment

pcds-5.6.0

klauer commented 1 year ago

I think:

ZLLentz commented 1 year ago

Building a second file and moving/copying it on top of the original is the normal way to do atomic updates safely. This affects only the json backend, correct?

tangkong commented 1 year ago

I've never tested a different backend, but I'd (maybe too optimistically) expect a mongo backend to behave better

klauer commented 1 year ago

It's the database's job to handle atomic updates coherently/safely - so yes, our JSON backend should be the only one affected.

KeyboardInterrupt could stop a transaction from taking place with mongo or failing to get if the insert/update succeeded (but that's not fatal)