atz opened this issue 7 years ago
ActiveFedora objects are locked via ETags using the `OptimisticLockValidator`. You suggested we should just use ActiveRecord's optimistic locking, but that's not an option, given that these are ActiveFedora objects.
The locking you pointed to in CC could be good to add here, but if I recall correctly, that was only required when adding multiple Files per FileSet, as UCSB is doing here: https://github.com/curationexperts/alexandria/blob/7e6dfafefb2cd9bc17aa956208e8a86495e5b1eb/app/models/file_set.rb#L5
The Redis lock is used when we need a global lock on a particular object, specifically when we are appending to the end of a linked list. This is how FileSets are stored under a work; if we didn't lock, they could be added as a tree instead of a linked list.
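The branching failure mode is easy to see in miniature. Below is a toy model (plain Ruby, not ActiveFedora's actual proxy classes) of a list whose "next" links are stored as multi-valued, RDF-style triples; two unlocked writers read the same tail, each attaches to it, and the list degenerates into a tree:

```ruby
require "set"

# Toy model: each node may carry "next" links stored as a set of triples.
# A well-formed linked list has exactly one successor per node.
successors = Hash.new { |h, k| h[k] = Set.new }
tail = :last_proxy

# Two unlocked writers each read the current tail before either appends.
tail_seen_by_a = tail
tail_seen_by_b = tail

# Each writer links its new proxy node under the tail it saw.
successors[tail_seen_by_a] << :proxy_for_file_set_a
successors[tail_seen_by_b] << :proxy_for_file_set_b

successors[tail].size  # => 2: the tail branches, so the "list" is now a tree
```

With the global lock held, the second writer would re-read the tail after the first append and chain onto `:proxy_for_file_set_a` instead.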
@jcoyne: I didn't recommend using AR optimistic locking; I characterized the optimistic locking we do use as non-AR. You restate several other things I described, including existing limitations, but it is unclear why. Are you saying `OptimisticLockValidator` usage by jobs is sufficient? That is the point of this issue. My interpretation is "none of the above".
Needing a global lock for an object is true for any write operation in a parallelized context. If attaching a FileSet to a work is important enough to lock, why wouldn't other operations be as well? Are you saying that is the only place we have a linked list in play?
I see `CreateWorkJob` uses `Hyrax::CurationConcern.actor`, which includes `OptimisticLockValidator`. `Hyrax::WorksControllerBehavior` uses it too. But I don't see where any other jobs or controllers use the same thing. For whatever reason, we have separate `file_set_create_actor` and `file_set_update_actor` that ignore the user/Engine-configured `actor_factory` and thereby don't get the locking. We have several different approaches that don't work together, and none even claims to solve the whole problem.
Background
We have a large number of (potentially) parallel and asynchronous jobs around file ingest and processing. Indeed, parallel and asynchronous processing is the point of a Job/worker infrastructure. But many processes trying to write data to the same blob/object/row introduces data integrity problems. A `.save` from an older in-memory instance of an object overwrites recent updates. (I suspect this relates to what Hyku is seeing with multiple files not attaching.)

So at some level we know that transactionality is a problem, and not just our usual problem of active-fedora being unable to provide transactionality over Fedora and Solr, but basic "many updaters to the same object" transactionality.
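The stale-`.save` lost update looks like this in miniature (a toy hash standing in for the persisted object; the job names are illustrative):

```ruby
# Toy persisted record; a stand-in for an object's stored attributes.
store = { title: ["Original"] }

# Job A loads an in-memory copy, then stalls in the queue.
a_copy = store.dup

# Job B loads, updates, and saves first.
b_copy = store.dup
b_copy[:title] = ["Updated by B"]
store = b_copy

# Job A finally runs its equivalent of .save on the stale copy.
store = a_copy

store[:title]  # => ["Original"], B's write is silently lost
```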
Locking, Current State
ActiveFedora
Correct me if I'm wrong, but my understanding of ActiveFedora is that although Solr provides locking functionality via the document `_version_` field, and Fedora supports ETags, nothing in ActiveFedora makes use of either by default, or in our current ActiveFedora configuration. Where `WorkBehavior` merely exposes the LDP ETag, it carries the comment `TODO: Move this into ActiveFedora`. So to the extent we do any locking, it happens outside ActiveFedora.
Actor stack
In the Actor stack, some attention was paid to locking, using a homegrown `OptimisticLockValidator`, not ActiveRecord's optimistic locking. However, as a comment attests: "Caveat: we are not detecting if the version is changed by a different process between the time this validator is run and when the object is saved." So it is distinctly unsuitable for the job context.

The Actor stack also has pessimistic `ActiveRecord::Base.transaction` locks that wrap entire chains of other actors. But since actors also update Fedora and Solr objects that cannot participate in the same transaction, these are potentially counterproductive. Fine for the happy path, 😭 for Samvera.
Redis locking
Given that the Actor stack and AR locks were unsuitable for the problem of parallel writes to Fedora, we created Redis locks via `Hyrax::Lockable` and `Hyrax::LockManager`. But we use them in only one place, `Hyrax::Actors::FileSetActor#attach_to_work`.

While it is clear enough to me that that case makes sense, it seems really weird that we basically treat every other operation, including everything done by async jobs, as though it'll all just be fine.
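For reference, the shape of that one locked path can be sketched with a toy in-process manager standing in for `Hyrax::LockManager` (the real one uses Redis, so the lock holds across processes; a `Monitor` only covers threads in one process, and the method names here are illustrative, not Hyrax's API):

```ruby
require "monitor"

# Toy stand-in for Hyrax::LockManager: per-key mutual exclusion.
class ToyLockManager
  def initialize
    @registry_lock = Monitor.new
    @monitors = Hash.new { |h, k| h[k] = Monitor.new }
  end

  # Run the block while holding the lock for `key`.
  def lock(key)
    monitor = @registry_lock.synchronize { @monitors[key] }
    monitor.synchronize { yield }
  end
end

LOCK_MANAGER = ToyLockManager.new

# The attach_to_work pattern: serialize appends for a given work so the
# member list only ever grows from a single tail.
def attach_to_work(work, file_set_id)
  LOCK_MANAGER.lock("append/#{work[:id]}") do
    work[:members] << file_set_id
  end
end

work = { id: "wk1", members: [] }
threads = %w[fs1 fs2 fs3].map { |fs| Thread.new { attach_to_work(work, fs) } }
threads.each(&:join)
work[:members].sort  # => ["fs1", "fs2", "fs3"]
```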
It's Not Fine
`create_content` has a race condition over what the default FileSet label and title end up being: it is invoked from 2 separate async job classes, with potentially thousands of simultaneously queued job instances. Without locking, it is entirely possible that separate processes each (over)write their own default values. Maybe that doesn't matter much, but it is a simple case that illustrates the broader problem.
Proposed actions
We should review everything that happens inside the aperture of async jobs and make sure each operation has the correct flavor(s) of locking applied. I don't expect to resolve the active-fedora transactionality issue, but we can definitely improve our logic.
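One flavor that review might reach for is conditional, ETag-gated saves: Fedora already exposes ETags, and the usual HTTP pattern gates writes with `If-Match`, so a stale writer gets a conflict and retries instead of silently clobbering. A toy sketch of that shape, with all names hypothetical:

```ruby
# Toy ETag-gated save: a stand-in for an If-Match conditional PUT.
class ConflictError < StandardError; end

STORE = { etag: "v1", attrs: { title: "original" } }

def conditional_save(expected_etag, attrs)
  # Reject the write unless the caller's etag is still current.
  raise ConflictError unless STORE[:etag] == expected_etag
  STORE[:attrs] = attrs
  STORE[:etag] = STORE[:etag].succ  # "v1" -> "v2" -> "v3" ...
end

conditional_save("v1", { title: "first update" })

begin
  # A writer holding the stale etag is rejected, not silently merged.
  conditional_save("v1", { title: "stale update" })
rescue ConflictError
  # The caller reloads, reapplies its change, and retries.
  conditional_save(STORE[:etag], { title: "retried update" })
end

STORE[:attrs][:title]  # => "retried update"
```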