samvera / hyrax

Hyrax is a Ruby on Rails Engine built by the Samvera community. Hyrax provides a foundation for creating many different digital repository applications.
http://hyrax.samvera.org/
Apache License 2.0
183 stars 124 forks source link

Locking in the context of asynchronous jobs #1308

Open atz opened 7 years ago

atz commented 7 years ago

Background

We have a large number of (potentially) parallel and asynchronous jobs around file ingest and processing. Indeed, parallel and asynchronous processing is the point of a Job/worker infrastructure. But many processes trying to write data to the same blob/object/row introduces data integrity problems. A .save from an older in-memory instance of an object overwrites recent updates. (I suspect this relates to what Hyku is seeing with multiple files not attaching.)

So at some level we know that transactionality is a problem, and not just our usual problem of active-fedora not being able to provide transactionality over fedora and solr, but basic "many updaters to the same object" transactionality.

Locking, Current State

ActiveFedora

Correct me if I'm wrong, but my understanding of ActiveFedora is that although Solr provides functionality for locking via document __version__ and Fedora supports ETags, nothing in ActiveFedora makes use of this by default, or in our current ActiveFedora configuration. Where WorkBehavior merely exposes the LDP etag, it has a comment: TODO: Move this into ActiveFedora. So to the extent we do any locking, it is outside ActiveFedora.

Actor stack

In the Actor stack, some attention was paid to locking, using a homegrown OptimisticLockValidator, not ActiveRecord's optimistic locking. However, as a comment attests, Caveat: we are not detecting if the version is changed by a different process between the time this validator is run and when the object is saved. So it is distinctly unsuitable for the job context.

The Actor stack also has pessimistic ActiveRecord::Base.transaction locks that wrap entire chains of other actors. But since actors also update fedora and solr objects without being able to fit them in the same transaction architecture, these are potentially counterproductive. Fine for happy path, 😭 for Samvera.

Redis locking

Given that Actor stack and AR locks were unsuitable for the problem of parallel writes to fedora, we created redis locks via Hyrax::Lockable and Hyrax::LockManager. But we only use them in one place, Hyrax::Actors::FileSetActors#attach_to_work:

  def attach_to_work(work, file_set_params = {})
    acquire_lock_for(work.id) do
      # Ensure we have an up-to-date copy of the members association, so that we append to the end of the list.
      work.reload unless work.new_record?
      copy_visibility(work, file_set) unless assign_visibility?(file_set_params)
      work.ordered_members << file_set
      set_representative(work, file_set)
      set_thumbnail(work, file_set)
      # Save the work so the association between the work and the file_set is persisted (head_id)
      # NOTE: the work may not be valid, in which case this save doesn't do anything.
      work.save
      Hyrax.config.callback.run(:after_create_fileset, file_set, user)
    end
  end

While it is clear enough to me that case makes sense, it seems really weird that we basically treat every other operation, including everything done by async jobs as though it'll all just be fine.

It's Not Fine

Proposed actions

We should review everything that happens inside the aperture of async jobs and make sure each operation has the correct flavor(s) of locking applied. I don't expect to resolve the active-fedora transactionality issue, but we can definitely improve our logic.

jcoyne commented 7 years ago

ActiveFedora objects are locked using etags using the OptimisticLockValidator. You suggested we should just use ActiveRecord's optimistic locking, but that's not an option given that these are ActiviveFedora objects.

The locking you pointed to in CC, could be good to add here, but if I recall correctly that was only required when adding multiple Files per FileSet as UCSB is doing here: https://github.com/curationexperts/alexandria/blob/7e6dfafefb2cd9bc17aa956208e8a86495e5b1eb/app/models/file_set.rb#L5

The Redis lock is used when we need a global lock on a particular object. Specifically when we are appending to the end of a linked list. This is how FileSets are stored under a work, if we didn't lock they could be added as a tree instead of a linked list.

atz commented 7 years ago

@jcoyne: I didn't recommend using AR optimistic locking, I characterized the optimistic locking we do use as non-AR. You restate several other things I described, including existing limitations, but it is unclear why. Are you saying:

That is the point of this issue. My interpretation is "none of the above".

Needing a global lock for an object is true for any write operation in a parallelized context. If attaching a fileset to a work is important enough, why wouldn't other operations be also? Are you saying that is the only place we have a linked list in play?

I see CreateWorkJob uses Hyrax::CurationConcern.actor, which includes OptimisticLockValidator. Hyrax::WorksControllerBehavior uses it too. But I don't see where any other jobs or controllers use the same thing. For whatever reason, we have separate file_set_create_actor and file_set_update_actor that ignore the user/Engine-configured actor_factory and thereby don't get the locking. We have several different approaches that don't work together and none even claims to solve the whole problem.