atz opened this issue 7 years ago
ActiveFedora objects are locked via ETags using the `OptimisticLockValidator`. You suggested we should just use ActiveRecord's optimistic locking, but that's not an option, given that these are ActiveFedora objects.
The locking you pointed to in CC could be good to add here, but if I recall correctly, that was only required when adding multiple Files per FileSet, as UCSB is doing here: https://github.com/curationexperts/alexandria/blob/7e6dfafefb2cd9bc17aa956208e8a86495e5b1eb/app/models/file_set.rb#L5
The Redis lock is used when we need a global lock on a particular object, specifically when we are appending to the end of a linked list. This is how FileSets are stored under a work; if we didn't lock, they could be added as a tree instead of a linked list.
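The branching failure mode is easy to see in miniature. Below is a toy model (plain Ruby, not ActiveFedora's actual proxy classes) of a list whose "next" links are stored as multi-valued, RDF-style triples; two unlocked writers read the same tail, each attaches to it, and the list degenerates into a tree:

```ruby
require "set"

# Toy model: each node may carry "next" links stored as a set of triples.
# A well-formed linked list has exactly one successor per node.
successors = Hash.new { |h, k| h[k] = Set.new }
tail = :last_proxy

# Two unlocked writers each read the current tail before either appends.
tail_seen_by_a = tail
tail_seen_by_b = tail

# Each writer links its new proxy node under the tail it saw.
successors[tail_seen_by_a] << :proxy_for_file_set_a
successors[tail_seen_by_b] << :proxy_for_file_set_b

successors[tail].size  # => 2: the tail branches, so the "list" is now a tree
```

With the global lock held, the second writer would re-read the tail after the first append and chain onto `:proxy_for_file_set_a` instead.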
@jcoyne: I didn't recommend using AR optimistic locking; I characterized the optimistic locking we do use as non-AR. You restate several other things I described, including existing limitations, but it is unclear why. Are you saying `OptimisticLockValidator` usage by jobs is sufficient? That is the point of this issue. My interpretation is "none of the above".
Needing a global lock for an object is true for any write operation in a parallelized context. If attaching a FileSet to a work is important enough to lock, why wouldn't other operations be as well? Are you saying that is the only place we have a linked list in play?
I see `CreateWorkJob` uses `Hyrax::CurationConcern.actor`, which includes `OptimisticLockValidator`. `Hyrax::WorksControllerBehavior` uses it too. But I don't see where any other jobs or controllers use the same thing. For whatever reason, we have separate `file_set_create_actor` and `file_set_update_actor` that ignore the user/Engine-configured `actor_factory` and thereby don't get the locking. We have several different approaches that don't work together, and none even claims to solve the whole problem.
Background
We have a large number of (potentially) parallel and asynchronous jobs around file ingest and processing. Indeed, parallel and asynchronous processing is the point of a Job/worker infrastructure. But many processes trying to write data to the same blob/object/row introduces data integrity problems. A `.save` from an older in-memory instance of an object overwrites recent updates. (I suspect this relates to what Hyku is seeing with multiple files not attaching.)

So at some level we know that transactionality is a problem, and not just our usual problem of active-fedora being unable to provide transactionality over Fedora and Solr, but basic "many updaters to the same object" transactionality.
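The stale-`.save` lost update looks like this in miniature (a toy hash standing in for the persisted object; the job names are illustrative):

```ruby
# Toy persisted record; a stand-in for an object's stored attributes.
store = { title: ["Original"] }

# Job A loads an in-memory copy, then stalls in the queue.
a_copy = store.dup

# Job B loads, updates, and saves first.
b_copy = store.dup
b_copy[:title] = ["Updated by B"]
store = b_copy

# Job A finally runs its equivalent of .save on the stale copy.
store = a_copy

store[:title]  # => ["Original"], B's write is silently lost
```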
Locking, Current State
ActiveFedora
Correct me if I'm wrong, but my understanding of ActiveFedora is that although Solr provides locking functionality via the document `_version_` field, and Fedora supports ETags, nothing in ActiveFedora makes use of either by default, or in our current ActiveFedora configuration. Where `WorkBehavior` merely exposes the LDP ETag, it carries the comment `TODO: Move this into ActiveFedora`. So to the extent we do any locking, it happens outside ActiveFedora.
Actor stack
In the Actor stack, some attention was paid to locking, using a homegrown `OptimisticLockValidator`, not ActiveRecord's optimistic locking. However, as a comment attests: "Caveat: we are not detecting if the version is changed by a different process between the time this validator is run and when the object is saved." So it is distinctly unsuitable for the job context.

The Actor stack also has pessimistic `ActiveRecord::Base.transaction` locks that wrap entire chains of other actors. But since actors also update Fedora and Solr objects that cannot participate in the same transaction, these are potentially counterproductive. Fine for the happy path, 😭 for Samvera.
Redis locking
Given that the Actor stack and AR locks were unsuitable for the problem of parallel writes to Fedora, we created Redis locks via `Hyrax::Lockable` and `Hyrax::LockManager`. But we use them in only one place, `Hyrax::Actors::FileSetActor#attach_to_work`.

While it is clear enough to me that that case makes sense, it seems really weird that we basically treat every other operation, including everything done by async jobs, as though it'll all just be fine.
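For reference, the shape of that one locked path can be sketched with a toy in-process manager standing in for `Hyrax::LockManager` (the real one uses Redis, so the lock holds across processes; a `Monitor` only covers threads in one process, and the method names here are illustrative, not Hyrax's API):

```ruby
require "monitor"

# Toy stand-in for Hyrax::LockManager: per-key mutual exclusion.
class ToyLockManager
  def initialize
    @registry_lock = Monitor.new
    @monitors = Hash.new { |h, k| h[k] = Monitor.new }
  end

  # Run the block while holding the lock for `key`.
  def lock(key)
    monitor = @registry_lock.synchronize { @monitors[key] }
    monitor.synchronize { yield }
  end
end

LOCK_MANAGER = ToyLockManager.new

# The attach_to_work pattern: serialize appends for a given work so the
# member list only ever grows from a single tail.
def attach_to_work(work, file_set_id)
  LOCK_MANAGER.lock("append/#{work[:id]}") do
    work[:members] << file_set_id
  end
end

work = { id: "wk1", members: [] }
threads = %w[fs1 fs2 fs3].map { |fs| Thread.new { attach_to_work(work, fs) } }
threads.each(&:join)
work[:members].sort  # => ["fs1", "fs2", "fs3"]
```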
It's Not Fine
`create_content` has a race condition over what the default FileSet label and title end up being: it is invoked from 2 separate async job classes, with potentially thousands of simultaneously queued job instances. Without locking, it is entirely possible that separate processes each (over)write their own default values. Maybe that doesn't matter much, but it is a simple case that illustrates the broader problem.
Proposed actions
We should review everything that happens inside the aperture of async jobs and make sure each operation has the correct flavor(s) of locking applied. I don't expect to resolve the active-fedora transactionality issue, but we can definitely improve our logic.
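One flavor that review might reach for is conditional, ETag-gated saves: Fedora already exposes ETags, and the usual HTTP pattern gates writes with `If-Match`, so a stale writer gets a conflict and retries instead of silently clobbering. A toy sketch of that shape, with all names hypothetical:

```ruby
# Toy ETag-gated save: a stand-in for an If-Match conditional PUT.
class ConflictError < StandardError; end

STORE = { etag: "v1", attrs: { title: "original" } }

def conditional_save(expected_etag, attrs)
  # Reject the write unless the caller's etag is still current.
  raise ConflictError unless STORE[:etag] == expected_etag
  STORE[:attrs] = attrs
  STORE[:etag] = STORE[:etag].succ  # "v1" -> "v2" -> "v3" ...
end

conditional_save("v1", { title: "first update" })

begin
  # A writer holding the stale etag is rejected, not silently merged.
  conditional_save("v1", { title: "stale update" })
rescue ConflictError
  # The caller reloads, reapplies its change, and retries.
  conditional_save(STORE[:etag], { title: "retried update" })
end

STORE[:attrs][:title]  # => "retried update"
```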