samvera / bulkrax

Bulk Import and Export For Samvera
Apache License 2.0
12 stars 21 forks source link

Improve CreateRelationshipsJob Performance #758

Closed cudevmaxwell closed 1 year ago

cudevmaxwell commented 1 year ago

In the csv_parser, each record is imported using it's own import job. In contrast, all the work of adding records to collections happens serially in CreateRelationshipsJob.

One way to parallelize this would be to refactor CreateRelationshipsJob into CreateRelationshipJob. Move the logic from CreateRelationshipsJob's perform into ScheduleRelationshipsJob's perform, and use CreateRelationshipsJob's process as CreateRelationshipJob's perform. Each CreateRelationshipJob would only be responsible for adding the relationship for one record, not all the records in an import.

Or, a new job called AddToCollectionJob could be written:

module Bulkrax
  class AddToCollectionJob < ApplicationJob
    queue_as :import

    # rubocop:disable Rails/SkipsModelValidations
    def perform(child_record, parent_record)
      child_record.member_of_collections << parent_record
      child_record.save!
    end
end

then rewrite CreateRelationshipsJob to schedule AddToCollectionJobs in process.

This would speed up record imports significantly.

cudevmaxwell commented 1 year ago

Proof of concept here: https://github.com/samvera-labs/bulkrax/compare/main...cudevmaxwell:bulkrax:main

_, child_record = find_record(child_record_id, importer_run_id)
_, parent_record = find_record(parent_record_id, importer_run_id)
child_record.member_of_collections << parent_record
child_record.save!

These 4 lines are still impressively slow, but at least it can happen in parallel.

cudevmaxwell commented 1 year ago

Next question: Why does child_record.save! cause so much indexing?

https://github.com/samvera-labs/samvera-nesting_indexer is doing a LOT of work. It should just be indexing the parent and the child.

It looks like, instead, it's reindexing all the children of the parent collection.

cudevmaxwell commented 1 year ago

Debug notes:

child_record.member_of_collections << parent_record

active_fedora/associations/builder/association.rb:

    94|         def member_of_collections(*args)
=>  95|           association(:member_of_collections).reader(*args)
    96|         end

active_fedora/associations/collection_proxy.rb:

   857|       def <<(*records)
=> 858|         proxy_association.concat(records) && self
   860|       alias push <<

active_fedora/associations/indirectly_contains_association.rb:

     5|     # See https://github.com/samvera/active_fedora/issues/1332
     6|     class IndirectlyContainsAssociation < ContainsAssociation #:nodoc:
     7|       # Add +records+ to this association.  Returns +self+ so method calls may be chained.
     8|       # Since << flattens its argument list and inserts each record, +push+ and +concat+ behave identically.
     9|       def concat(*records)
=>  10|         concat_records(records)

active_fedora/associations/collection_association.rb:

   156|
   157|       def concat_records(*records)
   158|         result = true
   159|
=> 161|           raise_on_type_mismatch!(record)
   162|           add_to_target(record) do |_r|
   163|             result &&= insert_record(record) unless owner.new_record?
   164|           end
   165|         end

active_fedora/associations/indirectly_contains_association.rb:

     9|       def concat(*records)
    10|         concat_records(records)
    11|       end
    12|
    13|       def insert_record(record, force = true, validate = true)
=>  14|         container.save!
    16|           record.save!
    17|         else
    18|           return false unless record.save(validate: validate)

Callback hell, out of time for today.

There's a very long pause after active_fedora/relation.rb's initialize returns. active_fedora/core.rb's Relation.new(self) is very slow.

jeremyf commented 1 year ago

@cudevmaxwell

Next question: Why does child_record.save! cause so much indexing?

https://github.com/samvera-labs/samvera-nesting_indexer is doing a LOT of work. It should just be indexing the parent and the child.

It looks like, instead, it's reindexing all the children of the parent collection.

The nesting_indexer is doing a lot of work and there has been back ported effort to remove that component in v3.4.x of Hyrax (see https://github.com/samvera/hyrax/commit/f23358d54d3f6273447de1959707cdd80928642c). Also there is https://rubygems.org/gems/hyrax-v2_graph_indexer which can be added to Hyrax 2.9.x repositories to eliminate that.

jeremyf commented 1 year ago

@cudevmaxwell another thing we are observing is related to jobs being slow; the same job can run inline and run fast but when brokered by multi-threaded Sidekiq, it slows down (see https://samvera.slack.com/archives/C0F9JQJDQ/p1678919763897689)

cudevmaxwell commented 1 year ago

Thanks for commenting @jeremyf. Maybe the 4.0.0 release of Hyrax would fix some of the performance issues.

I forked bulkrax and 'job-ified' the task of adding a child to a collection. That helped, but the over-indexing was killing performance.

Looking at Hyrax::Actors::CollectionsMembershipActor#add, to see what vanilla Hyrax does to add a record to a collection, I saw this:

collection.try(:reindex_extent=, Hyrax::Adapters::NestingIndexAdapter::LIMITED_REINDEX)

So I added that to the job as well:

# frozen_string_literal: true

module Bulkrax
  class AddToCollectionJob < ApplicationJob
    include DynamicRecordLookup
    queue_as :import

    # rubocop:disable Rails/SkipsModelValidations
    def perform(importer_run_id, child_record_id, parent_record_id)
      _, child_record = find_record(child_record_id, importer_run_id)
      _, parent_record = find_record(parent_record_id, importer_run_id)
      parent_record.try(:reindex_extent=, Hyrax::Adapters::NestingIndexAdapter::LIMITED_REINDEX)
      child_record.member_of_collections << parent_record
      child_record.save!
    end
    # rubocop:enable Rails/SkipsModelValidations
  end
end

Running this with a test import of 1000 records, we saw a huge improvement in performance. In the log, I'm still seeing reindex events which seem superfluous, but it's not a huge amount per import anymore.

It might be the case that keeping the Create Relationships job 'single threaded' might help with performance further, I still need to test that. But it seems that parent_record.try(:reindex_extent=, Hyrax::Adapters::NestingIndexAdapter::LIMITED_REINDEX) makes a huge difference.

cudevmaxwell commented 1 year ago

Testing a smaller change here: https://github.com/cu-library/bulkrax/commit/91c6be0e0d8adadface16a8db849d5408aea7860

jwhitney commented 1 year ago

Testing a smaller change here: cu-library@91c6be0

After this change, we imported 1000 works in ~ 2.5 hrs, vs CreateRelationships failing to complete for the same size of import in 3, 4 days in Hyrax v3.5 & Bulkrax v5.1.

An earlier fork that created a job per adding a relationship wasn't any faster than the single threaded solution.

cudevmaxwell commented 1 year ago

It's possible that setting HYRAX_USE_SOLR_GRAPH_NESTING in the env to 'true' might avoid this overindexing (without code changes) AND use the more efficient Solr graph queries from the upcoming 4.0.0 release. That's probably why no one else saw this performance hit after the latest Bulkrax release. They might already have that option enabled in their Hyrax 3.5.0-based IRs.

cudevmaxwell commented 1 year ago

Running a test import with HYRAX_USE_SOLR_GRAPH_NESTING = true with an unmodified Bulkrax resulted in a very speedy CreateRelationshipsJob. A 1000 item import completed in ~1.5 hours. It looks like we've found a solution which is also forwards compatible with the upcoming Hyrax 4.0.0. Closing.