Closed cudevmaxwell closed 1 year ago
Proof of concept here: https://github.com/samvera-labs/bulkrax/compare/main...cudevmaxwell:bulkrax:main
_, child_record = find_record(child_record_id, importer_run_id)
_, parent_record = find_record(parent_record_id, importer_run_id)
child_record.member_of_collections << parent_record
child_record.save!
These 4 lines are still impressively slow, but at least it can happen in parallel.
Next question: Why does child_record.save!
cause so much indexing?
https://github.com/samvera-labs/samvera-nesting_indexer is doing a LOT of work. It should just be indexing the parent and the child.
It looks like, instead, it's reindexing all the children of the parent collection.
Debug notes:
child_record.member_of_collections << parent_record
active_fedora/associations/builder/association.rb
:
94| def member_of_collections(*args)
=> 95| association(:member_of_collections).reader(*args)
96| end
active_fedora/associations/collection_proxy.rb
:
857| def <<(*records)
=> 858| proxy_association.concat(records) && self
860| alias push <<
active_fedora/associations/indirectly_contains_association.rb
:
5| # See https://github.com/samvera/active_fedora/issues/1332
6| class IndirectlyContainsAssociation < ContainsAssociation #:nodoc:
7| # Add +records+ to this association. Returns +self+ so method calls may be chained.
8| # Since << flattens its argument list and inserts each record, +push+ and +concat+ behave identically.
9| def concat(*records)
=> 10| concat_records(records)
active_fedora/associations/collection_association.rb
:
156|
157| def concat_records(*records)
158| result = true
159|
=> 161| raise_on_type_mismatch!(record)
162| add_to_target(record) do |_r|
163| result &&= insert_record(record) unless owner.new_record?
164| end
165| end
active_fedora/associations/indirectly_contains_association.rb
:
9| def concat(*records)
10| concat_records(records)
11| end
12|
13| def insert_record(record, force = true, validate = true)
=> 14| container.save!
16| record.save!
17| else
18| return false unless record.save(validate: validate)
Callback hell, out of time for today.
There's a very long pause after active_fedora/relation.rb's initialize returns. active_fedora/core.rb
's Relation.new(self)
is very slow.
@cudevmaxwell
Next question: Why does child_record.save! cause so much indexing?
https://github.com/samvera-labs/samvera-nesting_indexer is doing a LOT of work. It should just be indexing the parent and the child.
It looks like, instead, it's reindexing all the children of the parent collection.
The nesting_indexer is doing a lot of work and there has been back ported effort to remove that component in v3.4.x of Hyrax (see https://github.com/samvera/hyrax/commit/f23358d54d3f6273447de1959707cdd80928642c). Also there is https://rubygems.org/gems/hyrax-v2_graph_indexer which can be added to Hyrax 2.9.x repositories to eliminate that.
@cudevmaxwell another thing we are observing is related to jobs being slow; the same job can run inline and run fast but when brokered by multi-threaded Sidekiq, it slows down (see https://samvera.slack.com/archives/C0F9JQJDQ/p1678919763897689)
Thanks for commenting @jeremyf. Maybe the 4.0.0 release of Hyrax would fix some of the performance issues.
I forked bulkrax and 'job-ified' the task of adding a child to a collection. That helped, but the over-indexing was killing performance.
Looking at Hyrax::Actors::CollectionsMembershipActor#add
, to see what vanilla Hyrax does to add a record to a collection, I saw this:
collection.try(:reindex_extent=, Hyrax::Adapters::NestingIndexAdapter::LIMITED_REINDEX)
So I added that to the job as well:
# frozen_string_literal: true
module Bulkrax
class AddToCollectionJob < ApplicationJob
include DynamicRecordLookup
queue_as :import
# rubocop:disable Rails/SkipsModelValidations
def perform(importer_run_id, child_record_id, parent_record_id)
_, child_record = find_record(child_record_id, importer_run_id)
_, parent_record = find_record(parent_record_id, importer_run_id)
parent_record.try(:reindex_extent=, Hyrax::Adapters::NestingIndexAdapter::LIMITED_REINDEX)
child_record.member_of_collections << parent_record
child_record.save!
end
# rubocop:enable Rails/SkipsModelValidations
end
end
Running this with a test import of 1000 records, we saw a huge improvement in performance. In the log, I'm still seeing reindex events which seem superfluous, but it's not a huge amount per import anymore.
It might be the case that keeping the Create Relationships job 'single threaded' might help with performance further, I still need to test that. But it seems that parent_record.try(:reindex_extent=, Hyrax::Adapters::NestingIndexAdapter::LIMITED_REINDEX)
makes a huge difference.
Testing a smaller change here: https://github.com/cu-library/bulkrax/commit/91c6be0e0d8adadface16a8db849d5408aea7860
Testing a smaller change here: cu-library@91c6be0
After this change, we imported 1000 works in ~ 2.5 hrs, vs CreateRelationships failing to complete for the same size of import in 3, 4 days in Hyrax v3.5 & Bulkrax v5.1.
An earlier fork that created a job per adding a relationship wasn't any faster than the single threaded solution.
It's possible that setting HYRAX_USE_SOLR_GRAPH_NESTING in the env to 'true' might avoid this overindexing (without code changes) AND use the more efficient Solr graph queries from the upcoming 4.0.0 release. That's probably why no one else saw this performance hit after the latest Bulkrax release. They might already have that option enabled in their Hyrax 3.5.0-based IRs.
Running a test import with HYRAX_USE_SOLR_GRAPH_NESTING = true
with an unmodified Bulkrax resulted in a very speedy CreateRelationshipsJob
. A 1000 item import completed in ~1.5 hours. It looks like we've found a solution which is also forwards compatible with the upcoming Hyrax 4.0.0. Closing.
In the
csv_parser
, each record is imported using it's own import job. In contrast, all the work of adding records to collections happens serially inCreateRelationshipsJob
.One way to parallelize this would be to refactor
CreateRelationshipsJob
intoCreateRelationshipJob
. Move the logic fromCreateRelationshipsJob
'sperform
intoScheduleRelationshipsJob
'sperform
, and useCreateRelationshipsJob
'sprocess
asCreateRelationshipJob
'sperform
. EachCreateRelationshipJob
would only be responsible for adding the relationship for one record, not all the records in an import.Or, a new job called
AddToCollectionJob
could be written:then rewrite
CreateRelationshipsJob
to scheduleAddToCollectionJob
s inprocess
.This would speed up record imports significantly.