elrayle closed this issue 7 years ago
This has a pretty big performance penalty for large collections, right?
The current behavior (`<parentCol> pcdm:hasMember <subCol>`) would result in the time to update `parentCol` with each new sub-collection growing linearly with the number of sub-collections. I wouldn't expect this to be an issue until a collection had hundreds of sub-collections, though.
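To make the linear growth concrete, here is a toy cost model in plain Ruby. It assumes that saving a resource rewrites every membership triple stored on it; `Resource`, `writes`, and the two `nest_with_*` helpers are hypothetical names for illustration, not hydra-pcdm or Fedora API.

```ruby
# Toy model only: plain Ruby objects stand in for Fedora resources.
# Assumption: each save rewrites every membership triple on the resource.
class Resource
  attr_reader :name, :triples, :writes

  def initialize(name)
    @name = name
    @triples = [] # membership triples stored on this resource
    @writes = 0   # total triples written across all saves
  end

  # Adding a triple triggers a save that rewrites all stored triples.
  def add_triple(predicate, target)
    @triples << [predicate, target.name]
    @writes += @triples.size
  end
end

# Current direction: every membership triple lives on the parent.
def nest_with_has_member(count)
  parent = Resource.new("parent")
  count.times { |i| parent.add_triple(:has_member, Resource.new("sub#{i}")) }
  parent.writes
end

# Reversed direction: each sub-collection holds one triple of its own.
def nest_with_is_member_of(count)
  parent = Resource.new("parent")
  subs = Array.new(count) { |i| Resource.new("sub#{i}") }
  subs.each { |sub| sub.add_triple(:is_member_of, parent) }
  subs.sum(&:writes)
end

puts nest_with_has_member(300)   # 45150 triple writes in this model
puts nest_with_is_member_of(300) # 300 triple writes in this model
```

Under this assumption the per-insert cost on the parent grows with the number of existing sub-collections, while the reversed direction keeps each insert at a constant cost.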
@escowles thanks for explaining the details. Is this a use case we have to care about?
@jcoyne I don't know of anyone with hundreds of sub-collections, but I'd defer to the folks working on the collections sprint, who should know a lot more about the use cases.
@jcoyne @escowles RE: performance - The current implementation incurs the penalty for a large number of sub-collections. The proposed reversal provides the same improvements that were gained when the collection-to-work relationship was reversed.
The downside of the reversal is the same as for collection to works: the reversed relationship does not support ordering (unless more work has been done to add that support for the reversed relationship).
RE: Use case - There are several organizations that are migrating DSpace Communities and Collections into Hyrax nested collections. I confirmed with @blancoj that they have hundreds of sub-collections. Our DSpace implementation has sub-collections in the hundreds as well. If we migrate, this will also benefit our implementation.
@aaron-collier Does this affect your use case? Or others you know of from the migrations working group?
I believe @escowles's earlier work allowed for reversing the relationship but didn't remove the previous relationship, so it was up to the code using hydra-pcdm to choose which it preferred. I'm attempting to do the same for collection->collection relationships. If this goes through, wouldn't it make sense to do the same for object->object relationships by offering the option of reversing the relationship and having methods to collect all of one's parents/children of a particular type regardless of the direction of the relationship?
@blancoj referred me over here to add my thoughts as a repository manager: While we don't have 100s of subcollections in our DSpace instance (we top out at 6 right now, here: https://deepblue.lib.umich.edu/handle/2027.42/79040), it's easy to foresee a need for more than just 10s of these.
@cjcolvar Yes, I think supporting object-to-object relationships in both directions would make sense too — since you could have the same kinds of large numbers of references in one direction or another that would lead to performance problems.
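As a rough illustration of collecting parents and children regardless of which direction the relationship was stored in, the sketch below uses a hypothetical in-memory `TripleStore`. The class, its methods, and the `:has_member` / `:is_member_of` symbols are assumptions for this example and are not the hydra-pcdm API.

```ruby
# Hypothetical in-memory triple store, for illustration only.
class TripleStore
  def initialize
    @triples = [] # each entry is [subject, predicate, object]
  end

  def add(subject, predicate, object)
    triple = [subject, predicate, object]
    @triples << triple unless @triples.include?(triple)
    self
  end

  # Children: objects of the node's has_member triples, plus subjects
  # of is_member_of triples that point at the node.
  def children_of(node)
    forward = @triples.select { |s, p, _| s == node && p == :has_member }.map { |t| t[2] }
    reverse = @triples.select { |_, p, o| o == node && p == :is_member_of }.map { |t| t[0] }
    (forward + reverse).uniq
  end

  # Parents: subjects of has_member triples pointing at the node, plus
  # objects of the node's own is_member_of triples.
  def parents_of(node)
    forward = @triples.select { |_, p, o| o == node && p == :has_member }.map { |t| t[0] }
    reverse = @triples.select { |s, p, _| s == node && p == :is_member_of }.map { |t| t[2] }
    (forward + reverse).uniq
  end
end

store = TripleStore.new
store.add("col_a", :has_member, "col_b")   # stored in the forward direction
store.add("col_c", :is_member_of, "col_a") # stored in the reversed direction

p store.children_of("col_a") # ["col_b", "col_c"]
p store.parents_of("col_c")  # ["col_a"]
```

Callers see one merged view, so the choice of storage direction can be made per relationship without changing how members are read back.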
Descriptive summary
Reverse the relationship used for collections in collections. The current implementation is `collection has_members collections`. The proposed replacement relationship is `collection is_member_of collections`. This avoids the need to reindex the parent collection object with every insert and removal of sub-collections.
Rationale
Previously, there was work to reverse the relationship for works to be `work is_member_of collections`. Since sub-collections were not in use at the time, the collection-in-collection relationship was not reversed. Nesting of collections is currently being implemented in Hyrax. Before this functionality is released, the relationship should be reversed to avoid performance issues and the need for a migration in the future.
Expected behavior
When `Collection_B` is added as a member of `Collection_A`, an `is_member_of` triple pointing to `Collection_A` is added to `Collection_B`, and `Collection_B` is reindexed.
Actual behavior
When `Collection_B` is added as a member of `Collection_A`, a `has_members` triple pointing to `Collection_B` is added to `Collection_A`, and `Collection_A` is reindexed.
Steps to reproduce the behavior
Via tests
- Create `Collection_A` and `Collection_B`
- `Collection_A << Collection_B` (may not be exact syntax)
Verify
- `Collection_B.is_member_of` includes `Collection_A` (may not be exact syntax)
- `<pcdm:is_member_of> <Collection_A>` is in the Fedora object `Collection_B`
- `is_member_of` field includes `Collection_A`
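The steps above can be sketched against a toy stand-in. `ToyCollection`, `record_parent`, and `reindex_count` are hypothetical names invented for this example; the real hydra-pcdm classes, method names, and indexing hooks may differ, and the Fedora and Solr checks are not modeled here.

```ruby
# Hypothetical stand-in for a PCDM collection, for illustration only.
class ToyCollection
  attr_reader :name, :is_member_of, :reindex_count

  def initialize(name)
    @name = name
    @is_member_of = []  # parents recorded on this (child) object
    @reindex_count = 0  # how many times this object was reindexed
  end

  # parent << child records the relationship on the child only,
  # so the parent is never touched or reindexed.
  def <<(child)
    child.record_parent(self)
    self
  end

  def record_parent(parent)
    return if @is_member_of.include?(parent)

    @is_member_of << parent
    @reindex_count += 1
  end
end

collection_a = ToyCollection.new("Collection_A")
collection_b = ToyCollection.new("Collection_B")
collection_a << collection_b

# Verify the expected behavior described in this issue.
raise "membership missing" unless collection_b.is_member_of.include?(collection_a)
raise "parent was reindexed" unless collection_a.reindex_count.zero?
raise "child not reindexed" unless collection_b.reindex_count == 1
```

A real test would additionally inspect the stored Fedora triples and the Solr document for `Collection_B`.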
NOTE: I believe I saw `is_part_of` in the Solr doc instead of `is_member_of`. For all the predicates, verify against the previous reversal done for works.
Related work
https://github.com/samvera/hyrax/issues/1574 When I nest a collection, I need to run a reindex the relationships on that object.