samvera / hydra-pcdm

Samvera implementation of the PCDM model

Reverse `collection has_members collections` to be `collection is_member_of` collections #243

Closed elrayle closed 6 years ago

elrayle commented 6 years ago

Descriptive summary

Reverse the relationship used for collections in collections. The current implementation is collection has_members collections. The new proposed replacement relationship is collection is_member_of collections. This avoids the need to reindex the collection object with every insert and removal of sub-collections.

Rationale

Previously, the relationship for works was reversed to be work is_member_of collections. Since sub-collections were not in use at the time, the collection-in-collection relationship was not reversed. Nesting of collections is currently being implemented in Hyrax. Before this functionality is released, the relationship should be reversed to avoid performance issues and the need for a future migration.

Expected behavior

When Collection_B is added as a member of Collection_A, the following triple is added:

<Collection_B> <pcdm:is_member_of> <Collection_A> 

and Collection_B is reindexed.

Actual behavior

When Collection_B is added as a member of Collection_A, the following triple is added:

<Collection_A> <pcdm:has_members> <Collection_B> 

and Collection_A is reindexed.
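
To make the difference between the two directions concrete, here is a toy sketch in plain Ruby. It is not the hydra-pcdm API: `ToyCollection`, `nest_with_has_members`, and `nest_with_is_member_of` are hypothetical names, and "reindexing" is modeled as nothing more than a counter on whichever resource holds the triple.

```ruby
require 'set'

# Toy in-memory stand-in for a repository resource (hypothetical, not hydra-pcdm).
class ToyCollection
  attr_reader :id, :triples, :reindex_count

  def initialize(id)
    @id = id
    @triples = Set.new # triples whose subject is this resource
    @reindex_count = 0
  end

  # Assumption: the resource that stores the triple is the one that gets reindexed.
  def add_triple(predicate, object)
    @triples << [id, predicate, object.id]
    @reindex_count += 1
  end
end

# Actual behavior: the parent stores the triple, so the parent is reindexed.
def nest_with_has_members(parent, child)
  parent.add_triple('pcdm:has_members', child)
end

# Expected behavior: the child stores the triple, so the child is reindexed.
def nest_with_is_member_of(parent, child)
  child.add_triple('pcdm:is_member_of', parent)
end

collection_a = ToyCollection.new('Collection_A')
collection_b = ToyCollection.new('Collection_B')
nest_with_is_member_of(collection_a, collection_b)

puts collection_b.triples.to_a.inspect
puts "Collection_A reindexed: #{collection_a.reindex_count}, Collection_B reindexed: #{collection_b.reindex_count}"
```

In this model, nesting with `is_member_of` leaves Collection_A untouched no matter how many sub-collections it gains.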

Steps to reproduce the behavior

Via tests

  1. Create Collection_A and Collection_B
  2. Collection_A << Collection_B (may not be exact syntax)

Verify

NOTE: I believe I saw is_part_of in the solr doc instead of is_member_of. For all the predicates, verify against the previous reversal done for works.

Related work

https://github.com/samvera/hyrax/issues/1574 When I nest a collection, I need to run a reindex the relationships on that object.

jcoyne commented 6 years ago

This has a pretty big performance penalty for large collections, right?

escowles commented 6 years ago

The current behavior (<parentCol> pcdm:hasMember <subCol>) would result in the time to update parentCol with each new sub-collection growing linearly with the number of sub-collections. I wouldn't expect this to be an issue until a collection had hundreds of sub-collections, though.
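
A rough way to picture that growth is a hypothetical cost model in plain Ruby. The assumption, which is mine and not measured against Fedora or Solr, is that saving a resource costs one unit per membership reference it holds after the change. The method names are made up for illustration.

```ruby
# Parent holds the references: the k-th insert re-saves a parent with k references.
def cost_when_parent_holds_refs(sub_collection_count)
  parent_refs = 0
  (1..sub_collection_count).sum do
    parent_refs += 1
    parent_refs # cost of this save grows with the parent's size
  end
end

# Child holds the reference: each insert saves one child with one reference.
def cost_when_child_holds_ref(sub_collection_count)
  (1..sub_collection_count).sum { 1 }
end

[10, 100, 500].each do |n|
  puts "#{n} sub-collections: parent-held=#{cost_when_parent_holds_refs(n)}, " \
       "child-held=#{cost_when_child_holds_ref(n)}"
end
```

Under this model, the per-insert cost is linear in the number of existing sub-collections when the parent holds the references, so the total over many inserts grows roughly quadratically. With child-held references the total stays linear.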

jcoyne commented 6 years ago

@escowles thanks for explaining the details. Is this a use case we have to care about?

escowles commented 6 years ago

@jcoyne I don't know of anyone with hundreds of sub-collections, but I'd defer to the folks working on the collections sprint, who should know a lot more about the use cases.

elrayle commented 6 years ago

@jcoyne @escowles RE: performance - The current implementation incurs the penalty when a collection has a large number of sub-collections. The proposed reversal provides the same improvement that was gained when the collection-to-work relationship was reversed.

The downside of the reversal is the same as for collection-to-work: the reversed relationship does not support ordering (unless more work has been done to add that support for the reversed relationship).

RE: Use Case - There are several organizations that are migrating DSpace Communities and Collections into Hyrax nested collections. I confirmed with @blancoj that they have hundreds of sub-collections. Our DSpace implementation has sub-collections in the hundreds as well. If we migrate, this will also benefit our implementation.

@aaron-collier Does this affect your use case? Or others you know of from the migrations working group?

elrayle commented 6 years ago

@cjcolvar's work in progress is at:

https://github.com/samvera/hydra-pcdm/commit/549605515d3c91555e9750207d18cc91e5df85ef
https://github.com/samvera/hydra-works/commit/9a5339852681fd0beb5ed2e43c2cc704244e4084

cjcolvar commented 6 years ago

I believe @escowles's earlier work allowed for reversing the relationship but didn't remove the previous relationship, so it was up to the code using hydra-pcdm to choose which direction it preferred. I'm attempting to do the same for collection->collection relationships. If this goes through, wouldn't it make sense to do the same for object->object relationships: offer the option of reversing the relationship, and provide methods to collect all of one's parents/children of a particular type regardless of the direction of the relationship?
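
A minimal sketch of what direction-agnostic lookup could look like, in plain Ruby over an array of `[subject, predicate, object]` triples. This is only an illustration of the idea: `children_of` and `parents_of` are hypothetical helpers, not methods in hydra-pcdm, and the predicate labels are the ones used earlier in this issue.

```ruby
HAS_MEMBERS  = 'pcdm:has_members'
IS_MEMBER_OF = 'pcdm:is_member_of'

# Children of `id`, whether the parent or the child stores the relationship.
def children_of(id, triples)
  via_parent = triples.select { |s, p, _o| s == id && p == HAS_MEMBERS }.map { |t| t[2] }
  via_child  = triples.select { |_s, p, o| o == id && p == IS_MEMBER_OF }.map { |t| t[0] }
  (via_parent + via_child).uniq
end

# Parents of `id`, whether the parent or the child stores the relationship.
def parents_of(id, triples)
  via_parent = triples.select { |_s, p, o| o == id && p == HAS_MEMBERS }.map { |t| t[0] }
  via_child  = triples.select { |s, p, _o| s == id && p == IS_MEMBER_OF }.map { |t| t[2] }
  (via_parent + via_child).uniq
end

triples = [
  ['Collection_A', HAS_MEMBERS, 'Collection_B'],  # stored on the parent
  ['Collection_C', IS_MEMBER_OF, 'Collection_A'], # stored on the child
  ['Collection_B', IS_MEMBER_OF, 'Collection_A']  # same link, other direction
]

puts children_of('Collection_A', triples).inspect
puts parents_of('Collection_B', triples).inspect
```

The `uniq` call matters when both directions have been recorded for the same pair, which can happen if the old relationship is kept alongside the reversed one.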

JimOttaviani commented 6 years ago

@blancoj referred me over here to add my thoughts as a repository manager: While we don't have 100s of subcollections in our DSpace instance (we top out at 6 right now, here: https://deepblue.lib.umich.edu/handle/2027.42/79040), it's easy to foresee a need for more than just 10s of these.

escowles commented 6 years ago

@cjcolvar Yes, I think supporting object-to-object relationships in both directions would make sense too, since you could have the same kinds of large numbers of references in one direction or the other that would lead to performance problems.