samvera / hyrax

Hyrax is a Ruby on Rails Engine built by the Samvera community. Hyrax provides a foundation for creating many different digital repository applications.
http://hyrax.samvera.org/
Apache License 2.0

Explore whether to embed files in file set or store file ids in file set #4229

Open elrayle opened 4 years ago

elrayle commented 4 years ago

Descriptive summary

In Valkyrie, there are two options for making an association between two resources:

  1. Store the id of the other resource. In this case, all resources have their own rows in the database.
  2. Embed (nest) one resource in another. In this case, the parent resource has a row in the database and the child (nested) resources are stored in the same row as the parent.
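
To make the difference concrete, here is a minimal plain-Ruby sketch of the two layouts. It does not use Valkyrie itself: `TinyStore`, the hash shapes, and the ids and labels are all hypothetical stand-ins, with each entry in `rows` playing the part of one database row.

```ruby
# Plain-Ruby sketch. TinyStore and the hash shapes are hypothetical stand-ins,
# not Valkyrie classes; each entry in rows plays the part of one database row.
class TinyStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  def save(record)
    @rows[record.fetch(:id)] = record
  end
end

# Option 1: the file set stores only the ids of its files.
def store_with_ids
  store = TinyStore.new
  store.save({ id: "fm1", label: "original.tif" })
  store.save({ id: "fm2", label: "thumbnail.png" })
  store.save({ id: "fs1", file_ids: %w[fm1 fm2] })
  store
end

# Option 2: the files are embedded (nested) in the file set's own row.
def store_with_embedding
  store = TinyStore.new
  store.save({
    id: "fs1",
    files: [
      { id: "fm1", label: "original.tif" },
      { id: "fm2", label: "thumbnail.png" }
    ]
  })
  store
end

puts "rows when storing ids: #{store_with_ids.rows.size}"
puts "rows when embedding: #{store_with_embedding.rows.size}"
```

With ids, the file set and each file occupy separate rows; with embedding, everything lives in the file set's single row.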

This issue explores the impact of these choices on the Wings adapter and on Hyrax in general.

Impact on Wings

Abbreviations:

query_service.find_by(id: file_set)

NOTE: This assumes that a VR FileSet with embedded FileMetadata eager-loads its nested resources, so that they are accessible in memory from the VR FileSet (i.e. in vr_file_set.files.each, all files are already in memory and ready to be processed).

persister.save(resource: file_set)

This is not substantially different performance-wise.

custom_queries

  - find_file_metadata_id(id:)
  - find_file_metadata_by_alternate_identifier(alternate_identifier:)
  - find_many_file_metadata_by_ids(ids:)
  - find_many_file_metadata_by_use(resource:, use:), where resource is the VR file_set

Needs more analysis.
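
As one input to that analysis, here is a rough sketch of what find_many_file_metadata_by_use might reduce to if files were embedded and already in memory. Everything in it is an assumption for illustration: the `EmbeddedFile` and `EmbeddingFileSet` structs, the handler class name, and the idea that `use` is a single symbol (the real value may well be something else, such as a URI or a list).

```ruby
# Hypothetical stand-ins for the VR resources; only the fields used here.
EmbeddedFile = Struct.new(:id, :use, keyword_init: true)
EmbeddingFileSet = Struct.new(:id, :files, keyword_init: true)

# Handler laid out as a class with a queries list and a query_service,
# but not tied to any real adapter.
class FindManyFileMetadataByUse
  def self.queries
    [:find_many_file_metadata_by_use]
  end

  attr_reader :query_service

  def initialize(query_service: nil)
    @query_service = query_service
  end

  # With embedding, the files are already loaded, so this is an in-memory filter.
  def find_many_file_metadata_by_use(resource:, use:)
    resource.files.select { |file| file.use == use }
  end
end

file_set = EmbeddingFileSet.new(
  id: "fs1",
  files: [
    EmbeddedFile.new(id: "fm1", use: :original_file),
    EmbeddedFile.new(id: "fm2", use: :thumbnail_file),
    EmbeddedFile.new(id: "fm3", use: :original_file)
  ]
)

handler = FindManyFileMetadataByUse.new
matches = handler.find_many_file_metadata_by_use(resource: file_set, use: :original_file)
puts matches.map(&:id).inspect
```

The id-based queries in the list above are a different matter: with embedding, they would presumably first need to locate the owning file set, which this sketch does not attempt.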

tpendragon commented 4 years ago

One significant difference will be in how changes to the FileMetadata are persisted. You won't be able to just query for it, change it, and save it. With this strategy, changing one FileMetadata node will probably save all the other ones in the FileSet (at least for now).
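
A small plain-Ruby sketch of that save behavior, under the assumption that an embedded file can only be written by writing its parent row. `EmbeddedPersister` and `rename_embedded_file` are hypothetical names, not Valkyrie or Hyrax APIs.

```ruby
# Hypothetical persister for the embedding strategy: the only unit it can
# write is a whole file set row, nested files included.
class EmbeddedPersister
  attr_reader :rows, :file_records_written

  def initialize
    @rows = {}
    @file_records_written = 0
  end

  # Saving a file set serializes every nested file along with it.
  def save(file_set)
    @file_records_written += file_set[:files].size
    @rows[file_set[:id]] = Marshal.load(Marshal.dump(file_set)) # deep copy
  end
end

# Changing one file means mutating it in place and saving the whole parent.
def rename_embedded_file(persister, file_set, file_id, new_label)
  file = file_set[:files].find { |f| f[:id] == file_id }
  file[:label] = new_label
  persister.save(file_set)
end

def demo_embedded_save
  persister = EmbeddedPersister.new
  file_set = {
    id: "fs1",
    files: [
      { id: "fm1", label: "original" },
      { id: "fm2", label: "thumbnail" },
      { id: "fm3", label: "extracted text" }
    ]
  }
  persister.save(file_set) # initial save writes all three files
  rename_embedded_file(persister, file_set, "fm2", "renamed") # writes all three again
  persister
end

puts demo_embedded_save.file_records_written
```

One label changed, but the second save rewrote all three nested files, which is the cost being described.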

no-reply commented 4 years ago

> One significant difference will be in how changes to the FileMetadata are persisted. You won't be able to just query for it, change it, and save it. With this strategy, changing one FileMetadata node will probably save all the other ones in the FileSet (at least for now).

as discussed in slack, this is particularly an issue for Wings, which would need to query each FileMetadata independently in order to access the FileSet; i.e. it would create an N+1 query problem for FileSet access, over the number of FileMetadata objects.
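
a rough way to picture the N+1 shape, as a plain-Ruby sketch: `CountingBackend` and the helper methods below are hypothetical and only count lookups. they assume a backend that keeps each FileMetadata as its own record, so assembling a FileSet with its nested files takes one lookup for the FileSet plus one per file.

```ruby
# Hypothetical backend that stores every record separately and counts lookups.
class CountingBackend
  attr_reader :query_count

  def initialize(records)
    @records = records
    @query_count = 0
  end

  def find(id)
    @query_count += 1
    @records.fetch(id)
  end
end

# Assemble a file set with its files nested: 1 query for the file set,
# then 1 query per file.
def load_file_set_with_files(backend, file_set_id)
  file_set = backend.find(file_set_id)
  files = file_set[:file_ids].map { |id| backend.find(id) }
  file_set.merge(files: files)
end

# Build a file set with file_count files and report how many lookups it took.
def queries_needed(file_count)
  file_ids = (1..file_count).map { |i| "fm#{i}" }
  records = { "fs" => { id: "fs", file_ids: file_ids } }
  file_ids.each { |id| records[id] = { id: id } }
  backend = CountingBackend.new(records)
  load_file_set_with_files(backend, "fs")
  backend.query_count
end

[1, 5, 50].each { |n| puts "#{n} files -> #{queries_needed(n)} queries" }
```

the count grows linearly with the number of files, which is why the number of FileMetadata objects per FileSet matters here.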

there's a second question about whether the one-FileSet per File restriction is acceptable (discussion yesterday concluded loosely "yes").

and still a third question about whether it's acceptable for FileMetadata saves to necessitate FileSet saves. i'm less certain about this last one, leaning toward "it's not ideal, but may be worth it for the other benefits of the nesting model".

i'm stuck on the first issue though, and don't think we can seriously consider nesting without well-considered benchmarks showing the N+1 issue to be a non-problem up to a "reasonable" number of FileSets. how many is "reasonable"? without telemetry data, i think our best bet would be to ask in slack, via email, and in Samvera Tech.
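
for what it's worth, the benchmark harness itself could be quite small. the sketch below is only a starting point: `time_load` is a hypothetical helper, the sizes are arbitrary, and the block passed to it is a stand-in that builds hashes in memory. a real run would replace that block with an actual load through the adapter being measured.

```ruby
# Time a loader at several sizes, keeping the fastest of a few runs per size.
# The block receives the size and should perform one complete load.
def time_load(sizes, runs: 5)
  sizes.to_h do |size|
    elapsed = Array.new(runs) do
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      yield size
      Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    end
    [size, elapsed.min]
  end
end

# Stand-in loader (an assumption): swap in a real lookup to get meaningful numbers.
simulated_load = ->(size) { (1..size).map { |i| { id: "fm#{i}" } } }

results = time_load([1, 10, 100, 1_000]) { |size| simulated_load.call(size) }
results.each { |size, seconds| puts format("%5d: %.6f s", size, seconds) }
```

taking the minimum over several runs is a simple way to damp noise; it says nothing about variance, so a serious comparison would want more than this.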