samvera / hyrax

Hyrax is a Ruby on Rails Engine built by the Samvera community. Hyrax provides a foundation for creating many different digital repository applications.
http://hyrax.samvera.org/
Apache License 2.0

WINGS: convert AF FileSet Files into FileMetadata in Valkyrie Resource FileSet #3940

Closed elrayle closed 1 year ago

elrayle commented 5 years ago

Descriptive summary

Handle the conversion of files from ActiveFedora to Valkyrie.

Background

Creation of Files

ActiveFedora

AF:Files are created by Hydra::Works::AddFileToFileSet, which creates the AF:File and attaches it to the AF:FileSet passed in.

Valkyrie Resource (Wings specific)

AF:Files are created by the Wings:Works:AddFileToFileSet service. The call chain is FileActor, then FileMetadataBuilder, then the ActiveFedora storage adapter, then the service. The service creates the AF:File and attaches it to the AF:FileSet that was converted from the passed-in Valkyrie:Resource FileSet.

File Metadata

Active Fedora

The file metadata is stored either on the AF:File (e.g. original_name, mime_type, contents) or on the File's metadata_node (e.g. all metadata set by FITS).

Valkyrie Resource

All file metadata is stored in the Hydra::FileMetadata Valkyrie resource. Currently the FileMetadata resource is not part of the conversion process, so it is never created. Creating it is part of the focus of this issue.

File Relationships

Active Fedora

files, original_file, extracted_text, and thumbnail are relationships on the AF fileset defined by...

    directly_contains_one :original_file, through: :files, type: ::RDF::URI('http://pcdm.org/use#OriginalFile'), class_name: 'Hydra::PCDM::File'
    directly_contains_one :thumbnail, through: :files, type: ::RDF::URI('http://pcdm.org/use#ThumbnailImage'), class_name: 'Hydra::PCDM::File'
    directly_contains_one :extracted_text, through: :files, type: ::RDF::URI('http://pcdm.org/use#ExtractedText'), class_name: 'Hydra::PCDM::File'

Valkyrie Resources

    attribute :file_ids, Valkyrie::Types::Set.of(Valkyrie::Types::ID)
    attribute :original_file_ids, Valkyrie::Types::Set.of(Valkyrie::Types::ID)
    attribute :extracted_text_ids, Valkyrie::Types::Set.of(Valkyrie::Types::ID)
    attribute :thumbnail_ids, Valkyrie::Types::Set.of(Valkyrie::Types::ID)

Current Conversion Process for FileSets

Currently, when converting the AF FileSet to a Valkyrie resource, the file relationships are set in the resource such that...

file_ids = af_fileset.files.map(&:id)
original_file_ids = [af_fileset.original_file.id]
extracted_text_ids = [af_fileset.extracted_text.id] # not copied for some reason
thumbnail_ids = [af_fileset.thumbnail.id]

The code for this is spread across several locations in Wings::ModelTransformer.

The relationship_keys_for method adds the relationships to the list of resource attributes. It derives each attribute name from the AF property name by singularizing it and appending '_ids'.

The to_valkyrie_resource_class method then adds them to the FileSet resource class through its relationship_keys variable.

AttributeTransformer sets the values in the new FileSet resource. It converts each attribute name back to the AF FileSet property, calls that method on the AF FileSet to get the value, and, for these relationships, reduces the value to its id before setting the resource attribute.
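The name and value handling described above can be sketched in plain Ruby. This is a simplified illustration, not the Wings::ModelTransformer code: the Struct classes and the `singularize`, `relationship_key_for`, `af_name_for`, and `relationship_attributes` helpers are hypothetical stand-ins, and the naive singularization only handles the four relationship names listed earlier.

```ruby
# Hypothetical stand-ins for the ActiveFedora objects; not the real classes.
AfFile = Struct.new(:id)
AfFileSet = Struct.new(:files, :original_file, :extracted_text, :thumbnail)

RELATIONSHIPS = %i[files original_file extracted_text thumbnail].freeze

# Naive singularization, sufficient for the four names above.
def singularize(name)
  name.to_s.sub(/s\z/, '')
end

# AF property name -> resource attribute name (:files -> :file_ids).
def relationship_key_for(af_name)
  :"#{singularize(af_name)}_ids"
end

# Resource attribute name -> AF property name (:file_ids -> :files).
def af_name_for(key)
  base = key.to_s.delete_suffix('_ids')
  RELATIONSHIPS.find { |name| singularize(name) == base }
end

# Call each AF property and reduce the result to an array of ids.
# Wrapping in an array, then flatten + compact, handles a collection,
# a single file, and nil (an unset relationship) without raising.
def relationship_attributes(af_file_set)
  RELATIONSHIPS.each_with_object({}) do |af_name, attrs|
    key = relationship_key_for(af_name)
    value = af_file_set.public_send(af_name_for(key))
    attrs[key] = [value].flatten.compact.map(&:id)
  end
end

original = AfFile.new('fs1/files/original')
thumb = AfFile.new('fs1/files/thumb')
file_set = AfFileSet.new([original, thumb], original, nil, thumb)
attrs = relationship_attributes(file_set)
# attrs[:extracted_text_ids] is [] because extracted_text is nil here.
```

Compared with the pseudocode above, the one deliberate difference is that a missing relationship produces an empty list rather than a call to `id` on nil.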

Potential Conversion Process for Files

Goal for Wings:

GOAL from Resource to AF:File

If given a Valkyrie Resource FileSet, create an AF:FileSet and establish the connection between the AF:FileSet and an AF:File for each file. (Will be the focus of a later issue.)

GOAL from AF:File to Resource

If given an AF:FileSet, create a Valkyrie Resource FileSet and establish the connection between the Valkyrie FileSet and a Hydra:FileMetadata resource for each file. (Is the focus of this issue.)

There are two potential paths forward for converting files.

OPTION 1: On the fly

When a FileSet is converted, the ids of the file relationships are stored on the FileSet. At this point, there are no FileMetadata resources. When a file or its metadata is needed, a separate conversion process will be kicked off to...
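One way to picture the on-the-fly option is a lookup that converts a file only when it is first requested. The sketch below is an illustration under assumptions: `LazyFileSet`, `FileMetadataStub`, and the `converter` callable are hypothetical names, and the per-id cache is an addition that the issue does not specify.

```ruby
# Hypothetical stand-in for a file metadata resource.
FileMetadataStub = Struct.new(:id, :original_name)

# Hypothetical FileSet that stores only ids and converts files lazily.
class LazyFileSet
  attr_reader :file_ids

  # converter: any callable that maps a file id to a FileMetadataStub.
  def initialize(file_ids:, converter:)
    @file_ids = file_ids
    @converter = converter
    @cache = {}
  end

  # Convert on first access, then reuse the cached result.
  def file_metadata(id)
    raise ArgumentError, "unknown file id: #{id}" unless file_ids.include?(id)

    @cache[id] ||= @converter.call(id)
  end
end

conversions = 0
converter = lambda do |id|
  conversions += 1
  FileMetadataStub.new(id, "#{id}.tif")
end

lazy = LazyFileSet.new(file_ids: %w[f1 f2], converter: converter)
lazy.file_metadata('f1')
lazy.file_metadata('f1') # served from the cache; conversions stays at 1
```

Without the cache, every access would run the conversion again, which is the repeated cost weighed in the analysis further down.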

PRO:

CON:

OPTION 2: Embedded resource

When a Valkyrie Resource FileSet class is generated, the file attributes will be defined as embedded resources instead of a set of ids...

    attribute :files, Valkyrie::Types::Set.of(Hydra::FileMetadata)
    attribute :original_files, Valkyrie::Types::Set.of(Hydra::FileMetadata)
    attribute :extracted_texts, Valkyrie::Types::Set.of(Hydra::FileMetadata)
    attribute :thumbnails, Valkyrie::Types::Set.of(Hydra::FileMetadata)

When an AF FileSet is converted, for each file...
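A minimal sketch of the embedded option, assuming each file is converted eagerly while the FileSet is converted. The Struct classes and the `embed` and `convert_eagerly` helpers are hypothetical and stand in for the real ActiveFedora and Valkyrie classes; the metadata fields copied are only the ones named earlier in this issue.

```ruby
# Hypothetical stand-ins; not the real ActiveFedora or Valkyrie classes.
SourceFile = Struct.new(:id, :original_name, :mime_type)
EmbeddedFileMetadata = Struct.new(:id, :original_name, :mime_type)
EmbeddedFileSet = Struct.new(:files, :original_files, :extracted_texts, :thumbnails)

# Build one metadata object per file; nil (an unset relationship) becomes [].
def embed(files)
  [files].flatten.compact.map do |file|
    EmbeddedFileMetadata.new(file.id, file.original_name, file.mime_type)
  end
end

# Eagerly convert every file relationship at FileSet conversion time.
def convert_eagerly(files:, original_file:, extracted_text:, thumbnail:)
  EmbeddedFileSet.new(embed(files), embed(original_file),
                      embed(extracted_text), embed(thumbnail))
end

tiff = SourceFile.new('f1', 'page.tif', 'image/tiff')
resource = convert_eagerly(files: [tiff], original_file: tiff,
                           extracted_text: nil, thumbnail: nil)
```

In this sketch a file reachable through two relationships (here `files` and `original_file`) is copied into both sets, so the same metadata exists twice on the resource.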

PRO:

CON:

Analysis

There is a time trade-off: Option 2 pays the conversion cost up front, during FileSet conversion, for the FileSet and all of its Files, whereas Option 1 potentially converts the Files again on each access, which could be more costly in the long run.
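The trade-off can be made concrete with a toy count of conversions. This is only a rough model under stated assumptions: every conversion costs the same, the on-the-fly path does no caching, and `accesses_per_file` is a hypothetical workload parameter, not a measured value.

```ruby
# Conversions when every file is converted once, during FileSet conversion.
def eager_conversions(file_count)
  file_count
end

# Conversions when each access triggers a fresh conversion (no caching).
def lazy_conversions(file_count, accesses_per_file)
  file_count * accesses_per_file
end

# Which strategy performs fewer conversions under this simple model.
def cheaper_strategy(file_count, accesses_per_file)
  lazy = lazy_conversions(file_count, accesses_per_file)
  eager = eager_conversions(file_count)
  return :equal if lazy == eager

  lazy < eager ? :on_the_fly : :embedded
end

cheaper_strategy(4, 0) # files never accessed: on-the-fly does no work
cheaper_strategy(4, 3) # each file accessed three times: up-front wins
```

Under these assumptions the break-even point is one access per file; a cache on the on-the-fly path would change that.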

Related work

PR #3939 wings wrapper for Hydra:Works:AddFileToFileSet

no-reply commented 3 years ago

is this fixed?

hackartisan commented 1 year ago

Depending on how file sets and file metadata are modeled, nested resources (https://github.com/samvera/hyrax/issues/3662) may become relevant

no-reply commented 1 year ago

i think a helpful first step on this ticket would be to try to reproduce the behavior in "Current Conversion Process for FileSets" above.

is this still current? (it's not clear to me what's negative about the described behavior) are the goals listed in "Goal for Wings:" met? if not, is there value in meeting them?

tpendragon commented 1 year ago

This is a doozy. My understanding is that this ticket boils down to "where does pcdm:Use get saved for each file in both Wings (ActiveFedora) and not-wings"

Some very confusing spelunking says "in Wings if using Fedora to store binaries it stores a file_set#original_file_id which Hydra::PCDM::File.find knows how to resolve to a PCDM object that then knows how to become a valkyrie object", "in Wings if not using Fedora it stores FileMetadata nodes as their own resources", and "in not-wings it stores FileMetadata nodes as their own rows in the table."

I don't know what the success criteria are. Close this ticket until there's an identified bug? I suspect this is going to be a problem for migration, but frankly the logic's so interconnected that I have a hard time following it, and I wrote a chunk of it.