sul-dlss / happy-heron

Self-Deposit for the Stanford Digital Repository (SDR): H2 is a Rails web application enabling users to deposit scholarly content into SDR
Apache License 2.0

Research: How to know if files change when performing deposit #3485

Closed justinlittman closed 8 months ago

justinlittman commented 8 months ago

A user version requirement is that when a deposit is performed, if any files have changed since the previous deposited version, a new user version will automatically be created. (If no files have changed, the user will be given the option of creating a new user version.)

Please research what approach might be used to determine if any files have changed since the previous deposited version.

lwrubel commented 8 months ago

While you can change the metadata for a file in H2 (e.g. label or visibility), H2 doesn't currently seem to have the notion of a "changed file", where the contents of an existing file have changed. To "change" a file, you delete the existing file and then upload a new one with the same name. This creates a new AttachedFile. The form to edit a work does not allow a file with the same name as an existing one to be uploaded.

If you're doing the file replacement via Globus, all existing AttachedFiles are first deleted.

With this current approach, it would be possible to tell whether there are "changed" files the same way you would check for new files, by checking the service_name of the WorkVersion's AttachedFile blobs and looking for those that do not have a service_name of preservation. The existing WorkVersion.staged_files method does this using ActiveStorage::Service::SdrService.accessible?(af.file.blob).

To detect whether a file was "changed" by being deleted, you have to compare the current WorkVersion's attached files with the previous WorkVersion's.
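A minimal sketch of the service_name check, in plain Ruby with Struct stand-ins rather than the real ActiveRecord and ActiveStorage models. Blob, AttachedFile, staged? and new_or_replaced_files are hypothetical simplifications for illustration only; the assumption is that a blob's service_name is preservation once the file has been deposited, and anything else (the value local below is made up) means the file was newly uploaded.

```ruby
# Hypothetical stand-ins for the H2 models; these are not the real classes.
Blob = Struct.new(:filename, :service_name)
AttachedFile = Struct.new(:blob, :hide)

PRESERVATION_SERVICE = 'preservation'

# Assumption: a file is new (or a replacement) if its blob is not yet
# stored in the preservation service.
def staged?(attached_file)
  attached_file.blob.service_name != PRESERVATION_SERVICE
end

# Returns the attached files that look new or replaced.
def new_or_replaced_files(attached_files)
  attached_files.select { |af| staged?(af) }
end

files = [
  AttachedFile.new(Blob.new('data.csv', 'preservation'), false),
  AttachedFile.new(Blob.new('readme.txt', 'local'), false)
]

p new_or_replaced_files(files).map { |af| af.blob.filename }
# prints ["readme.txt"]
```

This only catches additions and replacements. A deletion leaves no staged blob behind, which is why the comparison with the previous WorkVersion is still needed.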

amyehodge commented 8 months ago

@lwrubel What about additions to the list of files? Or whether files have been hidden or unhidden?

lwrubel commented 8 months ago

If a file has been added, then there would be a new AttachedFile. Replaced files currently look like new files.

If it hasn't already been discussed, I think we should talk about hiding/unhiding as part of the User Version requirements--would changing those file metadata fields mean a user version should be automatically created? While that's a metadata change, I'm guessing the user would expect that to be a new version.

amyehodge commented 8 months ago

@lwrubel When @andrewjbtw and I discussed changes to the files, I'm not sure we actually covered hiding/unhiding. Or changes to the file descriptions. To me hiding/unhiding seems a significant enough change to warrant the new version. I could go either way on descriptions, but since it's in the file upload section of the UI, it might be easier for users to understand as all part of "file changes."

andrewjbtw commented 8 months ago

I agree that hiding/unhiding makes a new user version because it does change what the citeable data contains.

lwrubel commented 7 months ago

Some thoughts on how to determine if files have changed on an H2 item and therefore a new User Version should be created:

If someone does a Globus or Zipfile upload, it should always create a new User Version.

  1. Files have been added: check the service_name of the WorkVersion's AttachedFile blobs and see if any do NOT have a service_name of preservation. The existing WorkVersion.staged_files method finds preserved files using ActiveStorage::Service::SdrService.accessible?(af.file.blob), so we want to reject those. This is the same for files that have been replaced, since that involves deleting the file and uploading a new version.

  2. Files have been removed: compare the WorkVersion's AttachedFile filenames with the previous H2 version's AttachedFile filenames. If there is a filename in the previous version that does not match a filename on the current work version, we could assume there was a deletion. The filename includes the path for hierarchical zip and Globus deposits. Multiple copies of a file could be in different directories, so we need to match on the filename (with its path) instead of checksums.

  3. One or more files have been hidden or unhidden: compare the hide field for each AttachedFile with the previous H2 version's AttachedFile hide field, matching on filename. Some deposits have hundreds of files, so this comparison could stop as soon as the first change is found (meaning we need a new User Version). If any files have been added or removed, we would not need to check for this, since the deposit would already meet the criteria for being a new User Version.
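
The three checks above can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the H2 implementation: FileEntry and every method name here are hypothetical, filename is assumed to include the path, and staged stands in for "the blob's service_name is not preservation". The rule that a Globus or zip file upload always creates a new User Version is not modeled.

```ruby
require 'set'

# Hypothetical, simplified view of an AttachedFile; not the real model.
FileEntry = Struct.new(:filename, :hide, :staged, keyword_init: true)

# Check 1: any file not yet in preservation counts as added or replaced.
def files_added?(current)
  current.any?(&:staged)
end

# Check 2: a filename (with path) present before but missing now was removed.
def files_removed?(current, previous)
  current_names = current.map(&:filename).to_set
  previous.any? { |f| !current_names.include?(f.filename) }
end

# Check 3: compare hide flags, matching on filename. any? stops at the
# first difference, so large deposits are not scanned further than needed.
def visibility_changed?(current, previous)
  previous_hide = previous.to_h { |f| [f.filename, f.hide] }
  current.any? do |f|
    previous_hide.key?(f.filename) && previous_hide[f.filename] != f.hide
  end
end

# || short-circuits, so the hide comparison only runs when nothing was
# added or removed.
def new_user_version_required?(current, previous)
  files_added?(current) ||
    files_removed?(current, previous) ||
    visibility_changed?(current, previous)
end

previous = [
  FileEntry.new(filename: 'dir/a.txt', hide: false, staged: false),
  FileEntry.new(filename: 'b.txt', hide: false, staged: false)
]
now_hidden = [
  FileEntry.new(filename: 'dir/a.txt', hide: false, staged: false),
  FileEntry.new(filename: 'b.txt', hide: true, staged: false)
]

p new_user_version_required?(now_hidden, previous) # prints true
p new_user_version_required?(previous, previous)   # prints false
```

Matching on the full filename rather than a checksum keeps identical copies in different directories distinct, as noted in step 2.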