sul-dlss / cocina-models

Cocina repository data model (implemented in Ruby)
https://sul-dlss.github.io/cocina-models/

New field for storing when a user uploaded a single ZIP with hierarchy in an object #547

Closed peetucket closed 2 years ago

peetucket commented 2 years ago

In sul-dlss/happy-heron#2843, we will be adding an option to allow users to upload a single ZIP file for expansion on the server, giving users access to the individual files. Since systems outside of H2 need to know this option was selected in order to drive display behavior in Argo and sul-embed, we may need to store this in the structural or other part of the cocina model.

The value will be at the object level.

Engineer questions:

justinlittman commented 2 years ago

@peetucket One possible implementation is that the unzipping happens entirely within H2; thus H2 is the only system that needs to know if a zip should be extracted. Is there any other accessioning application that needs zip functionality?

peetucket commented 2 years ago

Yes, that is one option. We can discuss as a team.

Basically I think the two options are:

  1. Expand in H2 and accession individual files. No knowledge needed past H2.
  2. Do not expand in H2, accession a single ZIP file all the way through to preservation, and then teach the access systems how to allow access to the files contained within it. Knowledge is needed throughout so the access systems know when to do this, since we don't necessarily want to do this automatically for anything that is a ZIP file.

Andrew has a preference for option 2, which is how I am currently writing up the tickets, though there will be pros and cons of each. I will modify tickets as the POs and engineers provide feedback.

jcoyne commented 2 years ago

How will the derivatives for the files in the zip get created so that it can be used by the access systems? How do we deal with 2 zips that both have files with the same names inside of them?

peetucket commented 2 years ago

Not sure how internals would be accessed, perhaps this is not feasible without doing something like expanding the ZIP on the access systems. Could be an area of investigation to see if option #2 is even really an option. But the idea is that only a single ZIP would be allowed to prevent name clashes.

jcoyne commented 2 years ago

We already have objects that have more than one zip file in the repository: https://argo.stanford.edu/view/druid:cw226nt8831

justinlittman commented 2 years ago

Option 2 is significantly more complex and likely to have unintended / unexpected consequences. Are there use cases to support the preferred implementation?

peetucket commented 2 years ago

> We already have objects that have more than one zip file in the repository: https://argo.stanford.edu/view/druid:cw226nt8831

Understood - but in the proposed implementation, this would not be allowed for specific objects (validated at the H2 level).

> Option 2 is significantly more complex and likely to have unintended / unexpected consequences. Are there use cases to support the preferred implementation?

We will need more input from @andrewjbtw (and @amyehodge). Part of the concern behind asking for this approach may have been what happens if a ZIP has many thousands of files and/or large content that causes problems in accessioning. That is not a user requirement, though, but rather something we could work out in the implementation.

andrewjbtw commented 2 years ago

Maybe I misunderstood the conversation, but I thought allowing more than one zip was a reason to use zip over direct folder upload, where uploading multiple folders could get messy. We were going to prohibit uploading more than one folder.

justinlittman commented 2 years ago

We've passed the threshold for a ticket conversation; sounds like this needs a meeting.

peetucket commented 2 years ago

Closing as no longer needed - changing approach to how ZIPs are accessioned via H2.