sul-dlss / pre-assembly

Rails app - prepares objects for assembly workflow and allows discovery report
https://consul.stanford.edu/display/chimera/Automated+Accessioning+and+Object+Remediation+%28pre-assembly+and+assembly%29

Add validation to prevent name collisions between folder/file names and the druid name #1507

Closed andrewjbtw closed 3 months ago

andrewjbtw commented 3 months ago

Related to https://github.com/sul-dlss/cocina-models/issues/732

Support for both the old and new Stacks file layouts means that we can't allow the root content folder of a deposit to contain a folder or file whose name is identical to the item's druid. We should catch this in Preassembly so that items do not get caught in accessioning.

My suggestion is that we follow the strategy already in place to prevent using file hierarchies with non-file content types. I don't remember exactly how that was implemented but it prevents an item from starting accessioning if it breaks the validation rule for hierarchical files. I don't think this is a Cocina-level validation.

Note that for preassembly, most users stage their content using this pattern:

.
|--manifest.csv
|--[druid]
|     |--content_file1
|     |--content_file2

That staging file layout uses the druid as the name of the folder that serves as the container used to carry files into the SDR. This is permitted because that "container" is discarded and only the content files within the container are added to the druid.

What we need to prevent would look like this:

.
|--manifest.csv
|--[druid]
|     |--[druid]

In that layout, the container has a folder (or file) named for the druid within the content itself. If this item got shelved, it would create a name collision.
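
The rule described above could be sketched roughly as follows. This is only an illustration, not the actual Preassembly implementation: the method names `root_entries_colliding_with_druid` and `validate_no_druid_collision!` are hypothetical, and it assumes the check runs against the staged container folder before any files are copied.

```ruby
# Hypothetical helpers, not part of pre-assembly's real code.
# container_path: the staged folder named for the druid (the "container"
#                 that is discarded when content is carried into the SDR).
# druid:          the bare identifier, e.g. "xs951nf4814".
def root_entries_colliding_with_druid(container_path, druid)
  # Dir.children lists only the immediate entries (files and folders),
  # excluding "." and "..", so nested names are not inspected here.
  Dir.children(container_path).select { |name| name == druid }
end

# Raises before anything is copied if the root of the content holds
# a file or folder named exactly like the druid.
def validate_no_druid_collision!(container_path, druid)
  collisions = root_entries_colliding_with_druid(container_path, druid)
  return if collisions.empty?

  raise ArgumentError,
        "#{container_path} contains a file or folder named exactly '#{druid}'"
end
```

Running a check like this before staging anything under /dor/assembly would mean a failing item leaves no partial files behind.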

justinlittman commented 3 months ago

For consideration: Generate cocina early so that it is validated and catches problems.

andrewjbtw commented 3 months ago

To test this:

  1. Use the following structure to stage the files:
.
├── manifest.csv
└── xs951nf4814
    ├── file.txt
    └── xs951nf4814
        └── hello.txt

(In this example the druid is the name of a folder within the content but the outcome is the same whether it is a folder or a file.)

  2. Run Preassembly.

  3. See that Preassembly shows an error.

  4. Check the /dor/assembly filesystem. If files are left behind then it means Preassembly ran part way before the cocina validation kicked in:

/dor/assembly/xs/951/nf/4814
└── xs951nf4814
    ├── content
    │   ├── file.txt
    │   └── xs951nf4814
    │       └── hello.txt
    └── metadata

aaron-collier commented 3 months ago

@andrewjbtw should no file or folder name include the druid? Is druid.pdf OK, for example? Or is it just folders that shouldn't be named for the druid?

Thanks.

andrewjbtw commented 3 months ago

There's no problem with a folder or filename including the druid, as long as the name includes more than just the druid. So druid.pdf, druid.tif, etc. are all ok.

The problem is specifically when a file or folder name exactly matches the druid. The type of failure is different when it's a file or a folder, but both cases should be disallowed.
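
To make the distinction concrete, here is a minimal sketch of an exact-match test over relative content paths. The method name `collides_with_druid?` is hypothetical, and the sketch assumes (per the issue description) that only the first path segment, i.e. an entry at the root of the content, can collide.

```ruby
# Hypothetical helper, not pre-assembly's actual validation.
# relative_paths: content paths relative to the root content folder.
# druid:          the bare identifier, e.g. "xs951nf4814".
def collides_with_druid?(relative_paths, druid)
  relative_paths.any? do |path|
    # Compare only the first segment: a root-level file or folder whose
    # whole name equals the druid. Names that merely contain the druid
    # (xs951nf4814.pdf, xs951nf4814.tif) do not match.
    path.split('/').first == druid
  end
end

collides_with_druid?(['xs951nf4814.pdf', 'file.txt'], 'xs951nf4814')       # => false
collides_with_druid?(['file.txt', 'xs951nf4814/hello.txt'], 'xs951nf4814') # => true (folder)
collides_with_druid?(['file.txt', 'xs951nf4814'], 'xs951nf4814')           # => true (file)
```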

andrewjbtw commented 3 months ago

It's unlikely for someone to name a file just druid with no extension, but if they do it's a problem.

aaron-collier commented 3 months ago

Thanks for the clarification, that helps :-)