jkunze / bagitspec


How to handle empty directories #16

Open nkrabben opened 7 years ago

nkrabben commented 7 years ago

Current BagIt libraries differ in how they handle empty directories added as a payload. For example, with the following sample structure,

    |-directory\
    | |-empty_directory\
    | |-payload_file
    ...

bagit-java (and Bagger) and bagins create the following data directory,

    |-data\
    | |-payload_file
    ...

and bagit-python creates,

    |-data\
    | |-empty_directory\
    | |-payload_file
    ...

Both are valid according to the spec, but I think that dropping the empty directory is not expected behavior. The spec suggests in 2.1.3 creating a zero-length file with the same name as the directory and keep as an extension (pointed out to me by @andrewjbtw), but I haven't seen any bagging tools follow this suggestion. Andrew and I do not see this as an expected behavior.
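
To make the placeholder suggestion concrete, here is a minimal sketch of what a bagging tool could do before building the manifest. It is not taken from any existing library. The function name `add_placeholders` is hypothetical, and the sketch assumes the zero-length file lives inside the empty directory and is named after it with a `.keep` extension, which is only one reading of the suggestion.

```python
import os
from pathlib import Path


def add_placeholders(payload_dir, extension=".keep"):
    """Create a zero-length placeholder in every empty directory.

    Assumption: the placeholder sits inside the empty directory and is
    named "<directory name><extension>". Returns the created paths.
    """
    created = []
    for root, dirs, files in os.walk(payload_dir):
        # A directory with no entries at all is empty.
        if not dirs and not files:
            directory = Path(root)
            placeholder = directory / (directory.name + extension)
            placeholder.touch()  # creates a zero-length file
            created.append(placeholder)
    return created
```

Because the placeholder is an ordinary payload file, it would be listed in the manifest like any other file, so the directory would survive a manifest-driven copy.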

Can the specification be more explicit about how to handle empty directories? I prefer bagit-python's strategy, but realize that this means empty directories have no bearing on the completeness or fixity of a bag.

acdha commented 7 years ago

Since the 02 draft (~July 2008) the spec has clarified that empty directories cannot be stored:

https://tools.ietf.org/html/draft-kunze-bagit-02#section-6

The behaviour you're seeing in bagit-python is probably something we should warn about: the empty directory isn't referenced in the manifest but it's “preserved” during the bag creation process because bagit-python does just an atomic move to the data directory, and it's not reported in validation because that only looks at filenames. We should at the very least start reporting empty directories during validation to avoid surprises if the bag is copied somewhere else using a manifest-driven tool rather than a simple filesystem copy.
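
As a rough illustration of what reporting empty directories could look like, the sketch below walks the payload directory and lists directories that contain nothing. It does not use bagit-python's API. The function name `find_empty_directories` is hypothetical, and it assumes the payload lives under `data/` in the bag root.

```python
import os
from pathlib import Path


def find_empty_directories(bag_dir):
    """Return payload directories with no entries, relative to the bag root.

    Assumption: the payload is stored under "<bag_dir>/data". Only leaf
    directories are reported; a directory holding nothing but empty
    directories is not listed separately.
    """
    bag_dir = Path(bag_dir)
    empty = []
    for root, dirs, files in os.walk(bag_dir / "data"):
        if not dirs and not files:
            empty.append(Path(root).relative_to(bag_dir).as_posix())
    return sorted(empty)
```

A validator could emit one warning per returned path, since none of these directories can be referenced from a manifest.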

nkrabben commented 7 years ago

My issue is two-fold.

  1. Can the spec have stricter language about how empty dirs should be handled, so I can refer to that documentation when creating issues with the various bagging tools?
  2. Is dir_name.keep the strategy that should be required by that stricter language?

justinlittman commented 7 years ago

One consideration is maintaining the compatibility between the manifest format and tools like md5sum and md5deep.

johnscancella commented 7 years ago

@justinlittman I would be curious how many people manually create a bag using md5sum or md5deep today? It would seem that, with so many BagIt tools readily available, anyone who manually creates a bag is responsible for ensuring it is compatible with the BagIt specification.

justinlittman commented 7 years ago

I'd suggest that it is less for bag creation than for bag validation. md5sum/md5deep are readily available on most *nix platforms and allow checking a bag without installing any BagIt software. Furthermore, the existing manifest format is readily recognizable to most technical folks, even if they have no knowledge of BagIt.

I know in the past, I've reached for md5sum/md5deep for quick bag checking.
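
For readers without md5sum at hand, here is a minimal Python sketch of the same kind of check, roughly what `md5sum -c manifest-md5.txt` does when run from the bag root. The function name `check_manifest` is hypothetical, and the sketch assumes each manifest line is a checksum followed by whitespace and a path relative to the bag root.

```python
import hashlib
from pathlib import Path


def check_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Return the manifest paths that are missing or fail their checksum.

    Assumption: each line is "<md5 hex digest> <relative path>".
    """
    bag_dir = Path(bag_dir)
    failures = []
    manifest = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.lstrip("*")  # md5sum's binary-mode marker
        target = bag_dir / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)
    return failures
```

Only files appear in the manifest, so a check like this gives the same result whether or not an empty directory is present in the payload.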

andrewjbtw commented 7 years ago

This adds complexity, but would it be possible to add a separate file in the data directory representing the file/folder tree and then validate that as a separate step?