jkunze / bagitspec

Don't allow multiple incomplete manifests (was: Make explicit that incomplete manifest are permitted) #6

Closed stain closed 5 years ago

stain commented 9 years ago

3.4 says

  1. Every file in every payload manifest MUST be present.
  2. Every payload file MUST be listed in at least one manifest. Payload files MAY be listed in more than one payload manifest.

so does that mean it is perfectly valid to have payload files mentioned in only some manifests? E.g.

manifest-md5.txt

e2b25c29051b9d6f0ae7ef5b12a3251b data/file1.txt

manifest-sha1.txt

24f5c8113eade58f5d5fa1b38fbbe1076ad3408a data/file2.txt

That means it is impossible to validate a bag without going through ALL the manifest-* files, if for nothing else than to check that none of the manifests the consumer cannot checksum against lists a spurious extra file. (At least the file format of the manifest files is given :)
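A minimal sketch of the completeness check under this reading of 3.4, using the two example manifests above. The helper names (`parse_manifest`, `unlisted_payload_files`) are hypothetical, and the parsing is simplified to "checksum, whitespace, path" per line:

```python
def parse_manifest(text):
    """Return {path: checksum} for the text of one manifest-*.txt file."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        checksum, path = line.split(None, 1)
        entries[path.strip()] = checksum
    return entries


def unlisted_payload_files(payload_paths, manifests):
    """Payload files that are listed in none of the given manifests."""
    listed = set()
    for text in manifests.values():
        listed.update(parse_manifest(text))
    return sorted(set(payload_paths) - listed)


# The "split" manifests from the example above.
manifests = {
    "manifest-md5.txt": "e2b25c29051b9d6f0ae7ef5b12a3251b data/file1.txt\n",
    "manifest-sha1.txt": "24f5c8113eade58f5d5fa1b38fbbe1076ad3408a data/file2.txt\n",
}
payload = ["data/file1.txt", "data/file2.txt"]

print(unlisted_payload_files(payload, manifests))  # []
```

Dropping `manifest-sha1.txt` from the input makes `data/file2.txt` show up as unlisted, which is why a consumer has to read every manifest even if it can only verify one algorithm.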

What is the purpose of allowing multiple manifest files? I would assume it would be to allow consumers to pick one they support and validate the content of the bag - rather than for producers that update an existing bag without caring about existing manifest files. If this is the case, then I would hope for all manifests to be complete and list the same payload files (but not necessarily in the same order.)

This applies as well to the tag manifests, if present.

edsu commented 9 years ago

Checking all the manifests present is important since the characteristics of checksum algorithms are not the same. If someone goes through the effort to provide more than one, I think it's correct to require checking them all before saying the bag is valid.

stain commented 9 years ago

Why should it be correct to list one payload file in one incomplete manifest, and another file listed in a second incomplete manifest? So there is not a single manifest that lists the payload? What if I just want to know all the payload files?

If I receive a bagit with manifest-md5.txt and manifest-sha512.txt, but say I'm in Windows 10 and thought it was an achievement just to be able to check md5s (Actually Get-FileHash can do pretty much anything) - yet now I would still have to parse manifest-sha512.txt as well to see if there are any spurious files not mentioned in manifest-md5.txt (which I don't have the implementation to check), just to know if the bag is complete or not and to build a complete file listing that I can present to the user.
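To make the scenario concrete, here is a rough sketch of how a consumer might decide which manifests it can actually verify. It assumes the algorithm name is the part of the filename between `manifest-` and `.txt`, and uses Python's `hashlib` as a stand-in for "what my platform supports"; the function names are hypothetical:

```python
import hashlib
import re


def manifest_algorithm(filename):
    """Extract the algorithm name from a payload manifest filename, or None."""
    match = re.fullmatch(r"manifest-([a-z0-9]+)\.txt", filename)
    return match.group(1) if match else None


def can_verify(filename):
    """True if this interpreter's hashlib offers the manifest's algorithm."""
    algorithm = manifest_algorithm(filename)
    return algorithm is not None and algorithm in hashlib.algorithms_available


for name in ["manifest-md5.txt", "manifest-sha512.txt", "manifest-nosuchalgo.txt"]:
    print(name, can_verify(name))
```

Even when `can_verify` is false for a manifest, its paths still have to be read to build the full file listing. Note that the name match is naive: an algorithm whose filename form differs from its `hashlib` name would be reported as unsupported.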

edsu commented 9 years ago

I'm a little bit confused by your example. What would said implementation do if a bag only contained a manifest-md5.txt?

One reason for providing multiple manifests is that hashing algorithms have different characteristics: speed, collisions, etc. Perhaps a piece of software is doing a quick audit of a bag (randomly picking payload files to check) and md5 is good enough. But when it comes to validating a bag it seems fair to require a BagIt implementation check all the manifests.

I honestly don't remember why creating a union of the manifests is required in order to know what payload files are present in the bag. I wonder if one or more of {@ardvaark, @justinlittman, @andyboyko, @jkunze, @dbrunton} can remember? Perhaps Postel's Law was invoked...

I agree that it would simplify clients that aren't doing validation, and just want a list of the payload files, if they could expect a given manifest to contain a full listing. But at this stage I don't think this is a strong enough reason to potentially invalidate existing bags that don't do this.

justinlittman commented 9 years ago

One of our early uses of BagIt was for National Digital Newspaper Program content. In our environment we were extracting existing fixities from METS which used one algorithm and creating fixities for additional files in another algorithm (our default). Hence the union.

Ardvaark commented 9 years ago

That is to say, we were constructing a bag using metadata from multiple inputs, each with different fixity algorithms, and we wanted to take a use-what-you-have approach that didn't require us to recompute the missing fixities, very much in keeping both with BagIt's minimalist approach and our own inherent laziness as programmers.

Ardvaark commented 9 years ago

And with that said, I vote to close this issue as NTBF, both because it was an intentional design choice, and because it would retroactively invalidate otherwise valid and validatable bags, both imagined and real.

edsu commented 9 years ago

:+1:

stain commented 9 years ago

Ok, so if the consensus is that these kinds of "split" manifests, with some files only in some manifests, are deliberately allowed, then I will close this.

I assume a given BagIt profile can still be stricter about this :)

stain commented 9 years ago

I think I'll Reopen this - I think if split manifests are deliberately allowed, then this should be explicitly mentioned in the specification. That is, something like:

Note: Individual payload manifest files do not necessarily list every payload file, even if the bag is considered complete, which only requires that every file in the payload is mentioned in at least one manifest file. This means that the complete list of payload files can only be found by inspecting all manifest files.

The complete list of payload files of a complete bag can be found by concatenating all manifest-*.txt files and extracting the file path, skipping any duplicates by comparing the byte value of the file paths.

Note that as it is not a requirement for every tag file to be mentioned in the tag manifests, a similar complete list cannot be produced for tag files.
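The procedure in the proposed note could be sketched roughly as follows. This is an illustration only: `payload_paths` is a hypothetical helper, the manifests are passed in as raw bytes so that paths are compared by byte value, and line parsing is simplified to "checksum, whitespace, path":

```python
def payload_paths(manifest_blobs):
    """Union of file paths across manifests, de-duplicated by byte value.

    manifest_blobs is an iterable of bytes objects, one per manifest-*.txt.
    Paths are returned as bytes in first-seen order.
    """
    seen = set()
    ordered = []
    for blob in manifest_blobs:
        for line in blob.splitlines():
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # blank or malformed line
            path = parts[1].strip()
            if path not in seen:
                seen.add(path)
                ordered.append(path)
    return ordered


md5_manifest = b"aaa data/file1.txt\nbbb data/file2.txt\n"
sha1_manifest = b"ccc data/file2.txt\nddd data/file3.txt\n"

print(payload_paths([md5_manifest, sha1_manifest]))
# [b'data/file1.txt', b'data/file2.txt', b'data/file3.txt']
```

Because the comparison is on bytes, two paths that differ only in Unicode normalization are kept as two separate entries.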

...I suggest byte-value comparison rather than Unicode normalization of filenames. This is possible because the bag encoding sets a uniform encoding for all tag files. However, with bag encodings like utf-8 it is possible to have multiple "similar" filenames, like:

e2b25c29051b9d6f0ae7ef5b12a3251b data/première.txt
9051b9d6f0ae7ef5b12a3251be2b25c2 data/première.txt

.. but this should be addressed in a separate issue - some filesystems like OS X will always normalize filenames, therefore a bag like the above would immediately become invalid on unpacking.

Note: This comment does itself suffer from broken Unicode normalization at GitHub, as première.txt above might be rendered with ' over the r rather than the e.
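For readers who cannot see the difference in the rendered text, the following sketch builds a composed (NFC) and a decomposed (NFD) spelling of the same filename with Python's standard `unicodedata` module. The filename is taken from the example above; which form each line of that example actually uses is not something this snippet asserts:

```python
import unicodedata

# Composed form: "è" is the single code point U+00E8.
nfc = unicodedata.normalize("NFC", "data/premi\u00e8re.txt")
# Decomposed form: "e" followed by the combining grave accent U+0300.
nfd = unicodedata.normalize("NFD", nfc)

print(nfc == nfd)                                   # False
print(nfc.encode("utf-8") == nfd.encode("utf-8"))   # False
print(unicodedata.normalize("NFC", nfd) == nfc)     # True
```

A byte-value comparison treats the two as different paths, while a filesystem that normalizes names would collapse them into one file.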

edsu commented 9 years ago

I thought it was explicitly mentioned in the specification. That's how you learned about it!

stain commented 9 years ago

It is just implicit:

complete A bag which comprises all elements required by this specification, with all files listed in all payload and tag manifests present, all payload files present listed in at least one manifest. See Section 3.

  1. Every file in every payload manifest MUST be present.
  2. Every payload file MUST be listed in at least one manifest. Payload files MAY be listed in more than one payload manifest.

You have to be more of a logician to see from this that "Aha, so that means I am allowed to NOT list a file in SOME of the manifests."

If you simply want to consume or parse BagIt, then you could quickly misread this as saying that all payload files must be listed and that there should be at least one manifest (misread as "oh, so they MUST be listed there then"), and then simply gloss over the fact that this implicitly permits any or all of the manifests to be individually incomplete.

The last sentence, "Payload files MAY be listed in more than one payload manifest" can be read in two ways:

johnscancella commented 7 years ago

@stain with the new proposal it does indeed require that ALL payload files be listed in ALL manifests https://github.com/jkunze/bagitspec/pull/17

justinlittman commented 7 years ago

Hey @johnscancella - Are you sure the NDNP bags won't be broken by requiring ALL payload files be listed on ALL manifests?

johnscancella commented 7 years ago

@justinlittman yes, because they are an old version and will follow that specification. But going forward you will have to list all files in all manifests.

stain commented 6 years ago

I think my original issue will be solved by #19 having "Every payload manifest MUST list every payload file exactly once." which I am quite pleased with. As for backwards compatibility this will presumably be BagIt-Version: 1.0 or so?

I'll close this when merged.

acdha commented 6 years ago

@stain That's my advice to implementers and the test case in our compliance suite has it as invalid only for 1.0: https://github.com/LibraryOfCongress/bagit-conformance-suite/tree/master/v1.0/invalid/notAllManifestsListAllFiles

stain commented 5 years ago

Closing as implemented in https://tools.ietf.org/html/rfc8493#section-2.1.3

  • Every payload manifest MUST list every payload file name exactly once
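As an illustration of what the quoted rule asks of a validator, here is a minimal sketch that checks one manifest against the set of payload files. The function name `manifest_problems` and the returned dictionary layout are assumptions made for this example, and line parsing is again simplified to "checksum, whitespace, path":

```python
from collections import Counter


def manifest_problems(manifest_text, payload_paths):
    """Report how one payload manifest deviates from 'every file exactly once'."""
    counts = Counter()
    for line in manifest_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            counts[parts[1].strip()] += 1
    payload = set(payload_paths)
    return {
        "missing": sorted(payload - set(counts)),
        "duplicated": sorted(p for p, n in counts.items() if n > 1),
        "not_in_payload": sorted(set(counts) - payload),
    }


payload = ["data/file1.txt", "data/file2.txt"]
split_md5 = "e2b25c29051b9d6f0ae7ef5b12a3251b data/file1.txt\n"

print(manifest_problems(split_md5, payload))
# {'missing': ['data/file2.txt'], 'duplicated': [], 'not_in_payload': []}
```

Under this rule the "split" manifests from the start of the thread each fail on their own, whereas under the older union reading the pair was acceptable together.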