jkunze / bagitspec


Are files from fetch.txt part of the payload? #8

Open stain opened 9 years ago

stain commented 9 years ago

It is unclear from the spec whether files that fetch.txt places under the data/ directory must be listed in the manifest-* files. Because fetch.txt permits "-" for an unknown file size, my first interpretation is "no". But in that case, completing the bag by downloading the files listed in fetch.txt would turn a valid, incomplete bag into an invalid one, which is a bit odd.

It is also unclear whether file sizes from fetch.txt should be included when calculating bag-info.txt properties such as Bag-Size (the size of the bag being transferred) and Payload-Oxum. Are "payload files" only the files that actually exist within the data/ folder, or do they also include the payload files listed in fetch.txt?
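To make the ambiguity concrete, here is a rough Python sketch that computes Payload-Oxum under both readings. It assumes fetch.txt lines have the form `URL LENGTH FILENAME` and that Payload-Oxum is `OctetCount.StreamCount`; the helper names `parse_fetch` and `payload_oxum` are made up for illustration and are not part of any BagIt library.

```python
def parse_fetch(fetch_text):
    """Parse fetch.txt lines of the assumed form 'URL LENGTH FILENAME'."""
    entries = []
    for line in fetch_text.splitlines():
        line = line.strip()
        if not line:
            continue
        url, length, filename = line.split(None, 2)
        # "-" means the length is unknown
        entries.append((url, None if length == "-" else int(length), filename))
    return entries


def payload_oxum(local_sizes, fetch_entries=()):
    """Return 'OctetCount.StreamCount', or None if it cannot be determined.

    local_sizes maps payload paths (e.g. 'data/a.txt') to sizes in bytes.
    Passing fetch_entries selects the reading in which files still to be
    fetched count as payload.
    """
    octets = sum(local_sizes.values())
    streams = len(local_sizes)
    for _url, length, filename in fetch_entries:
        if not filename.startswith("data/") or filename in local_sizes:
            continue  # not a payload file, or already present locally
        if length is None:
            return None  # unknown length: the inclusive Oxum is undefined
        octets += length
        streams += 1
    return f"{octets}.{streams}"
```

With one local 10-byte file and one 90-byte fetch entry, the first reading gives `10.1` and the second `100.2`. An entry with length `-` leaves the second reading undetermined until the file has been downloaded.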

I understand fetch.txt files may also be tagfiles - but the spec allows for tagfiles to not be listed in the tag manifests - so this question is only relevant for fetch files to data/.

Ardvaark commented 9 years ago

The raison d'être of fetch.txt is to complete the bag. By implication, then, the files listed in fetch.txt must appear in one of the manifests and be counted in the Payload-Oxum for the bag to ever have a chance of validating.
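The implication above could be checked mechanically. The following is only a sketch, assuming manifest lines of the form `CHECKSUM FILEPATH` and fetch.txt lines of the form `URL LENGTH FILENAME`; the function names are hypothetical and not taken from any existing implementation.

```python
def manifest_paths(manifest_text):
    """Collect the file paths listed in a payload manifest."""
    paths = set()
    for line in manifest_text.splitlines():
        line = line.strip()
        if line:
            _checksum, path = line.split(None, 1)
            paths.add(path)
    return paths


def unlisted_fetch_files(fetch_text, manifest_text):
    """Return fetch.txt payload entries that no manifest line covers."""
    listed = manifest_paths(manifest_text)
    missing = []
    for line in fetch_text.splitlines():
        line = line.strip()
        if not line:
            continue
        _url, _length, filename = line.split(None, 2)
        # Only payload files (under data/) are expected in the manifest
        if filename.startswith("data/") and filename not in listed:
            missing.append(filename)
    return missing
```

Under this reading, a non-empty result means the bag can never validate after fetching, because the downloaded files would be payload files without a manifest entry.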

stain commented 9 years ago

OK, can we make this explicit? Using the Python library, I was able to make a 'complete' bag that had many fetch.txt entries for data/ filenames that were missing, simply by listing those files in neither the manifest nor the Payload-Oxum.

stain commented 8 years ago

If a file is listed in fetch.txt, under the data/ payload folder, but is not listed in any of the manifests, then the implication is that this file can be ignored, as the bag would be complete without it.

For tag files this is different, as tag manifests are not required to be present, and even if they are present, they are not required to be complete.

mikedarcy commented 8 years ago

I second this and related issues concerning handling of remote files.

Unfortunately, this lack of explicit specification around the requirements and usage of fetch.txt results in some confounding implementation issues when trying to incorporate remote files into bags.

I cannot find a single BagIt implementation that correctly handles the programmatic addition of non-local payload files to the manifest(s) during bag creation, which results in tedious (and in some cases impossible) workarounds for users who wish to create bags with remote file components and also have those bags be compliant with the specification.

Currently, the only way to create spec-compliant bags with the existing implementations is to first have every remote file present in the local filesystem payload directory where the bag is being created (in order to properly create the manifest and Payload-Oxum), and then delete those files from the payload directory without updating the manifest. This is not practical when dealing with large or otherwise non-relocatable remote files.
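One way around the copy-then-delete workaround might be to compute the manifest and fetch.txt entries from a byte stream, so the remote file never has to sit in the payload directory. The sketch below uses only the Python standard library; `describe_stream` and `remote_entries` are hypothetical names, and the line formats (`CHECKSUM FILEPATH` for the manifest, `URL LENGTH FILENAME` for fetch.txt, `OctetCount.StreamCount` for Payload-Oxum) are assumptions about the bag layout rather than the behaviour of any existing implementation.

```python
import hashlib
import io


def describe_stream(stream, algorithm="sha256", chunk_size=65536):
    """Read a binary stream once, returning (hex checksum, size in bytes)."""
    digest = hashlib.new(algorithm)
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        size += len(chunk)
    return digest.hexdigest(), size


def remote_entries(remote_files, algorithm="sha256"):
    """Build manifest lines, fetch.txt lines and an Oxum for remote files.

    remote_files maps a payload path to a (url, binary stream) pair.
    The returned Oxum covers only these remote files.
    """
    manifest_lines, fetch_lines = [], []
    octets = 0
    for path, (url, stream) in remote_files.items():
        checksum, size = describe_stream(stream, algorithm)
        manifest_lines.append(f"{checksum}  {path}")
        fetch_lines.append(f"{url} {size} {path}")
        octets += size
    return manifest_lines, fetch_lines, f"{octets}.{len(remote_files)}"


# Example with an in-memory stream standing in for a remote download
manifest, fetch, oxum = remote_entries(
    {"data/big.bin": ("http://example.org/big.bin", io.BytesIO(b"hello"))}
)
```

In practice the stream could be any object with a `read()` method, for example the response returned by `urllib.request.urlopen`. The file still has to be read once to obtain its checksum, but nothing is written to the local payload directory.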

Getting things clarified in the spec could go a long way to getting the various implementations updated. I have forked http://libraryofcongress.github.io/bagit-python and am in the process of modifying it to be more "fetch-friendly", but would much rather see this addressed in a larger scope.

@Ardvaark, @jkunze: Could we please get some additional comments on this and all fetch-related issues filed by @stain?

Also see the first comment in the thread here: https://groups.google.com/forum/#!topic/digital-curation/oiZvGqhs80E in which I think @stain explains the problems with the programmatic use of fetch.txt pretty clearly.