Open stain opened 9 years ago
A fetch.txt raison d'être is to complete the bag. By implication, then, the files in the fetch.txt must be in one of the manifests, and be computed as part of the Payload-Oxum in order for the bag to ever have a chance of validating.
OK, can we make this explicit? I was able to make a 'complete' bag, according to the Python library, that had many entries in fetch.txt for data/ filenames which were missing, simply by not listing those files in the manifest or counting them in the Payload-Oxum.
If a file is listed in fetch.txt, under the data/ payload folder, but is not listed in any of the manifests, then the implication is that this file can be ignored, as the bag would be complete without it.
For tag files this is different, as tag manifests are not required to be present, and even if they are present, they are not required to be complete.
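
The payload rule discussed above can be checked mechanically. The following is a minimal sketch, not part of any existing BagIt library: the helper names `parse_fetch`, `parse_manifest_paths` and `unlisted_fetch_payload` are hypothetical, and it assumes fetch.txt lines of the form "url length path" and manifest lines of the form "checksum path", both whitespace-separated. It reports fetch.txt entries under data/ that no payload manifest lists.

```python
def parse_fetch(text):
    """Parse fetch.txt content into (url, length, path) tuples.

    length is None when the file gives "-" (size unknown).
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        url, length, path = line.split(None, 2)
        entries.append((url, None if length == "-" else int(length), path.strip()))
    return entries


def parse_manifest_paths(text):
    """Return the set of file paths listed in one manifest's content."""
    paths = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        _checksum, path = line.split(None, 1)
        paths.add(path.strip())
    return paths


def unlisted_fetch_payload(fetch_text, manifest_texts):
    """List fetch.txt payload paths (under data/) absent from every manifest."""
    listed = set()
    for manifest_text in manifest_texts:
        listed |= parse_manifest_paths(manifest_text)
    return [
        path
        for _url, _length, path in parse_fetch(fetch_text)
        if path.startswith("data/") and path not in listed
    ]
```

An empty result would mean every remote payload file is covered by a manifest; whether a non-empty result should make the bag invalid, or merely mark those entries as ignorable, is exactly the point this issue asks the spec to settle.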
I second this and related issues concerning handling of remote files.
Unfortunately, this lack of explicit specification around the requirements and usage of fetch.txt results in some confounding implementation issues when trying to incorporate remote files into bags.
I cannot find a single BagIt implementation that correctly handles the programmatic addition of non-local payload files to the manifest(s) during bag creation, which results in tedious (and in some cases impossible) workarounds for users who wish to create bags with remote file components and also have those bags be compliant with the specification.
Currently, the only way to create spec-compliant bags with the existing implementations is to first have every remote file present in the local filesystem payload directory where the bag is being created (in order to properly create the manifest and Payload-Oxum), and then delete those files from the payload directory without updating the manifest. This is not practical when dealing with large or otherwise non-relocatable remote files.
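
One way around the copy-then-delete workaround is to hash the remote bytes as a stream, so the file is read once but never written into the payload directory. The sketch below is only an illustration under that assumption and is not how any existing implementation works; `hash_stream` and `remote_entry` are hypothetical names, and sha256 is an arbitrary choice of algorithm.

```python
import hashlib
import io


def hash_stream(stream, algorithm="sha256", chunk_size=1 << 20):
    """Hash a binary file-like object in chunks; return (hex digest, byte count)."""
    digest = hashlib.new(algorithm)
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total


def remote_entry(stream, url, bag_path, algorithm="sha256"):
    """Build the manifest line and fetch.txt line for one remote payload file.

    Also returns the size so the caller can add it to the Payload-Oxum.
    """
    checksum, size = hash_stream(stream, algorithm)
    manifest_line = f"{checksum}  {bag_path}"
    fetch_line = f"{url} {size} {bag_path}"
    return manifest_line, fetch_line, size


if __name__ == "__main__":
    # Stand-in for a remote response body; example.org is a placeholder URL.
    body = io.BytesIO(b"hello")
    print(remote_entry(body, "http://example.org/hello.txt", "data/hello.txt"))
```

In practice the stream could be the response object returned by `urllib.request.urlopen`, which supports `read()`. If a trusted checksum and size are already published for the remote file, even the single read could be skipped, at the cost of relying on that metadata.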
Getting things clarified in the spec could go a long way to getting the various implementations updated. I have forked http://libraryofcongress.github.io/bagit-python and am in the process of modifying it to be more "fetch-friendly", but would much rather see this addressed in a larger scope.
@Ardvaark, @jkunze: Could we please get some additional comments on this and all fetch-related issues filed by @stain?
Also see the first comment in the thread here: https://groups.google.com/forum/#!topic/digital-curation/oiZvGqhs80E in which I think @stain explains the problems with the programmatic use of fetch.txt pretty clearly.
It is unclear from the spec if files in fetch.txt destined for the data/ directory must be included in the manifest-* files or not. As fetch.txt permits "-" for an undefined file size, my first interpretation is "no", but that means that if you then try to complete the bag by downloading from fetch.txt, the bag would go from valid and incomplete to invalid, which is a bit odd.
It is unclear if file sizes from fetch.txt should be included in the calculations of bag-info.txt properties like Bag-Size (bag being transferred) and Payload-Oxum. Are "Payload files" only the files that actually exist within the data/ folder, or does that include the fetch.txt payload files?
I understand fetch.txt files may also be tag files, but the spec allows for tag files to not be listed in the tag manifests, so this question is only relevant for fetch files destined for data/.
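
To make the Payload-Oxum question concrete, here is a small sketch of one possible interpretation, namely that remote payload files count toward the total. This is an assumption for illustration, not something the spec states, and `payload_oxum` is a hypothetical helper. It also shows why a "-" length in fetch.txt is a problem under that reading: the octet count cannot be computed.

```python
def payload_oxum(local_sizes, fetch_lengths):
    """Compute "OctetCount.StreamCount" over local and remote payload files.

    local_sizes: sizes in bytes of files present under data/.
    fetch_lengths: lengths from fetch.txt, with None standing for "-".
    """
    if any(length is None for length in fetch_lengths):
        raise ValueError("a fetch.txt entry has unknown length ('-')")
    sizes = list(local_sizes) + list(fetch_lengths)
    return f"{sum(sizes)}.{len(sizes)}"
```

Under the opposite interpretation (only files physically present in data/ count), the same function would be called with an empty `fetch_lengths`, and the value would change once the fetch is completed.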