jkunze / bagitspec


Multiple URLs in fetch.txt for same file? #9

Open stain opened 9 years ago

stain commented 9 years ago

In one use-case for handling Big Data (tm) with BagIt, we've been discussing whether it's valid to list the same file multiple times in fetch.txt, each time from a different location, e.g.:

fetch.txt

http://www.example.com/bigfile.txt 1099511627776 /data/bigfile.txt
http://cdn.example.org/bigfile.txt 1099511627776 /data/bigfile.txt
ftp://ftp.example.com/pub/bigfile.txt 1099511627776 /data/bigfile.txt
gsiftp://grid.example.com/store5/bigfile.txt 1099511627776 /data/bigfile.txt
magnet:?xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C 1099511627776 /data/bigfile.txt

Reading the spec I don't see how this is invalid (except perhaps the magnet link not being a URL, just a URI).

I think this could be quite powerful - should this be explicitly permitted? Obviously the choice of which one to use would have to be down to the client, falling back to top-first or something.
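To illustrate the client-side choice described above, here is a minimal Python sketch that groups fetch.txt entries by target path, keeping the URLs in file order so that a "top-first" fallback is possible. The function name `group_fetch_entries` is hypothetical and not part of any BagIt library; the sketch assumes each line has the three whitespace-separated fields shown in the example (URL, length, path).

```python
def group_fetch_entries(fetch_text):
    """Map each target path to the list of URLs listed for it, in file order.

    Hypothetical helper for illustration; assumes "URL LENGTH PATH" lines.
    """
    grouped = {}
    for line in fetch_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice so a path containing spaces stays intact.
        url, _length, path = line.split(None, 2)
        grouped.setdefault(path, []).append(url)
    return grouped
```

A client could then walk each path's list once, rather than treating every line as a separate download.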

Ardvaark commented 9 years ago

Interesting use case for a sort of fallback transfer system: "Try here, then here, then here..."

Still, though, the spec ought not to say anything at all about how to handle this situation, other than that it should be allowed. Implementors are then free to choose any sort of handling of the fetch.txt they'd like, including such things as "in-order, first wins" or "random" or "error".

stain commented 9 years ago

Yeah, I would hope for the spec to explicitly mention this - so a naive client doesn't download the same 20 GB multiple times.

The current spec permits URLs, but does not reference what specification of URL you mean - so although all the examples are with http:// presumably any kind of URL scheme could be used.

stain commented 6 years ago

So I hope we can just explicitly allow multiple entries for the same path? Assuming that the path is listed with a checksum in a manifest, it should not really matter where the file is fetched from.
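The checksum argument can be sketched as follows: try each listed source in order and accept the first one whose content matches the manifest digest. This is only an illustration under stated assumptions. `fetch_first_valid` and the injected `download` callable are hypothetical names, SHA-256 stands in for whichever manifest algorithm the bag uses, and a real client handling files of this size would stream to disk instead of holding the bytes in memory.

```python
import hashlib


def fetch_first_valid(urls, expected_sha256, download):
    """Return (url, data) for the first source matching the manifest digest.

    `download` is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the file content as bytes.
    """
    for url in urls:
        try:
            data = download(url)
        except Exception:
            # Source unreachable or unsupported: fall through to the next one.
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return url, data
    raise RuntimeError("no listed source produced the expected checksum")
```

Because the manifest decides what counts as a valid copy, the order of the entries only affects which source is tried first, not the result.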

paulmillar commented 1 year ago

Perhaps to reanimate this discussion, I can add some comments here.

First, adding support for multiple URLs is something I am interested in. I would have opened an issue on this topic if this one did not already exist.

The particular use-case I have in mind isn't to support error-recovery (try this, then try that), but rather because our community is using the Globus transfer service. This provides optimised transfers, but requires that the destination install some Globus-specific software. This is freely available; nevertheless, it presents a barrier (and isn't always practical). Therefore, it would be helpful if the fetch.txt file could contain two links per file: one is a Globus URL and the other is a generic HTTP link. We should be able to use the scheme to distinguish between them (i.e., "https://" vs "globus://").
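Distinguishing sources by scheme could look roughly like the Python sketch below, which keeps only the URLs whose scheme the client supports and orders them by the client's preference. The function name `order_by_scheme` and the exact form of the "globus://" URL in the usage example are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def order_by_scheme(urls, preferred_schemes):
    """Keep URLs with a supported scheme, most preferred scheme first.

    Hypothetical helper; `preferred_schemes` lists lowercase schemes in
    descending order of preference, e.g. ["globus", "https"].
    """
    rank = {scheme: i for i, scheme in enumerate(preferred_schemes)}
    supported = [u for u in urls if urlsplit(u).scheme in rank]
    # sorted() is stable, so file order is kept among URLs sharing a scheme.
    return sorted(supported, key=lambda u: rank[urlsplit(u).scheme])


urls = [
    "globus://endpoint.example.org/bigfile.txt",  # assumed URL form
    "https://www.example.com/bigfile.txt",
]
without_globus = order_by_scheme(urls, ["https"])
with_globus = order_by_scheme(urls, ["globus", "https"])
```

A destination without the Globus software would pass only "https" and never see the Globus entry, while one with it installed would try the Globus link first.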

Second, Metalink, as defined in RFC 5854, is a standard format for describing how to download a collection of files. Therefore, I believe there is some overlap between Metalink and fetch.txt. Metalink allows each file to have multiple URLs. It also includes per-URL attributes such as physical location (country) and a numerical priority (a recommended "try" order).

I'm not mentioning Metalink as something BagIt should adopt, but rather as a possible source of inspiration when updating BagIt to support multiple URLs.

HTH, Paul.