Winterhuman opened this issue 2 years ago
Right now, ipfs pin add is actually faster than ipfs get at fetching data
To be more precise, ipfs pin add may be faster at downloading an entire DAG than ipfs get is. However, if you started asking about the time to get the first X bytes of a UnixFS file you might start to see different results, since ipfs get tries to get the data for you largely sequentially (although IIRC it also does some limited amount of prefetching) instead of just asking for "all of it".
You may want to experiment locally with adjusting this constant to see how it performs for you, and whether it's just a matter of how much we're willing to prefetch: https://github.com/ipfs/go-ipld-format/blob/d2e09424ddee0d7e696d01143318d32d0fb1ae63/navipld.go#L85.
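For anyone who wants a feel for what that constant controls without patching go-ipld-format, here's a rough standalone sketch of the idea: emit children in order while keeping a bounded window of prefetches in flight ahead of the reader. Everything in it (fetchChild, the node IDs, the value 10) is hypothetical and purely illustrative, not the actual navipld.go code.

```go
package main

import "fmt"

// preloadSize caps how many child fetches run ahead of the reader.
// (Hypothetical stand-in for the constant linked above.)
const preloadSize = 10

// fetchChild is a hypothetical blocking fetch of one child block.
func fetchChild(id int) string {
	return fmt.Sprintf("block-%d", id)
}

// walkChildren emits children strictly in order, but keeps up to
// preloadSize fetches in flight ahead of the current read position.
func walkChildren(childIDs []int, emit func(string)) {
	results := make([]chan string, len(childIDs))
	for i := range results {
		results[i] = make(chan string, 1)
	}
	launch := func(i int) {
		go func() { results[i] <- fetchChild(childIDs[i]) }()
	}

	// Fill the initial prefetch window.
	for i := 0; i < len(childIDs) && i < preloadSize; i++ {
		launch(i)
	}
	for i := range childIDs {
		emit(<-results[i]) // consume in order
		// Slide the window forward by one.
		if next := i + preloadSize; next < len(childIDs) {
			launch(next)
		}
	}
}

func main() {
	ids := []int{0, 1, 2, 3, 4, 5}
	walkChildren(ids, func(b string) { fmt.Println("got", b) })
}
```

Making the window larger trades memory and bandwidth for throughput, which is roughly the knob being discussed here.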
IIUC some of the history here relates to how garbage collection could/should work, which is that the contract on pinning is that your data won't get deleted, but there is no such contract on ipfs get. This means that if you run ipfs get while already at your GC limit, you could run into problems with having your blocks GC'd while you're downloading. Because ipfs get pipes data out sequentially (with some prefetching), the risks around GC are much lower. An extreme version of this question: if you do ipfs get <1TB-file> and you have GC enabled with a 10GB repo cap, should your repo grow to 1TB? Probably not.
Ah I see, thanks for explaining! I do think this is still a UX problem though: people won't think to use ipfs pin add to fetch large amounts of content; most will use ipfs get since it's the most intuitive sounding. Perhaps some mentions in the ipfs get --help and ipfs pin add --help pages would help.
Also, maybe the preloadSize option should be exposed so you can do ipfs get --preload 20 QmFoo... ?
Also, maybe the preloadSize option should be exposed so you can do ipfs get --preload 20 QmFoo... ? @Winterhuman
Interesting idea as a band-aid measure.
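To illustrate the shape of that suggestion (all names hypothetical, using Go's standard flag package purely for illustration rather than the project's actual CLI framework): the flag would simply override the prefetch window the DAG walker uses.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical: `ipfs get --preload 20 QmFoo...` would let the user pick
	// the prefetch window instead of relying on the hard-coded constant.
	preload := flag.Int("preload", 10, "number of child nodes to prefetch ahead of the reader")
	flag.Parse()

	fmt.Printf("would fetch %s with a prefetch window of %d\n", flag.Arg(0), *preload)
	// A real implementation would pass *preload down to the traversal code
	// (see the windowed-prefetch sketch earlier in the thread).
}
```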
To be more precise, ipfs pin add may be faster at downloading an entire DAG than ipfs get is. However, if you started asking about the time to get the first X bytes of a UnixFS file you might start to see different results, since ipfs get tries to get the data for you largely sequentially (although IIRC it also does some limited amount of prefetching) instead of just asking for "all of it". @aschmahmann
I do not really understand this. Is this meant to explain the current implementation details, or user expectations?
Personally, my user expectation is that ipfs cat is for this kind of partial download, while ipfs get is for full downloads.
Also, I don't see how you would communicate a partial download to the get API, with a cancel?
IIUC some of the history here relates to how garbage collection could/should work, which is that the contract on pinning is that your data won't get deleted, but there is no such contract on ipfs get. This means that if you run ipfs get while already at your GC limit, you could run into problems with having your blocks GC'd while you're downloading. Because ipfs get pipes data out sequentially (with some prefetching), the risks around GC are much lower. An extreme version of this question: if you do ipfs get <1TB-file> and you have GC enabled with a 10GB repo cap, should your repo grow to 1TB? Probably not. @aschmahmann
Fair, however this is an unrelated side effect of using ipfs pin; this could be solved by renaming this issue to "implement parallel dag traversal in ipfs get".
The difference is maybe better described as buffered vs non-buffered. You can make cat or get buffered by piping it to anything like less, dd, etc.
Get doesn't necessarily mean display, though, whereas cat is more interactive and thus implicit. If you're cat'ing a file you are likely (or at least should be) expecting data in chunks. The default read() blocksize in Linux at least appears to be 65k.
while ipfs get is for full downloads.
ipfs get doesn't output to your filesystem; it sends responses over the HTTP API from the daemon to the client. In order for you to get purely parallel dag traversal here, it seems like you'd end up needing to either a) buffer all of the data, or b) have the daemon send back data in some format other than tar that allows for sending the data back out-of-order, such that the client can write arbitrary blocks to disk. @Jorropo are you recommending one of these options or did you have something else in mind?
Fair, however this is an unrelated side effect of using ipfs pin; this could be solved by renaming this issue to "implement parallel dag traversal in ipfs get".
It's sort of related in that if you wanted to go the route of buffering all of the data you start having to answer questions about where to buffer the data that aren't really necessary when pinning. For example: are you going to buffer in memory and risk OOM, are you going to buffer on disk and allow all your data to be GC'd in the middle (so you'd have to redownload it), are you going to buffer the data on disk but protect it from GC even though the user hasn't explicitly asked for it to be pinned, etc.?
What about protecting ipfs get data from GC, waiting until the amount of data reaches the GC threshold, and then pausing the fetch and asking the user "CID has exceeded the GC threshold, do you want to continue? (y/n)" before either continuing or cancelling (with some sort of --exceed-gc option to say yes beforehand)?
And what if you're not in an interactive context, @Winterhuman? I.e. a script.
That's what the --exceed-gc option is for, to assume the answer is "yes" ahead of time ("with some sort of --exceed-gc option to say yes beforehand").
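A minimal sketch of how that flow could look (the flag name and prompt wording are taken from the suggestion above; everything else is hypothetical), including the non-interactive case raised above:

```go
package main

import (
	"bufio"
	"flag"
	"fmt"
	"os"
	"strings"

	"golang.org/x/term"
)

func main() {
	exceedGC := flag.Bool("exceed-gc", false, "continue past the GC threshold without prompting")
	flag.Parse()

	const gcThreshold = 10 << 30          // hypothetical 10GB repo cap
	fetched := int64(gcThreshold) + 1 // pretend the fetch just crossed the cap

	if fetched > gcThreshold && !*exceedGC {
		if !term.IsTerminal(int(os.Stdin.Fd())) {
			// Non-interactive context (e.g. a script): fail instead of hanging on a prompt.
			fmt.Fprintln(os.Stderr, "CID has exceeded the GC threshold; rerun with --exceed-gc to continue")
			os.Exit(1)
		}
		fmt.Print("CID has exceeded the GC threshold, do you want to continue? (y/n) ")
		answer, _ := bufio.NewReader(os.Stdin).ReadString('\n')
		if strings.TrimSpace(answer) != "y" {
			os.Exit(1)
		}
	}
	fmt.Println("continuing the fetch...")
}
```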
@Jorropo are you recommending one of these options or did you have something else in mind?
@aschmahmann if I had to implement it myself, ipfs get would call ipfs dag export under the hood and rebuild the directory structure in the client process. Obviously ipfs dag export would need to have parallel traversal implemented there too, but this sounds far easier to do.
Any protocol that supports out-of-order DAGs could be used; it doesn't need to be CARs, it's just that we already have a format that does it, so we shouldn't try to reinvent the wheel.
I get that this approach is not at all how the CLI framework is planned to work, but I think it's valid to deviate from those rules in this case.
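For concreteness, a rough sketch of the client side of that idea: stream the CAR produced by ipfs dag export and collect the blocks by CID, leaving the UnixFS reassembly into files and directories as a later step. This assumes the github.com/ipld/go-car (v0) reader API and just shells out to the ipfs binary; the CID is a placeholder and none of this is how the CLI framework is currently wired.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os/exec"

	car "github.com/ipld/go-car"
)

func main() {
	// Ask the daemon for the whole DAG as a CAR stream; blocks may arrive
	// in any order, which is what makes parallel traversal possible.
	cmd := exec.Command("ipfs", "dag", "export", "QmFoo") // placeholder CID
	out, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	cr, err := car.NewCarReader(out)
	if err != nil {
		log.Fatal(err)
	}

	// Index the out-of-order blocks by CID as they arrive.
	blocksByCID := map[string][]byte{}
	for {
		blk, err := cr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		blocksByCID[blk.Cid().String()] = blk.RawData()
	}
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}

	// Rebuilding the directory structure (UnixFS decoding, writing files to
	// disk) would happen here in a real implementation.
	fmt.Printf("received %d blocks for roots %v\n", len(blocksByCID), cr.Header.Roots)
}
```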
It's sort of related in that if you wanted to go the route of buffering all of the data you start having to answer questions about where to buffer the data that aren't really necessary when pinning. For example: are you going to buffer in memory and risk OOM, are you going to buffer on disk and allow all your data to be GC'd in the middle (so you'd have to redownload it), are you going to buffer the data on disk but protect it from GC even though the user hasn't explicitly asked for it to be pinned, etc.?
I don't see why. The only things you need to keep alive are your current heads; once you have fully listed through something, you can forget about that block (or at least not touch it anymore, so a future GC can remove it).
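In other words, something like a depth-first walk where only the current path is protected, roughly like the sketch below (protect/unprotect and the node type are hypothetical stand-ins for whatever GC-protection and block-fetching primitives the node exposes):

```go
package main

import "fmt"

type node struct {
	id       string
	children []*node
}

// Hypothetical GC-protection primitives.
func protect(n *node)   { fmt.Println("protect", n.id) }
func unprotect(n *node) { fmt.Println("unprotect", n.id) } // block may be GC'd after this

// traverse walks the DAG depth-first. At any moment only the nodes on the
// current path (the "heads") are protected; everything already listed
// through has been released and is fair game for a future GC.
func traverse(n *node, visit func(*node)) {
	protect(n)
	visit(n)
	for _, c := range n.children {
		traverse(c, visit)
	}
	unprotect(n)
}

func main() {
	root := &node{id: "root", children: []*node{
		{id: "leaf1"}, {id: "leaf2"},
	}}
	traverse(root, func(n *node) { fmt.Println("visit", n.id) })
}
```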
Description
Right now, ipfs pin add is actually faster than ipfs get at fetching data, since ipfs pin add does concurrent dag traversals and preheating. Ideally, ipfs get should also use this same method so that both are equally fast; it's not very intuitive that "pinning" is faster than "getting", and many users probably don't know about the performance difference.
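For context, a very rough sketch of the difference being described: pin-style fetching can request every discovered child with a bounded pool of concurrent fetches, without caring about the order blocks come back in, whereas get streams mostly in order. fetchChildren and the node IDs are hypothetical stand-ins for Bitswap fetches of IPLD nodes.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchChildren is a hypothetical blocking fetch returning a node's links.
func fetchChildren(id string) []string {
	demo := map[string][]string{
		"root": {"a", "b"},
		"a":    {"a1", "a2"},
	}
	return demo[id]
}

// fetchDAG walks the DAG with up to `workers` fetches in flight, in whatever
// order blocks happen to arrive -- roughly the pin-add style of traversal.
func fetchDAG(root string, workers int) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, workers) // bounds concurrent fetches

	var walk func(id string)
	walk = func(id string) {
		defer wg.Done()
		sem <- struct{}{} // take a worker slot
		children := fetchChildren(id)
		<-sem // release it before recursing so deep DAGs can't deadlock
		fmt.Println("fetched", id)
		for _, c := range children {
			wg.Add(1)
			go walk(c)
		}
	}

	wg.Add(1)
	go walk(root)
	wg.Wait()
}

func main() {
	fetchDAG("root", 8)
}
```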