Open lidel opened 2 years ago
Potential fix would be to do DHT lookup not only for a specific sub-block in a file, but also for the first UnixFS root block above them (either a root of a file, or a parent directory). Rationale being, if someone has the root of a file, they most likely have the rest.
This seems reasonable. (I was actually writing this before I had fully read your message.)
The biggest issue with this is that we unofficially create a special case that treats UnixFS files as single independent entities, and that makes it harder to create new interesting cross-file features in the future.
Two options that I would like to have would break under such a scheme:
Content based chunking.
Let's assume I add a `.car` archive to IPFS (you might think "that's dumb, just add the blocks", but no, this is meant for extra support: my pinning service doesn't support a fancy DAG format that I want to use). So I make a `.car` archive of my blocks and chunk it so that the chunks line up exactly with the block blobs in the `.car`, using raw leaves. Then, when I want to download it, I use multihash-addressed requests (`v0.12.0` blockstore update).
So the downloader thinks it is downloading a `dag-turbo-3000` object, the pinning node thinks it serves `dag-unixfs` -> `raw-leaf`, but both agree because in the end their hashes match.
However, with this, the pinning service would announce what it thinks is the true root (the root of the `.car`) while the downloader would search for the root of the `dag-turbo-3000` (which the pinning service does have, it just thinks it's a boring `raw-leaf`).
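To make the "both agree because their hashes match" point concrete, here is a minimal sketch that builds two CIDv1 strings over the same bytes with different codecs. It assumes sha2-256 multihashes and uses `dag-cbor` as a stand-in for the hypothetical `dag-turbo-3000` codec; the helper names are made up for illustration and this is not how go-ipfs implements it.

```python
import base64
import hashlib

RAW = 0x55       # multicodec code for raw
DAG_CBOR = 0x71  # multicodec code for dag-cbor (stand-in for the hypothetical dag-turbo-3000)
SHA2_256 = 0x12  # multihash code for sha2-256


def varint(n: int) -> bytes:
    """Encode an unsigned integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def multihash(data: bytes) -> bytes:
    """sha2-256 multihash: <hash code><digest length><digest>."""
    digest = hashlib.sha256(data).digest()
    return varint(SHA2_256) + varint(len(digest)) + digest


def cid_v1(codec: int, mh: bytes) -> str:
    """CIDv1 as a base32 (multibase prefix 'b') string."""
    raw = varint(1) + varint(codec) + mh
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def multihash_of(cid: str) -> bytes:
    """Strip the multibase prefix, version and codec, leaving the multihash."""
    body = cid[1:].upper()
    raw = base64.b32decode(body + "=" * (-len(body) % 8))
    pos = 0
    for _ in range(2):  # skip the version varint, then the codec varint
        while raw[pos] & 0x80:
            pos += 1
        pos += 1
    return raw[pos:]


block = b"the same bytes, interpreted two ways"
mh = multihash(block)
pinner_view = cid_v1(RAW, mh)            # what the pinning node thinks it stores
downloader_view = cid_v1(DAG_CBOR, mh)   # what the downloader thinks it fetches

print(pinner_view)
print(downloader_view)
print(multihash_of(pinner_view) == multihash_of(downloader_view))
```

The two CID strings differ, but the multihash underneath is identical, which is why a multihash-addressed blockstore can serve the block to either side. The provider records are a separate matter: they are keyed by whatever root each side believes in.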
Delta adds.
We could add a `--delta=<CID>` option to add (or make it a standalone thing, the details are not important).
This would use a chunking strategy that assumes all blocks in `--delta` are free and tries to reuse them as much as possible.
This would make for cheap incremental updates (note: that would not be that good because we would be limited to whole blocks; more advanced deltas that can pick variable sizes and arbitrary offsets are far more efficient, but also more expensive to compute and atrocious if you are trying to unthread a very long chain of deltas).
Let's assume I download a new version of my app.
90% of the blocks are actually the same as in the previous version, but 10% are new.
We can assume that a lot of people already serve the old version, but not many serve the new one.
I would have trouble finding nodes serving the old version, even though they hold most of the blocks I need, since they would announce the old root CID and I would search for the new one. (Note: I assume the downloading node doesn't already have the original delta CID.)
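A toy illustration of this scenario, assuming fixed-size chunking and a stand-in "root" computed as a hash over the ordered leaf hashes (real UnixFS roots are DAG nodes, so this is only a sketch with made-up helper names):

```python
import hashlib

CHUNK = 4  # tiny block size for illustration; real chunkers use far larger blocks


def chunk_hashes(data: bytes, size: int = CHUNK) -> list[str]:
    """Split data into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def root_hash(leaves: list[str]) -> str:
    """Stand-in for a file root: a hash over the ordered list of leaf hashes."""
    return hashlib.sha256("".join(leaves).encode("ascii")).hexdigest()


old = b"AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJ"
new = b"AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIZZZZ"  # only the last block changed

old_leaves = chunk_hashes(old)
new_leaves = chunk_hashes(new)

# Fraction of the new version's blocks that old-version providers already hold.
old_set = set(old_leaves)
reuse_ratio = sum(1 for h in new_leaves if h in old_set) / len(new_leaves)

print(reuse_ratio)
print(root_hash(old_leaves) == root_hash(new_leaves))
```

Nine of the ten leaf blocks are shared, yet the two roots differ, so a node that only looks up the new root never discovers the many peers that announced the old one.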
I would like to see some priority system. Advertising all CIDs is expensive and only useful in certain rare scenarios, or in scenarios that don't even exist yet. I think it would be nice if we could layer strategies. So my node would burn full speed at 1200% CPU until all directories and file roots are published, which would hopefully take a minute. It would then go to a throttled mode at 200%, where it would publish all CIDs over the next 3 hours or so.
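A rough sketch of what such layering could look like, assuming each block is already labelled with a kind. The labels, the `Block` type, and both helper functions are hypothetical and only illustrate the two-tier idea; nothing here reflects an existing go-ipfs API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    cid: str
    kind: str  # hypothetical labels: "directory", "file-root", or "leaf"


# Tier 0 is announced at full speed, tier 1 in the throttled background pass.
PRIORITY = {"directory": 0, "file-root": 0, "leaf": 1}


def provide_plan(blocks: list[Block]) -> tuple[list[str], list[str]]:
    """Split CIDs into (urgent, background) while preserving input order."""
    urgent = [b.cid for b in blocks if PRIORITY[b.kind] == 0]
    background = [b.cid for b in blocks if PRIORITY[b.kind] == 1]
    return urgent, background


def throttle_interval(count: int, window_seconds: float) -> float:
    """Seconds to wait between announcements so `count` CIDs fill the window."""
    return window_seconds / count if count else 0.0


blocks = [
    Block("cid-leaf-1", "leaf"),
    Block("cid-dir", "directory"),
    Block("cid-leaf-2", "leaf"),
    Block("cid-file", "file-root"),
]
urgent, background = provide_plan(blocks)
interval = throttle_interval(len(background), 3 * 60 * 60)

print(urgent)
print(background)
print(interval)
```

Here the directory and file root go out first, and the two leaves are spaced across the three-hour window.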
`Reprovider.Strategy: entities`, which only announces the minimal set of blocks required for enumeration. For a file or DAG-CBOR document, that will be a single root block. For a HAMT-sharded UnixFS directory, it would be the HAMT blocks.
Improving provider strategies was previously discussed in: https://github.com/ipfs/go-ipfs/issues/6221, https://github.com/ipfs/go-ipfs/issues/5774, https://github.com/ipfs-inactive/package-managers/issues/84. In this issue I want to propose a well-scoped improvement of codec-aware strategy that could be shipped without refactoring the entire system.
TLDR
Problem statement
Right now, we support three values in `Reprovider.Strategy`, which tells the reprovider what should be announced. Valid strategies are `all`, `pinned`, and `roots`.
If the repository gets too big, `all` and `pinned` are too expensive and folks are forced to use `roots`, which is codec-agnostic and will only announce the root block of a UnixFS DAG.
This means that for big UnixFS datasets, the user has to write additional orchestration code to go the extra mile and manually pin every file within a bigger DAG, and make sure those sub-pins are removed when the entire DAG is no longer needed.
Proposed solution: codec-aware (UnixFS) strategy
Depending on the codec, different blocks may have different importance. In the case of UnixFS, the important blocks are the manifest (root) blocks of directories and files. Sub-blocks of individual files, which hold the data itself, are not as critical as those manifest blocks. It is the CID of the manifest block that is looked up on the DHT first.
A big data provider may want to opt in to a codec-aware strategy as a "best-effort" way to provide something on the DHT rather than nothing: in the case of UnixFS, only provide these manifest blocks on the DHT, facilitating the initial lookup without the cost of announcing all the sub-blocks.
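As a rough illustration of which blocks such a strategy would pick, the sketch below walks a toy DAG, announces every directory, HAMT shard, and file root it reaches, and never descends into a file's data blocks. The dictionary layout, the `kind` labels, and `entity_cids` are assumptions made for this example, not an existing API.

```python
# Toy DAG: CID -> node kind and child links (all names are made up).
DAG = {
    "dir-root":  {"kind": "directory",  "links": ["file-a", "shard-1", "small-c"]},
    "file-a":    {"kind": "file",       "links": ["a-chunk-1", "a-chunk-2"]},
    "a-chunk-1": {"kind": "raw",        "links": []},
    "a-chunk-2": {"kind": "raw",        "links": []},
    "shard-1":   {"kind": "hamt-shard", "links": ["file-b"]},
    "file-b":    {"kind": "file",       "links": ["b-chunk-1"]},
    "b-chunk-1": {"kind": "raw",        "links": []},
    "small-c":   {"kind": "raw",        "links": []},  # small file stored as one raw block
}

CONTAINERS = {"directory", "hamt-shard"}


def entity_cids(dag: dict, root: str) -> list[str]:
    """Return the CIDs to announce: containers plus the root block of each file."""
    announce: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        # Only the root and children of containers ever reach the stack,
        # so every visited node is an entity root worth announcing.
        announce.append(cid)
        node = dag[cid]
        if node["kind"] in CONTAINERS:
            # Descend into directories and shards, but never into file sub-blocks.
            stack.extend(reversed(node["links"]))
    return announce


to_announce = entity_cids(DAG, "dir-root")
print(to_announce)
print(f"{len(to_announce)} of {len(DAG)} blocks announced")
```

In this toy DAG five of the eight blocks are announced; for large files with many data blocks the skipped share would be much larger.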
Open questions