ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Improved Reprovider.Strategy for entity DAGs (HAMT/UnixFS dirs, big files) #8676

Open lidel opened 2 years ago

lidel commented 2 years ago

@aschmahmann @petar I remember we discussed this a while ago as low-hanging fruit for bigger data providers like Pinata, but I was unable to find an issue, so I created this one.

Improving provider strategies was previously discussed in: https://github.com/ipfs/go-ipfs/issues/6221, https://github.com/ipfs/go-ipfs/issues/5774, https://github.com/ipfs-inactive/package-managers/issues/84. In this issue I want to propose a well-scoped improvement, a codec-aware strategy, that could be shipped without refactoring the entire system.

TLDR

Problem statement

Right now, we support three values in Reprovider.Strategy, which tell the reprovider what should be announced. Valid strategies are all, pinned, and roots.

If the repository gets too big, all and pinned are too expensive, and folks are forced to use roots, which is codec-agnostic and will only announce the root block of a UnixFS DAG.

This means that, in the case of big UnixFS datasets, the user has to write additional orchestration code to go the extra mile and manually pin every file within a bigger DAG, and make sure those sub-pins are removed when the entire DAG is no longer needed.

Proposed solution: codec-aware (UnixFS) strategy

Depending on the codec, different blocks may have different importance. In the case of UnixFS, the important blocks are the manifest (root) blocks of directories and files. Sub-blocks of individual files, which hold the data itself, are not as critical as those manifest blocks. It is the CID of the manifest block that is looked up on the DHT first.

A big data provider may want to opt in to a codec-aware strategy as a "best-effort" way to provide something on the DHT rather than nothing: in the case of UnixFS, only provide these manifest blocks on the DHT, facilitating the initial lookup without the cost of announcing all the sub-blocks.

Open questions

Jorropo commented 2 years ago

Potential fix would be to do DHT lookup not only for a specific sub-block in a file, but also for the first UnixFS root block above them (either a root of a file, or a parent directory). Rationale being, if someone has the root of a file, they most likely have the rest.

This seems reasonable. (I was actually writing this before I had fully read your message.)

The biggest issue with this is that we unofficially create a special case that treats UnixFS files as single independent entities, which makes it harder to create new interesting cross-file features in the future.

Two options that I would like to have would break with such a thing:

What I would like to see.

I would like to see some priority system. Advertising all CIDs is expensive and only useful in certain rare scenarios, or in scenarios that don't even exist yet. I think it would be nice if we could layer strategies. So my node would burn at full speed, at 1200% CPU, until all directories and file roots are published, which would hopefully take a minute. It would then go into a throttled mode at 200%, where it publishes all CIDs over the next 3 hours or so.

lidel commented 9 months ago