ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Tracking issue for UnixFS automatic unsharding #8148

Closed aschmahmann closed 2 years ago

aschmahmann commented 3 years ago

Tracking implementation of #7022

Continuation of #8106

Per the linked issue(s), in order to enable sharded directories by default we want to shard directories when required. However, we don't just want to automatically shard directories that grow too big; we also want to automatically unshard directories that shrink too small.

The reasoning here is that it is more intuitive to users if the CID of a directory is the same irrespective of how elements were added to/removed from it. While it is almost always a bad idea to try to recreate the CID of a DAG from the data that made it up (as I mentioned in https://github.com/ipfs/go-ipfs/issues/7022#issuecomment-832178501), there are some use cases where people may want to do something like https://github.com/ipfs-shipyard/ipfs-pack and therefore want a way to specify exactly how a DAG was created from some other format (e.g. files + directories). If we didn't support automatic unsharding at the same cutoff as the automatic sharding, then the MFS (ipfs files) commands would be unusable for those users.

A possible implementation strategy suggested by @Stebalien is in https://github.com/ipfs/go-ipfs/issues/8106#issuecomment-834719686.

cc @schomatis

schomatis commented 3 years ago

Given that the size-based sharding implementation has already landed (though not yet bubbled up to go-ipfs), the unsharding part is pretty straightforward. The only thing that merits some discussion is how to track the estimated size of the directory to decide when to switch to sharding and (in this case) back. What is needed is an explicit and documented decision about which use case/scenario/user we are targeting. This is not a technical issue (in terms of impact on the code base or internal implementation considerations) but purely a question of the use case, so I'll just leave the broad options here and defer to @aschmahmann (or whoever he thinks is appropriate).

Size estimation. Here and in the code, "size" loosely refers to an estimate of the byte size the directory would have when serialized. The current sharding implementation computes this by aggregating the size of each link (directory entry), more precisely the byte lengths of its name and CID. The exact estimation formula is not relevant for the decision needed here; we just note that we estimate size based on links. This means the size changes only when we add or remove an entry in the directory (both entry points are already identified in the current implementation).
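For illustration only, a minimal Go sketch of that link-based estimation (the `link` type, `estimatedSize`, and the sample values are assumptions made up for the example, not the actual go-unixfs code):

```go
package main

import "fmt"

// link is a minimal stand-in for a UnixFS directory entry (the real code
// works with IPLD links carrying a name and a CID).
type link struct {
	Name     string
	CIDBytes []byte // raw byte representation of the entry's CID
}

// estimatedSize approximates the serialized size of the directory by
// aggregating, per entry, the byte lengths of its name and CID, as
// described above. The exact formula used by go-unixfs may differ.
func estimatedSize(links []link) int {
	total := 0
	for _, l := range links {
		total += len(l.Name) + len(l.CIDBytes)
	}
	return total
}

func main() {
	dir := []link{
		{Name: "file-a.txt", CIDBytes: make([]byte, 34)},
		{Name: "file-b.txt", CIDBytes: make([]byte, 34)},
	}
	fmt.Println("estimated size:", estimatedSize(dir))
}
```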

The broad ends of the spectrum for doing the size computation can be described as:

- recompute the estimate from scratch, enumerating all the links, every time the sharding/unsharding decision has to be made (no cached state); or
- cache the estimate and keep it updated on every entry addition/removal (the CACHE/UPDATE-ALWAYS strategy of the current implementation).

My recommendation, without much knowledge of the underlying use case, is to keep the CACHE/UPDATE-ALWAYS strategy from the current implementation (enhancing it to also track the size as a HAMTDirectory once the threshold is crossed; see the sketch after this list), because:

  1. It is the easiest non-naive implementation available and also the one currently in use (making its expansion very straightforward).
  2. Even though it adds a penalty to every entry addition/removal, it does so in a scattered fashion, in contrast with the penalty of a full enumeration, which might trigger a significant delay at an unexpected moment (from the user's standpoint). Also, the cost of the cache update is very small compared with DAG/HAMT manipulation (the worst case of this strategy only shows up when aggregated over the entire lifespan of the UnixFS structure, never concentrated in a single operation).
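To make the recommendation concrete, here is a rough Go sketch of the CACHE/UPDATE-ALWAYS idea: keep a running size estimate, adjust it on every mutation, and switch representation when the threshold is crossed. All names and the threshold value below are illustrative assumptions, not the go-unixfs API:

```go
package dirsize

// Hypothetical cutoff for switching between a basic and a sharded (HAMT)
// directory; the real default in go-unixfs may differ.
const sizeThreshold = 256 * 1024

// cachedDir sketches a directory that always keeps its size estimate up to
// date, both as a BasicDirectory and as a HAMTDirectory.
type cachedDir struct {
	sizeEstimate int  // running link-based estimate, updated on every mutation
	sharded      bool // true when represented as a HAMTDirectory
}

func entrySize(name string, cidBytes []byte) int {
	return len(name) + len(cidBytes)
}

// addEntry updates the cached estimate and shards when the directory grows
// past the threshold.
func (d *cachedDir) addEntry(name string, cidBytes []byte) {
	d.sizeEstimate += entrySize(name, cidBytes)
	if !d.sharded && d.sizeEstimate > sizeThreshold {
		d.sharded = true // BasicDirectory -> HAMTDirectory
	}
}

// removeEntry updates the cached estimate and unshards when the directory
// shrinks back below the threshold.
func (d *cachedDir) removeEntry(name string, cidBytes []byte) {
	d.sizeEstimate -= entrySize(name, cidBytes)
	if d.sharded && d.sizeEstimate <= sizeThreshold {
		d.sharded = false // HAMTDirectory -> BasicDirectory
	}
}
```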
aschmahmann commented 3 years ago

CACHE/UPDATE-SOMETIMES, as suggested in the linked comment, seems better than updating always. This is because if you have a large sharded directory and only need to operate on a small part of it, you're now stuck loading all of the HAMT shards over the network or from disk, both of which could be quite expensive. Limiting the expensive operations to only happen on removals, and in those cases to only happen sometimes, seems like a good tradeoff to me.
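As I read it, a rough Go sketch of that CACHE/UPDATE-SOMETIMES behaviour could look like the following (every identifier and the threshold value are hypothetical; the point is only to show where the expensive enumeration is paid):

```go
package dirsize

// Hypothetical cutoff, as in the previous sketch.
const sizeThreshold = 256 * 1024

// hamtDir sketches a sharded directory where we avoid walking every shard
// unless we really have to.
type hamtDir struct {
	// knownSizeLowerBound is a cheap lower bound on the serialized size,
	// e.g. derived only from the shards already loaded locally.
	knownSizeLowerBound int
}

// enumerateAllLinks stands in for the expensive operation: walking every
// HAMT shard (possibly fetching them over the network or from disk) to
// compute the exact link-based size estimate.
func (d *hamtDir) enumerateAllLinks() int {
	// ... load and walk all shards ...
	return 0
}

// maybeUnshardAfterRemoval is called after an entry is removed. Additions
// never need this check (they can only grow the directory); removals pay
// the enumeration cost only when the cheap bound no longer proves the
// directory is still above the threshold.
func (d *hamtDir) maybeUnshardAfterRemoval() bool {
	if d.knownSizeLowerBound > sizeThreshold {
		return false // provably still big enough; skip the expensive walk
	}
	return d.enumerateAllLinks() <= sizeThreshold
}
```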

schomatis commented 3 years ago

I'm fine going this way, but just to be totally sure we have a shared understanding, let me ask a few basic questions. First, some clarifications I think were missing from my previous writeup:

1.

> Limiting the expensive operations to only happen on removals

By this we mean enumerating all links each time an entry is removed, right? The CACHE/UPDATE-ALWAYS alternative will enumerate only at the start and never again (for the duration of the UnixFS structure's lifespan).

2.

> and in those cases to only happen sometimes seems like a good tradeoff to me.

Could you expand on this trade-off, please?

warpfork commented 3 years ago

On motivations: I'd like to suggest a much simpler and stronger lens, and a piece of vocab, for framing and understanding the value of this:

Does the data structure have Hysteresis or not?

In other words, does the structure vary based on its history (e.g. the order in which entries were added and removed), or is it determined purely by its content at the current point in time?

This is relevant because one of the key selling points of HAMTs is that they do not have hysteresis.

Therefore it would be extremely odd if we were to make any choice directly adjacent to how the HAMT is used that throws away one of its primary virtues. (If we throw away hysteresis, then there are a lot of other algorithms that could potentially outperform a HAMT anyway. We've been betting on HAMTs for a long time largely because of their non-hysteresis.)

Working through the user stories and aiming to minimize the number of surprises in the end-user experience when re-importing data also produces pretty similar justifications. I find this "hysteresis" keyword also helps a lot with clarity coming from the engineering side up: it's a property, and it's one that we can lose. Compositions of algorithms which do not lose this property are valuable because they remain more composable and because they minimize user surprise.

schomatis commented 3 years ago

Yes, we've been using the 'Hysteresis' term informally, but feel free to push documentation into the code to reflect it (as it seems the UnixFS/MFS API implicitly promises this). Because of my personal background I'm very much in favor of making this term mainstream.

Note that hysteresis needs to be properly qualified, though. We care less about the determinacy through time of the data structure implementing the Directory interface than about the resulting CID of its root node (at least for the requirements of this issue).

schomatis commented 2 years ago

Done.