ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Default 5 digit prefix for flatfs is no longer sufficient for CidV1 #3463

Closed. kevina closed this issue 7 years ago

kevina commented 7 years ago

When we switch to CidV1, the assumptions we built into the 5 (base 32) digit prefix:

    // 5 bytes of prefix gives us 25 bits of freedom, 16 of which are taken
    // by the Qm prefix. Leaving us with 9 bits, or 512 way sharding

are no longer correct. In fact, when raw leaves are used, most of the blocks will be clustered in the AFZBE directory. See ipfs/go-ipfs#3462. There are several ways to fix this:

  1. One way to fix this is to increase the prefix length, but then we would be back to the problem we had before, where we have lots of directories with just a few files in them (when CidV0 is used).
  2. Another alternative is proposed in ipfs/go-ds-flatfs#5. That is, use the last N digits (likely a 2 digit suffix, which will give us 1024 way sharding).
  3. Finally, and the most work, is to stick with using a prefix but implement a dynamic prefix width.

I think (2) would be the best alternative until (3) can be implemented.
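To make the trade-off concrete, here is a minimal sketch of (1)-style prefix sharding versus (2)-style suffix sharding, assuming the keys are the base32-encoded strings flatfs already stores (the function names are illustrative, not part of go-ds-flatfs):

    // prefixShard is roughly the current scheme: shard on the first n
    // characters of the key. With CidV1 raw leaves most keys share the
    // same leading characters, so they cluster in a few directories.
    func prefixShard(key string, n int) string {
        return key[:n]
    }

    // suffixShard is option (2): shard on the last n characters, which
    // still vary even when the keys share a common prefix.
    func suffixShard(key string, n int) string {
        return key[len(key)-n:]
    }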

kevina commented 7 years ago

@whyrusleeping I can likely implement the last-two-digits approach quickly; the dynamic solution will take more work and I don't think it should be in 0.4.5.

kevina commented 7 years ago

Also, as a followup from the sprint meeting: the flatfs datastore doesn't even implement range queries, so I don't think this will be an issue at all. Once we figure out how to properly implement something dynamic, then we can revisit the issue of range queries.

Kubuxu commented 7 years ago

I agree that the dynamic solution might be too complex to tackle for 0.4.5.

jbenet commented 7 years ago

the problem we had before where we will have lots of directories with just a few files in them

How big of a problem is this? Has anyone quantified it with a graph of read latency overhead, particularly for a node under lots of activity?

I don't think we should do (2). You're shifting the problem to an unconventional thing, without telling users that this has shifted. I may be fine if we add a file called .ipfs/blocks/_README that says something like:

This is a repository of IPLD objects. Each IPLD object is in a single file,
named `<hash-of-object>.ipld`. All the object files are placed in a tree
of directories, based on a function on the object hashes. This is a form 
of sharding similar to the objects directory in git repositories. Previously, 
we used prefixes. Now, we use suffixes. The exact function is:

    func FilePath(objectHash string) string {
      // spread the last 6 characters of the hash into nested
      // directories, e.g. "...abc" -> ".../a/b/c" (uses the strings pkg)
      suffix := objectHash[len(objectHash)-6:]
      return strings.Join(strings.Split(suffix, ""), "/")
    }

For example, an object that hashes to:

    <hash of object>

Will be placed at

    a/b/c/objhashabc.ipld

I think this file o/ would make me feel ok about changing the function that figures out the location of the content.

kevina commented 7 years ago

@jbenet, the read latency overhead really depends on the filesystem. In addition, some file systems (such as ext4 when dir_index is enabled) start to have problems when a directory gets too large. I will get you some numbers in a bit on why (1) would be a problem now (and wasn't before).

jbenet commented 7 years ago

It's fine -- just go with (2) and add that file.

kevina commented 7 years ago

@jbenet, okay will do. Thanks!

kevina commented 7 years ago

@jbenet @whyrusleeping should we just modify go-ds-flatfs to use a suffix, or create a new version and call it maybe go-ds-flatfs-suffix? (Modifying it to do both will likely create a lot of special-case code I would like to avoid.)

whyrusleeping commented 7 years ago

@kevina make the sharding selector a function so we can specify it when creating the flatfs datastore. This will make migrations so much easier for me.

kevina commented 7 years ago

@whyrusleeping All right I will see if I can make that work.

I am also willing to handle most of the migration.

kevina commented 7 years ago

See https://github.com/ipfs/go-ds-flatfs/pull/6 for a start. The README will be generated by go-ipfs not flatfs as it is go-ipfs specific.

kevina commented 7 years ago

Because we use base32 encoding, the very last byte/digit is not quite as random as I would hope, and it also depends on the length of the binary representation:

| Length of Key in Bits | Bits in Last Byte |
|---|---|
| 8 | 3 |
| 16 | 1 |
| 24 | 4 |
| 32 | 2 |
| 40 | 5 |
| ... | ... |
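In other words, the number of useful bits in the last character is just the key's bit length modulo 5, with a full 5 bits when the length divides evenly. A quick sketch of that arithmetic (illustrative only):

    // bitsInLastDigit reports how many key bits end up in the final
    // base32 character: base32 packs 5 bits per character, so the last
    // one carries the remainder (a full 5 bits when keyBits is a
    // multiple of 5).
    func bitsInLastDigit(keyBits int) int {
        if r := keyBits % 5; r != 0 {
            return r
        }
        return 5
    }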

When converting the repo created for #3462, which has a mixture of CidV0 and CidV1 raw leaves, a 2 digit suffix got me 256 levels of sharding (instead of the expected 1024); I didn't check, but the distribution is unlikely to be even. With additional Cid types this number could go up to 1024 directories, depending on the length of the key.

Increasing the suffix to 3 digits will increase the number of directories to 8192, but that could go up to 32768 directories.

Another possibility (that would be easy for me to implement based on what I have now) is to disregard the very last digit and take the next-to-last 2 digits; then we get an evenly distributed 1024 way sharding. For example, if the key is AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY, the directory used will be CA (instead of AY or CAY).
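A minimal sketch of this next-to-last idea (illustrative names, not the actual go-ds-flatfs API):

    // nextToLastShard skips the final character (which may carry fewer
    // than 5 bits) and uses the n characters before it as the directory,
    // e.g. "...CUCAY" with n=2 -> "CA".
    func nextToLastShard(key string, n int) string {
        return key[len(key)-n-1 : len(key)-1]
    }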

@whyrusleeping, what do you think?

Sorry for this oversight.

kevina commented 7 years ago

Okay, I ran some simulations using the keys from the repo created in #3462. I confirmed that using a fixed prefix just won't work when using both CidV0 and CidV1 keys: make it long enough to handle CidV1 and you end up with lots of directories with only a few files in them when CidV0 is used. The following options could work:

| How | Data Set | Num Dirs | Max Dirs |
|---|---|---|---|
| Suffix length 2 | CidV0 | 128 | 1024 |
| Suffix length 2 | CidV0 and Raw Leaves | 256 | 1024 |
| Next to Last 2 | CidV0 | 1024 | 1024 |
| Next to Last 2 | CidV0 and Raw Leaves | 1024 | 1024 |
| Suffix length 3 | CidV0 | 4096 | 32768 |
| Suffix length 3 | CidV0 and Raw Leaves | 8092 | 32768 |

The distribution is reasonably even in all cases (I was wrong in my initial guess).

Here is the list of keys used and various results: https://gateway.ipfs.io/ipfs/QmNbugVGQDcMpDYezAELasskUWyPms9JAjFBRYuJcZtqFC

kevina commented 7 years ago

I would either go with the suffix length of 2 or use the next-to-last 2 digits. A suffix length of 3 will create too many directories for an average-sized repo, especially for keys whose length is a multiple of 5 bytes (in which case there could be up to 32768 directories).

We could also make the suffix length a configurable option with the default value being 2. I am not sure what additional complications this will bring though.

whyrusleeping commented 7 years ago

Ah, so the last character in the base32-encoded keys doesn't contain an entire 5 bits of data. Given that's the case, let's go ahead and use a suffix length of three.

kevina commented 7 years ago

@whyrusleeping are you sure? That can lead to a large number of directories with only a few files per directory for an average-sized repo. If you don't like my next-to-last idea, I would stick with 2 or make it configurable.

whyrusleeping commented 7 years ago

yep

Kubuxu commented 7 years ago

Other idea is: how about hashes of the CID, with some fast or extremely fast hashing algorithm?

Fast: Murmur3, Blake2b/s.

Extremely fast (http://www.cse.yorku.ca/~oz/hash.html): djb2, sdbm.

kevina commented 7 years ago

I am not sure I like the idea of hashing again, @Kubuxu, as it would make it difficult for users to manually find where a key is. I would still prefer the next-to-last 2 digits, as that gives a perfect 1024 way fanout no matter what the key is and still makes it possible to manually find the key. I can live with the last 3 digits, though.

Kubuxu commented 7 years ago

Using the last digits (or N from the last M) will already be quite hard for manual use, so I don't think manual usability is the first priority.

kevina commented 7 years ago

@Kubuxu, by inspecting the base 32 representation of the key, for example:

AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY

I can easily determine that the last 3 digits are CAY and the next-to-last 2 digits are CA; I don't see what is hard about that. If we use a hash function, then I will have no idea what directory this key is in unless I first compute the hash or use something like find to search for it.

Kubuxu commented 7 years ago

I am not trying to argue, just showing that there is an option, which might be worth exploring if we hit the same issue again. Each migration is a small hiccup for users.

whyrusleeping commented 7 years ago

@Kubuxu we can't do hashes of the CIDs. How would `ipfs refs local` work?

Kubuxu commented 7 years ago

Hashes of the CID would be used only to choose the directory container for flatfs; inside this container, the block would still have its original hash as its name.

robcat commented 7 years ago

@Kubuxu

Other idea is: how about hashes of the CID, with some fast or extremely fast hashing algorithm? Fast: Murmur3, Blake2b/s. Extremely fast (http://www.cse.yorku.ca/~oz/hash.html): djb2, sdbm.

Wait a moment: if you accept that the directory name can be an arbitrary function of the CIDv1 (and not just a slice of its b58 encoded CIDv1), the obvious candidate should be this function:

  1. take the last k bits of the binary CID (getting 2^k-fold sharding)
  2. encode them in a filesystem-friendly format (e.g. hex)

This seems the obvious solution, since CIDv1 and multihash specifications already guarantee that the last bytes are part of a hash.

Kubuxu commented 7 years ago

Yes, but the keys are already encoded in base32 due to how the datastore API works, which means that the randomness of the last N bytes depends on the length of the input.

kevina commented 7 years ago

@robcat that is another idea that @whyrusleeping suggested over IRC. It would involve decoding the base32 encoding, taking the last binary bits, and then re-encoding them. So @Kubuxu, it can work, and I would say it is better than using another hash function. However, like the hash function, it makes it non-obvious on inspection which directory the block would be in.
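Sketched out, that decode-and-re-encode approach might look roughly like the following (assuming unpadded uppercase base32 keys; the function name and the hex re-encoding are illustrative, not an actual go-ds-flatfs function):

    import (
        "encoding/base32"
        "encoding/hex"
    )

    // binarySuffixShard decodes the base32 key back to bytes, takes the
    // last nBytes, and re-encodes them (hex here) as the directory name,
    // so the shard no longer depends on how many bits the base32 tail
    // happens to carry.
    func binarySuffixShard(key string, nBytes int) (string, error) {
        enc := base32.StdEncoding.WithPadding(base32.NoPadding)
        raw, err := enc.DecodeString(key)
        if err != nil {
            return "", err
        }
        return hex.EncodeToString(raw[len(raw)-nBytes:]), nil
    }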

whyrusleeping commented 7 years ago

Alright, we need to come to a decision here today.

I like the idea of using the next-to-last two digits more than the idea of rehashing or doing extra encoding work. Everyone seems pretty set against using the last three (though I really don't think it would be a problem).

@Kubuxu What do you think of using the next to last two?

whyrusleeping commented 7 years ago

also cc @diasdavid because he is interested in changes to flatfs

Kubuxu commented 7 years ago

I am not against last X or next to last X, but if next to last X gives us better distribution I think we should go for it.

daviddias commented 7 years ago

Just for clarification, when the word digit is used, what we mean is char, correct?

Hashing again to increase the distribution would bite us in terms of inspection, and would probably force us to add a multihash in front of each block to say which second hash is used, therefore changing the blocks.

Dynamic prefix width would mean that we would have to have a way to signal it, or to do several reads.

AFZBEIEDZ5V42DLERJG7TJUHH7M25VY5INXFMAVXXX7N3KRRTQCZACUCAY I can easily determine that the last 3 digits are CAY and the next-to-last 2 digits are CA I don't see what is hard about that.

Why not just pick the last 3 if the last char is not as random? Does this create 'too much' sharding?

Update from @whyrusleeping:

The reason for not using 'last 3' is that it could result in having 32k directories under the blocks dir; it's not that it's not random, it's that it's not uniformly random.


I would really appreciate it if this PR followed a proposed change in the spec and, even better, provided in the spec repo some example repos with a bunch of blocks in the CIDv0 and CIDv1 space that you use for tests, so that other implementations can do the same.

Kubuxu commented 7 years ago

The second hashing would only be used to choose which directory shard the key falls in; there is no need for a multihash there (it can be changed by migration or be flat-fs specific).

chooseShard(key) = djb2(key)[:2]

or something
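For concreteness, a sketch of what that could look like, using the classic djb2 hash from the page linked earlier (this was not adopted; the one-byte shard width here is arbitrary):

    import "fmt"

    // djb2 is the classic string hash from http://www.cse.yorku.ca/~oz/hash.html.
    func djb2(s string) uint32 {
        h := uint32(5381)
        for i := 0; i < len(s); i++ {
            h = h*33 + uint32(s[i])
        }
        return h
    }

    // chooseShard derives the directory from a hash of the key instead of
    // from the key's own characters (256 possible shards in this sketch).
    func chooseShard(key string) string {
        return fmt.Sprintf("%02X", djb2(key)%256)
    }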

Lengths and shards, see: https://github.com/ipfs/go-ipfs/issues/3463#issuecomment-265668675

The flat-fs sharding is not specced out anywhere.

kevina commented 7 years ago

@diasdavid yes, by digit I mean a char in the base 32 representation; sorry for the confusing terminology. I was trying to avoid "byte" to distinguish it from a byte in the raw binary representation.

daviddias commented 7 years ago

The next-to-last scheme doesn't seem like an unreasonable thing to ask contributors to understand; it just needs to be in the spec :) Consider this my 👍

kevina commented 7 years ago

Okay, it sounds like we are in agreement: use the next-to-last 2 characters to get a nice guaranteed 1024-way fanout no matter what type of Cid is used. @whyrusleeping agree?

I will update the code and also update the spec.

whyrusleeping commented 7 years ago

@kevina sounds good to me, ship it

jbenet commented 7 years ago

Separately, it would be nice to make sure the sharding is good for repos of any size. Having a constant amount of sharding is bad because repos of different sizes will perform poorly. I think the right solution starts small, then auto-scales up and "rearranges all the files" when necessary. This is a whole separate discussion though. For The Future.

jbenet commented 7 years ago

Wait, I didn't catch why "the next-to-last 2 chars" instead of "the last 2"-- what was the reason?

Kubuxu commented 7 years ago

base32 has a different count of entropy bits in the last byte depending on the length of the input, as it converts 5 bytes of input to 8 bytes of output. So if your hash is 101 bytes long, then the last byte will have only 4.8 bits of entropy, and the last two bytes will have 12.8 bits instead of 16 bits.

kevina commented 7 years ago

@jbenet, so that we are on the same page: the current fanout should be 512; if we implement this as planned it will be 1024.

See https://github.com/ipfs/go-ipfs/issues/3463#issuecomment-265647047 and my followup comments for the justification for next-to-last. Without it we won't get consistent sharding.

I strongly disagree with having a very large fanout by default. For the average size repo that could lead to lots of directories with only one or two files per directory (like we had before we switched to base32 encoding). I have no problem making the fanout configurable for the uncommon case of huge repos.

jbenet commented 7 years ago

Ah right. nice.

I think the fanout should adjust automatically, but we can get there.

Another option is to make it 2 tiers by default, instead of one.

kevina commented 7 years ago

@jbenet I'm confused. Are you in support of a configurable option? That should be fairly easy to implement.

whyrusleeping commented 7 years ago

@jbenet this change discussion is a fairly temporary one to make sure that we don't end up with horrible performance between now and when we can come up with a better block storage solution.

I'm okay moving forward with the next to last two given the discussion in this thread so far.

whyrusleeping commented 7 years ago

Alright, let's move forward with 'next to last two'.

With that, I want two things. I want the README file, as @jbenet suggested, to be generated by the codebase (filling in a template). If you like, I can write up roughly what I think this should look like, but the gist of it is you should show an example `echo foobar | ipfs add` to get a hash, then show the series of operations it takes to go from that hash to a block on disk.

We should also add a file with a multicodec that describes the format, so in this case /repo/flatfs/shard-next-to-last/2 or something (@Kubuxu may have input here). This way you can open up a flatfs datastore without having to know what its sharding format is beforehand.

Then, the conversion tool should be able to handle the required changes to both of those files.

kevina commented 7 years ago

@whyrusleeping I will create a README and work on updating the spec.

We should also add a file with a multicodec that describes the format

I am unclear what you are after here.

whyrusleeping commented 7 years ago

Make sure the README file gets autogenerated based on the flatfs sharding method being used.

For the multicodec, have a file called VERSION or SHARDING or something in the flatfs directory, next to the README, that contains a multicodec (a string) that tells us what sharding function is being used there. That way, we can 'open' an existing flatfs datastore without having to have the user specify the correct sharding function.

kevina commented 7 years ago

@whyrusleeping, autogenerating the README was not what I had in mind and will take some additional work. In addition it may be difficult to include all the information @jbenet wants in an auto-generated README, mainly the code for the sharding function in use.

kevina commented 7 years ago

Here is a draft of the README as I believe @jbenet wanted it: https://gist.github.com/kevina/e217dd4c763aaaafdab9657935920da5

kevina commented 7 years ago

For the version file I am going to go with a file named SHARDING and am going to use the string

v1/next-to-last/2

The v1 is to allow for upgrading when we do something more complicated than a simple function. If you want to prefix it with something, maybe:

/repo/flatfs/shard/v1/next-to-last/2
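As a sketch of how such a string could be parsed back into a shard function (illustrative only; the final string format and the real go-ds-flatfs code may differ):

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parseShardFunc turns a SHARDING string such as "v1/next-to-last/2"
    // (optionally prefixed with "/repo/flatfs/shard/") into a function
    // mapping a key to its directory name.
    func parseShardFunc(spec string) (func(string) string, error) {
        spec = strings.TrimPrefix(strings.TrimSpace(spec), "/repo/flatfs/shard/")
        parts := strings.Split(spec, "/")
        if len(parts) != 3 || parts[0] != "v1" {
            return nil, fmt.Errorf("unrecognized sharding spec: %q", spec)
        }
        n, err := strconv.Atoi(parts[2])
        if err != nil {
            return nil, err
        }
        switch parts[1] {
        case "prefix":
            return func(key string) string { return key[:n] }, nil
        case "suffix":
            return func(key string) string { return key[len(key)-n:] }, nil
        case "next-to-last":
            return func(key string) string { return key[len(key)-n-1 : len(key)-1] }, nil
        }
        return nil, fmt.Errorf("unknown sharding function: %q", parts[1])
    }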