ipfs / go-cid

Content ID v1 implemented in go
MIT License
157 stars 47 forks source link

Suggestion to avoid exra layer of indirection by using a more compact representation of Cid #3

Closed kevina closed 6 years ago

kevina commented 8 years ago

The conversion from plain Key to Cid introduced an extra layer of indirection that can likely be avoided.

Currently Cid takes up the width of 5*64 bits. With it's current definition of:

type Cid struct {
    version uint64
    codec   uint64
    hash    mh.Multihash
}

May I suggest

type Cid struct {
    version uint32
    codec   uint32
    hash    mh.Multihash
}

As 32 bits should be more than enough for version and codec which I don't think will ever get very large (even 16 bits should be enough). If combined with multiformats/go-multihash#29 the size will now be 3*64 bits, which is the same size an array slice.

As the same size of a slice it becomes practical to pass it around by value which avoids an extra layer of indirection. Even though it is passed by value, the internals can still be kept hidden, as is done with many other go datatypes.

Just a suggestion. It might make a difference when you have an array of 1000's of Cids (which I might in some of my maintenance commands for the filestore, ipfs/go-ipfs#2634).

whyrusleeping commented 8 years ago

@kevina Yeah, this is a pretty good point. I'm not opposed to this.

kevina commented 8 years ago

Okay, this is something I will measure then. I will also measure the benefits (and cons) of using a string for a Multihash.

It may be a while before I can get to it as I can't just change the Cid form a pointer to a value type without lots of code change so I will need to isolate the benchmark code to have a few dependencies as possible before I can measure anything.

kevina commented 8 years ago

I did some informal test and determined that the overhead for the benchmark in ipfs/go-ipfs#3376 was around 17ms. Most of the time for that benchmark was in the base32 decoding (see whyrusleeping/base32#1). This might still be useful but nothing urgent.

Stebalien commented 7 years ago

So, I opened a competing issue (#25) because having to call KeyString() every time one wants to use a CID in a map is a pain (and slow). However, making multihash a string would also fix this issue.

Kubuxu commented 7 years ago

@Stebalien as I am updating GC code that relies heavily on the indexing by Cid having the Cid stored as string possibly could improve it quite a bit.

Stebalien commented 7 years ago

Taking a look at my heap profile, it looks like the KeyString function is now responsible for ~10% of my allocated memory.

kevina commented 7 years ago

@Stebalien making multihash a string is actually a different issue in a different repo, https://github.com/multiformats/go-multihash/issues/29. @whyrusleeping was concerned about the cost of converting the string to a []byte but maybe now its the cost of converting a []byte to a string that is really the problem.

Stebalien commented 7 years ago

Ah. I see. Either that or just storing CIDs in as byte strings (fully serialized) would work.

kevina commented 7 years ago

@Stebalien so if I understand you correctly, what you want is for the official CID representation to be a serialized string. That would be the most efficient in memory representative however they may be a non-trivial cost then if we need information on the Cid itself.

If there is any interest this is something I could look into.

Stebalien commented 7 years ago

My main concern is that I want it to be usable as a key in a map. We can achieve this by:

  1. Storing the multihash as a string and using the raw Cid (not a pointer to it) as a map key.
  2. Storing the encoded CID inside of the CID as a string (struct { version uint64, codec uint64, cidStr string }). KeyString would return this (slightly more space).
  3. Just using the bare CID as a string.
kevina commented 7 years ago

@Stebalien ah, now I see what you are after. In most of these cases we should stop passing around CIDs as pointers (see #38). I am not sure if the Cid type necessary should be directly storable as map keys, but we should definitely aim to reduce the cost of KeyString.

kevina commented 6 years ago

We are going with a string representation.