ipfs / go-cid

Content ID v1 implemented in go
MIT License
157 stars 47 forks source link

cid implementation roundup #70

Closed warpfork closed 6 years ago

warpfork commented 6 years ago

It's been discussed in several issues and PRs already that we might want to explore various ways of implementing CIDs for maximum performance and ease-of-use because they show up extremely often. Current CIDs are pointers, which generally speaking means you can't get one without a malloc; and also, they're not particularly well-suited for use in map keys.

This branch is to attempt to consolidate all the proposals so far (https://github.com/ipfs/go-cid/issues/38, https://github.com/ipfs/go-cid/pull/64, https://github.com/ipfs/go-cid/pull/47, more?) -- and do so in a single branch which can be checked out and contains all the proposals at once, because this will make it easy to do benchmarks and compare all of the various ways we could implement this in one place (and also easier for humans to track what the latest of each proposal is, since they're all in one place). (I ended up re-writing much of it because some of the existing work was a pain to cherry-pick apart, either because it was stale and had conflicts with latest master or because it mixed other topics like multibase, but this does borrow heavily from @dignifiedquire and @kevina 's code on the other branches.)

There are two/three major implementations here: one as struct (familiar), one as typedef of string, and one as an interface (implemented by the other two).

I composed a README.md with a summary of what I learned. Some of it became obvious just from thrashing the code around; some of it is supported with benchmarks (also included in the diff).

There's no attempt here to change the implementation of multihash... however, it's likely that doing so could further shift some of the performance numbers quite a bit. Skim the alloc counts in the benchmarks for impressions.

This whole hullabaloo is compartmentalized under a _rsrch directory... which means it won't typically be built or tested by e.g. go test ./... unless you're already in the _rsrch directory. It also does not change any of the existing packages. Therefore, it is completely safe to merge to master as sheer documentation of our research, even if we choose not to pursue any of these refactors at this time.

Hopefully this is good further food for thought and can be iterated on to get us closer to a committal choice about this stuff. (Or at least merged as research, rather than abandoned to drift as an out-of-sync branch. I dunno if you can tell, but those are my least favorite thing.)

warpfork commented 6 years ago

As a result of this, I'm mostly coming around to being a pretty vigorous advocate for type Cid string.

It might be nice to have a type CidInterface interface{...} at the same time... but its usage should be minimized. In particular, I think map[CidInterface]* is a really bad idea and should be avoided at all costs. (func CanonicalizeCid(CidInterface) Cid would be a fine thing to have, though.)

kevina commented 6 years ago

In particular, I think map[CidInterface]* is a really bad idea and should be avoided at all costs.

I agree and it was never my intention to propose such usage. That is what CidString is for. It is an implementation of a Cid as a string for use in maps.

Stebalien commented 6 years ago

Let's do it!

kevina commented 6 years ago

@Stebalien

Let's do it!

What?

I would like a chance to review these benchmarks.

In addition, there where several things about @dignifiedquire that I did not like and changed when I implemented my version of it in #65. We do not have to go with an interface, but I would like my changes to the String representation to go in.

We should really also change Multihash to be a string if a Cid is string, otherwise we we be creating a lot of needless alloc. We should also consider enhancing the multibase interface to (in addition) provide functions that take a string so that we can avoid an alloc there, but I see that as a lower priority.

I would be happy to take charge of this.

Stebalien commented 6 years ago

Let's do it!

What?

We've been grumbling about this for ages and it's a significant usability (can't use CIDs in maps) and performance (allocation/conversion) issue.

We should really also change Multihash to be a string if a Cid is string, otherwise we we be creating a lot of needless alloc.

I agree.

but I would like my changes to the String representation to go in.

Which changes?

I would be happy to take charge of this.

Let's finish the in-progress endeavors first. Are they blocked on something?

kevina commented 6 years ago

Which changes?

There in #65. I separate out the String stuff from just the interface stuff.

Stebalien commented 6 years ago

Ah, carrying the multibase with the Cid? I've commented on that PR. I understand that it's a solution to the CID base issue but I'm far from convinced that it's the right one (or even a palatable one).

kevina commented 6 years ago

Ah, carrying the multibase with the Cid? I've commented on that PR. I understand that it's a solution to the CID base issue but I'm far from convinced that it's the right one (or even a palatable one).

I will create a new P.R. with just the string stuff, stay tuned.... see #71

kevina commented 6 years ago

@warpfork I would be interested in how much the extra allocations cost us when we have to get the multihash in the string implementation. Also see #71 which is a continuation and optimization of #47. In particular it avoids an alloc when calling Type() which could be significant depending on how the Cid is used.

whyrusleeping commented 6 years ago

I think theres no reason not to merge this, ya? Let's just do so and use the information in the benchmarks to move forward

warpfork commented 6 years ago

Okay, so I'm gonna get ready to hit the merge button. Reminder, from a purely workflow management perspective, the goal in this PR was to make something mergeable which does not need to be the full final solution -- note the absolute lack of change in the main package -- and can also be kept in history as research notes for why we choose what we do. It shouldn't make merge conflicts with anything else in progress either.

P.S. yeah the uvarint impl speciated for strings in #71... :+1: I bet that's going to be impactful. I didn't write benchmarks here yet either for how much so, but there's no way evading the byte slice conversion there isn't gonna be helpful.