multiformats / go-multihash

Multihash implementation in Go
MIT License
239 stars 57 forks source link

Refactor registry system: no direct dependencies; expose standard hash.Hash; be a data carrier. #136

Closed warpfork closed 3 years ago

warpfork commented 3 years ago

This diff revisits the registry system to use more standard interfaces for hashers, and puts it to work to reduce the transitive dependencies of go-multihash.


Dependencies:

Previously, depending on the go-mulithash package would create direct dependencies to several other modules for their various hash function implementations. This meant that instead of go-multihash being a lightweight, easy-to-accept dependency itself, it became something which would noticably increase the size of your go.mod file, your package graph, your download sizes during development, and most concerningly, your compile output size in final products.

Now, there is a registry system (see the Register function), and the main go-multihash package only populates the registry with hashes that are available from the golang standard library by default. This means you gain no transitive dependencies on other libraries by importing the go-multihash package, and your binaries will not be bloated by hashers you don't use. (Your go.mod file may still show more repos; but they don't end up in your builds unless you actually refer to them).

There are now several new packages in go-multihash/register/*. These can be imported to register the hashes in those packages. If you want all the hashes that were previously available, just make sure to import "go-mulithash/register/all" somewhere in your program. (You can register hashes, too, without making PRs to this library; these packages are just here for convenience and easy use.)

This is a breaking change if you used hashes not found in the golang stdlib, such as blake2 and sha3. However, to update, all you need to do is ensure the relevant go-multihash/register/* package is imported anywhere in your program -- an easy change. A go-multihash/register/all package can be imported to get a hasher registered for all of the same multihash codes as before (but will correspondingly add the dependency weight back too, of course).


Standard hash.Hash:

Previously, go-multihash had its own definition of a HashFunc interface, and only exposed hashing through the multihash.Sum method. The problem with this was these definitions did not support streaming: one had to have an entire chunk of memory loaded at once, in a single contiguous byte slice, in order to hash it.

(A second, admittedly much more minor, problem with this was that one often had to write glue code to turn a hash.Hash into a multihash.HashFunc, and since most of golang uses the standard lib hash.Hash definition already, this was generally avoidable friction.)

Now, the Register function operates in terms of standard hash.Hash, and there is a GetHasher function which can get you a hash.Hash. (Okay, to be more precise, these functions take and return a factory function for a hash.Hash. You get the idea.)

Since the standard hash.Hash interface can operate streamingly, now it's easy to use go-multihash in a streaming way.

The multihash.Sum method works the same as always.


Be a data carrier:

Previously, go-multihash contained checks that any multihash indicator codes being handled were required to have a hash function registered for them. This made it very difficult to use go-multihash in a "forward compatible" way (and it also made a lot of practical bumps for this dependency-extraction refactor).

Now, go-multihash is willing to carry data, even if it doesn't know what kind of hash function would be associated with an indicator code. (Methods that you'd expect to parse things do still parse the varints, making sure they're sanely formatted. They just don't inspect and whitelist the actual integers anymore.)

I removed the ValidCode predicate entirely. It doesn't seem to serve any good purpose anymore.


Other:

I have not touched the Codes and Names maps in this diff. I think we should probably review (and probably remove) these, and instead direct people to use the go-multicodec package instead, which has the two advantages of decoupling registration of an implementation versus simply having a description, and also being automatically generated from the multiformats table. However, I wanted to check on feelings about this before doing the work (especially because they're somewhat entangled with a bunch of the tests in this package, making their removal somewhat nontrivial).

Most of the test files are now package multihash_test. This makes for some colorful diffs, but is not otherwise interesting. The reason for this is because the dependency separation process now requires the tests to import those register/* packages, and to avoid a cycle, that means, well, package mulithash_test.

I think there's probably more work to be done in making this library really shine. For example, in reviewing the Encode function, I see some allocations that look very likely to be avoidable... if the function was redesigned to be more aware of how it's likely to be used. However, I took no action on this, in part because this diff is big enough already, and in part because I think it might be reasonable to re-examine the relationship of this code to go-cid at the same time.

I dropped TestSmallerLengthHashID. It appeared to be testing an API that wasn't actually exported... and the nearest API that is exported (Sum) has a general contract of truncating a hash upon short length, so it was overall unclear what this test should be checking. Review might be needed on this.

The situation for murmur3 is still in need of resolution. It's commented out entirely for now. Questions are noted in the diff.

There's a 'register/miniosha256' package which sets the sha2-256 implementation to a non-stdlib one. If you don't import this package, you still get a sha2-256; it's just the stdlib one. I did not include this in the 'register/all' group. (Maybe it's faster; maybe it's not; but it's definitely not required, and I'm getting some reports it also shows weird on profiles, so I tend to think maybe one should really have to explicitly ask for this one.)

There are a couple other review questions marked with "// REVIEW" in the diff.

warpfork commented 3 years ago

(The test that's currently failing is for murmur3. Input needed on what we're going to do with that thing.)

willscott commented 3 years ago

is this ready for a round of review? are there any loose things that still need to happen?

warpfork commented 3 years ago

Okay! This is finally green.

I believe all comments are addressed.

Also note that I updated a bunch of the transitive dependencies. It was forced (see https://github.com/multiformats/go-multihash/pull/136/commits/877240bb0cbf040292354310a86c5a876c26fe6d / https://travis-ci.com/github/multiformats/go-multihash/builds/219591152 ). But none of our fixtures reported effects, so I presume this to be safe and joyously uninteresting.

Last call for any more reviews?

warpfork commented 3 years ago

I've attempted to push this downstream through go-cid, and here's what that might look like: https://github.com/ipfs/go-cid/commit/26b387af822073023f9c026a4ea68a6349abf20e#diff-88c2c4b0892fb0172322544e3937e937bb18beccecb1b24a1b5028758a43685bR346-R375

There's some questions worth a quick review there. Namely: it's entirely possible to make all the code work without a need for DefaultLengths, but there are some tests which conflict with it. It's not clear to me if those tests are testing something important and relied upon or if they happen to assert what they do incidentally.

Everything else looks fine -- although we do again end up seeing the go-multihash/register/all import now needing to appear in test files, which should probably not be a surprise.

Stebalien commented 3 years ago

There's some questions worth a quick review there. Namely: it's entirely possible to make all the code work without a need for DefaultLengths, but there are some tests which conflict with it. It's not clear to me if those tests are testing something important and relied upon or if they happen to assert what they do incidentally.

We do need DefaultLengths, we use it in quite a few places. You can find these places by:

  1. Updating go-multihash in go-ipfs, lotus, etc.
  2. Seeing what breaks.

Let's make this PR minimally breaking so we can merge it ASAP and move on.

warpfork commented 3 years ago

DefaultLengths is back in.

Here's what propagating through go-cid looks like now, with all tests passing: https://github.com/ipfs/go-cid/pull/118

warpfork commented 3 years ago

I guess the final check here is: do we want to keep github.com/multiformats/go-multihash, the root package of this repo, still dragging in all transitive dependencies, to really really minimize change? E.g. effectively have it import register/all and register/miniosha256?

https://github.com/multiformats/go-multihash/issues/131 did originally describe doing the registration rework in a new package, and keeping the go-multihash root unchanged.

As it stands now, the go-multihash symbols exported are pretty much unchanged, but the code does go all-in on making the dependencies optional, so the behavior does shift. If we're scared of this and the consequent need to add the registration imports in downstream applications, and we'd rather make the dependency minimization an opt-in action that has to be done separately from just updating the library by putting it in a deeper package, we can do that.

... heck, I'll just draft that now. If we introduce a new package, who wants to name it? go-multihash/core? or go-multihash/lite? or just put it in go-mulithash/register?

Stebalien commented 3 years ago

Yeah, let's minimize breakage This library is a bit popular: https://pkg.go.dev/github.com/multiformats/go-multihash?tab=importedby. A "core" package would work (I'd call it that).

I guess the question is: what do you need. This appears to be blocking other work, so I want to make sure we merge that need ASAP.

If you don't really need to reduce dependencies (I'm pretty sure you don't, at least right now) then I'd:

  1. Import these deps from the root for now.
  2. Release this as a 1.0 release.
  3. Immediately start on a 2.0 release that doesn't "register" everything.
  4. Make the 1.0 release depend on the 2.0 release.

Basically, we need unblock this ASAP and we need to avoid causing a meltdown. Let's just go with the easiest approach and introduce a 2.0 (we'll get there eventually regardless).

warpfork commented 3 years ago

Alright. go-multihash/core is now a package, it does the thing. go-mulithash still brings on all the deps unconditionally.

I think I've run out of changes to avoid. go-cid can now be updated to this last commit with literally no code changes.

I'll also bow out on version tagging. Whatever numbers you like, fine by me. (I'm wary of major numbers and the Different Behavior they prompt from go mod, but... what could go wrong, really)

Stebalien commented 3 years ago

I've tested with go-ipfs and the only broken test is the one that tries to use murmur (to make sure we don't allow it).

Stebalien commented 3 years ago

@warpfork could you cut a minor release when ready?

warpfork commented 3 years ago

This is now available within the v0.0.15 tag.