ipfs / go-cid

Content ID v1 implemented in go
MIT License
156 stars 47 forks source link

add CidFromReader #126

Closed rvagg closed 3 years ago

rvagg commented 3 years ago

We now have 2 CID decoder functions in go-car that really belong here and we should move (and properly test) them. The basic functionality is that I have either an io.Reader or just a []byte but I don't know how big the CID is but I'm pretty sure I know where the start of it is. I should be able to extract the CID and get an offset to the end of the parsed CID bytes.

ReadCid(buf []byte) (cid.Cid, int, error) - read a Cid from buf and tell me the offset after read: https://github.com/ipld/go-car/blob/71cfa2fc2a619d646606373c5946282934270bd4/util/util.go#L22

ReadCid(store io.ReaderAt, at int64) (cid.Cid, int, error) - read a Cid from store and tell me the offset after read: https://github.com/ipld/go-car/blob/wip/v2/v2/internal/io/cid.go (wip/v2 branch, not yet in master).

We have decodeFirst(bytes) in js-multiformats to serve a very similar purpose. Having it in the core library has uncovered some other uses outside of CAR decoding too.

Suggestions for more explicit naming of these functions welcome!

mvdan commented 3 years ago

I think we should just expose the []byte based API; one can always write the ReaderAt version in terms of it, assuming we can somewhat reliably predict how many bytes the CID will be. I guess technically the hash could be many many bytes, but in practice I imagine one could read a chunk of bytes (like 1024) and be pretty sure that the whole CID is contained within that read buffer.

rvagg commented 3 years ago

True, although the increasing use of identity CIDs makes this a bit messy. But, I now recall that I coupled decodeFirst() with another utility function in js-multiformats, inspectBytes() which could tell you how many you needed if you're in a situation where you might need to be concerned: https://github.com/multiformats/js-multiformats/blob/2fb9b6815d4b9cbb8212e2f1975b355d2d370e4b/src/cid.js#L312-L324

It only needs the first few bytes, a maximum of maybe 10. So it's a nice combo that could be used to safely deal with the flexibility challenges.

A maximal implementation could look like this and could also be used internally by a ReadCid() to do part of the decoding since it needs all of these things which are gleaned from the first few bytes anyway (at least this is what we do in JS now):

InspectBytes(buf []byte) (version int, codec int, multihashCode int, digestSize int, multihashSize int, size int)

That's probably too many return values and probably should be a struct and then I guess there'll be arguments about too many allocations, so a minimal form could just return the size I suppose.

rvagg commented 3 years ago

Thanks to @ribasushi for reminding me of a relevant thread on identity lengths just now! https://github.com/multiformats/multihash/issues/130

mvdan commented 3 years ago

What you say makes sense; an "inspect" API for the cases where one wants to support potentially huge CIDs, and a "read" API for the cases where one doesn't need to support them - so erroring with e.g. io.ErrUnexpectedEOF would be fine.

arguments about too many allocations

Returning a non-pointer struct shouldn't allocate; it should be practically equivalent to returning the list of parameters. It's more of a question of which is nicer in terms of API. You could also deduplicate the types in the return parameters, like:

InspectBytes(buf []byte) (version, codec, multihashCode, digestSize, multihashSize, size int)

Stebalien commented 3 years ago

We already have CidFromBytes which I believe is what you're looking for. We could also add a CidFromReader if you need it to work with an io.Reader.

rvagg commented 3 years ago

mm, you're right, I thought that was just for a strictly correct length number of bytes but that seems to do mostly what's needed here. We should try and use it in go-car to see what the limits are and what else we need to add.

mvdan commented 3 years ago

Confirming that CidFromBytes works for the cases where we're decoding from a buffer: https://github.com/ipld/go-car/pull/131

That said, I think a CidFromReader would still be very useful. We need that for stream-like reading of CARv1 files, for example: https://ipld.io/specs/transport/car/carv1/#cid

Right now we implement that in an internal package inside go-car, which is not ideal. Plus, we didn't implement CIDv0 decoding in that copy, so that's biting @raulk when he tries to use our carv2 module :facepalm:

I'll send a PR for CidFromReader shortly.

mvdan commented 3 years ago

I'm also not discarding having an "inspect" API in the future, but it's not useful for go-car right now, and it doesn't consume a fixed number of bytes, so... It's not clear to me that it's a clear win and needed right now. I'd put it off until someone actually needs it.

mvdan commented 3 years ago

Now at https://github.com/ipfs/go-cid/pull/127.