Follow the spec for decoding a string CID

alanshaw commented 4 years ago

https://github.com/multiformats/cid#decoding-algorithm

When decoding a CIDv0 string we just try to decode with base58btc, when we should first check the string begins with Qm and is 46 characters long.

agustinmessina commented 4 years ago

Hi, I would like to help with this issue.

I read the decoding algorithm and it says "If it's 34 bytes long with the leading bytes [0x12, 0x20, ...], it's a CIDv0". Looking at the tests, all the inputs for CIDv0 start with Qm and are 46 characters long.

Could you provide me a valid CIDv0 that doesn't start with Qm and/or is not 46 characters long? So I can write some unit tests and try to solve this issue.

alanshaw commented 4 years ago

This is for case (1) of https://github.com/multiformats/cid#decoding-algorithm where the CID is a string.

agustinmessina commented 4 years ago

Sorry, maybe I didn't understand the issue.

You said we should check if the string begin with Qm and is 46 chars long. If we add that check and the string doesn't pass, what should we do?

alanshaw commented 4 years ago

Follow the part of the spec that says "Otherwise, decode it according to the multibase spec..."

agustinmessina commented 4 years ago

As I understand it, a cid can be v0 if:

starts with Qm and is 46 chars or
is 34 bytes long with the leading bytes [0x12, 0x20, ...]

That's why I'm asking for a cid example that fullfils the second clause. I want to write a unit test with that cid as an input

rvagg commented 4 years ago

> new CID('QmV88khHDJEXi7wo6o972MZWY661R9PhrZW6dvpFP6jnMn').multihash.join(',')
'18,32,100,204,247,215,142,116,181,80,84,227,13,252,171,193,40,208,244,156,33,84,136,182,185,131,50,55,21,110,224,183,153,169'
> new CID('QmV88khHDJEXi7wo6o972MZWY661R9PhrZW6dvpFP6jnMn').multihash.length
34
> new CID('QmV88khHDJEXi7wo6o972MZWY661R9PhrZW6dvpFP6jnMn').toString()
'QmV88khHDJEXi7wo6o972MZWY661R9PhrZW6dvpFP6jnMn'
> new CID('QmV88khHDJEXi7wo6o972MZWY661R9PhrZW6dvpFP6jnMn').toString().length
46

v0 is encoded in byte form as a pure multihash, hence that rule #2. v1 adds two varint's at the front of the byte form.

I'm doing byte-decoding of CIDs for my JS CAR file format library which handles this: https://github.com/rvagg/js-datastore-car/blob/9eb82a9c88ac4051f8c93901da366f82f910e49c/lib/coding-browser.js#L66-L71 Some Go code doing same thing: https://github.com/ipfs/go-car/blob/f188c0e24291401335b90626aad4ba562948b525/util/util.go#L23-L26

I don't think this issue is calling for this second case, it's just about being more strict with checking the first two characters rather than jumping straight into a straight base58 decode.

Although, it would have been great to be able to call into js-cid for the decode I did in js-datastore-car, go-car has the same problem where it's doing work that should be exposed by go-cid. "Here's a byte array, it could be v0 or v1, figure it out and give me a CID". The problem we have with the CAR format (and likely other forms of on-the-wire data), is that we don't have a neat bundle of bytes to check, you have a stream that you have to chew at, and test as you go, it could be 32 bytes, or it could be a lot longer with some crazy-long hash, but you'll find out when you: figure out if it's not a v0 so has some varints at the front that will tell you about the length.

jacobheun commented 3 years ago

This should no longer be an issue due to the way decoding is now handled. Closing this.

multiformats / js-cid

Follow the spec for decoding a string CID #103