Ambiguous encoding due to underlying design

multiformats / py-multibase

Multibase implementation in Python

MIT License

Ambiguous encoding due to underlying design #11

Closed eth-r closed 1 year ago

eth-r commented 6 years ago

The BaseStringConverter class this library is built on truncates leading zeros, which makes the encoding ambiguous. This is especially noticeable in base-2, where the example test case "yes mani !" encodes to "01111001011001010111001100100000011011010110000101101110011010010010000000100001", while a strict encoding would be "001111001011001010111001100100000011011010110000101101110011010010010000000100001". As a result, "\x00yes mani !" encodes to the same string as "yes mani !", so a round trip of "\x00yes mani !" returns "yes mani !", an erroneous interpretation in many contexts.

More info: https://github.com/multiformats/multibase/issues/34
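
The ambiguity reported above can be reproduced without the library. The following is a minimal sketch, assuming the converter treats the whole input as one big integer (which is how leading zeros would get lost); the function names are hypothetical and are not part of py-multibase. It contrasts that approach with a strict encoding that emits exactly 8 bits per byte.

```python
BASE2_PREFIX = "0"  # multibase prefix character for base2


def int_based_base2_encode(data: bytes) -> str:
    """Hypothetical integer-based encoder: leading zero bits are dropped."""
    number = int.from_bytes(data, "big")
    return BASE2_PREFIX + format(number, "b")


def strict_base2_encode(data: bytes) -> str:
    """Strict encoder: every byte becomes exactly 8 bits."""
    return BASE2_PREFIX + "".join(format(byte, "08b") for byte in data)


def strict_base2_decode(text: str) -> bytes:
    """Inverse of strict_base2_encode; rejects partial bytes."""
    if not text.startswith(BASE2_PREFIX):
        raise ValueError("missing base2 multibase prefix")
    bits = text[len(BASE2_PREFIX):]
    if len(bits) % 8 != 0:
        raise ValueError("bit string is not a whole number of bytes")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    plain = b"yes mani !"
    padded = b"\x00yes mani !"
    # Both inputs collapse to the same integer, hence the same string.
    print(int_based_base2_encode(plain) == int_based_base2_encode(padded))  # True
    # The strict form keeps them apart and round-trips the zero byte.
    print(strict_base2_decode(strict_base2_encode(padded)) == padded)  # True
```

Working on fixed-width groups of bits rather than on a single integer is what keeps the strict variant reversible.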

sauerburger commented 2 years ago

I also encountered this while manually decoding 12D...-style IPFS (or libp2p) peer IDs. IIUC, the leading 1 indicates legacy base58 encoding of an identity multihash. However, passing the result of multibase.multibase.ENCODINGS_LOOKUP['base58btc']() to multihash fails due to the removal of the leading \x00 byte.

FWIW, the py-cid implementation relies on the third-party base58 package and could not switch due to this issue.
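
For the multihash case above, the usual base58btc convention is that each leading zero byte maps to one leading "1" character and back. Below is a minimal standalone sketch of that convention, not the code of py-multibase or of the base58 package; the helper names and the sample bytes are made up for illustration.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58btc_encode(data: bytes) -> str:
    """Encode bytes as base58btc (no multibase prefix), keeping leading zeros."""
    n_zeros = len(data) - len(data.lstrip(b"\x00"))
    number = int.from_bytes(data, "big")
    digits = ""
    while number:
        number, remainder = divmod(number, 58)
        digits = ALPHABET[remainder] + digits
    # One "1" per leading zero byte, then the integer part.
    return "1" * n_zeros + digits


def b58btc_decode(text: str) -> bytes:
    """Decode base58btc (no multibase prefix), restoring leading zero bytes."""
    n_zeros = len(text) - len(text.lstrip("1"))
    number = 0
    for char in text:
        number = number * 58 + ALPHABET.index(char)  # ValueError on bad char
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * n_zeros + body


if __name__ == "__main__":
    # Hypothetical identity-multihash-like payload: 0x00 code, 0x24 length.
    sample = bytes([0x00, 0x24]) + bytes(range(36))
    encoded = b58btc_encode(sample)
    print(encoded.startswith("1"))           # True: the 0x00 byte is kept
    print(b58btc_decode(encoded) == sample)  # True: round trip is lossless
```

With the leading 0x00 byte preserved, the decoded bytes can be handed to a multihash parser unchanged.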

Wind4Greg commented 1 year ago

Note that base64 doesn't have this problem. Simple example where this prevents reversing the transformation for base58btc:

from multibase import encode, decode

# Bytes starting with a zero byte are not correctly recovered
test_bytes = b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t'
print(test_bytes)
print(encode('base58btc', test_bytes))
print(decode(encode('base58btc', test_bytes)))
# Output:
# b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t'
# b'zkA3B2yGe2z4'
# b'\x01\x02\x03\x04\x05\x06\x07\x08\t' # Missing first byte

Any chance that this might get fixed?
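
The base64 remark above can be checked with the standard library alone. This sketch uses Python's base64 module rather than py-multibase, so it has no multibase prefix and keeps padding, which is an assumption made only to keep it self-contained. Base64 maps fixed 6-bit groups to characters instead of converting one big integer, so a leading zero byte survives.

```python
import base64


def base64_round_trip(data: bytes) -> bytes:
    """Encode with standard base64 and decode again."""
    return base64.b64decode(base64.b64encode(data))


if __name__ == "__main__":
    test_bytes = b"\x00\x01\x02\x03\x04\x05\x06\x07\x08\t"
    # The leading zero byte is preserved by the round trip.
    print(base64_round_trip(test_bytes) == test_bytes)  # True
```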

rvagg commented 1 year ago

closing due to inactivity, archiving repo