multiformats / multibase

Self identifying base encodings

Multihash encoding recommendations (base64url, base58) #61

Open awakecoding opened 4 years ago

awakecoding commented 4 years ago

I've been looking for a way to get the benefits of multihash but without base58 encoding. After reading some more I realized that while base58 appears to be the most common encoding for Multihash because of its use in IPFS, the multihash specification doesn't mention base58 (https://tools.ietf.org/html/draft-multiformats-multihash-00).

My understanding is that the best approach would be to avoid forcing a specific base encoding and instead combine multibase with multihash. Is this the recommended option going forward, and will there be a few "recommended" encodings to be used with multihash out of the ones supported by multibase?

For instance, it would make sense to recommend a URL-safe base encoding such as base64url, so that switching away from base58 does not lose one of its useful properties (being safe to embed in URLs). In my case, I dislike base58 encoding because it doesn't align well to byte boundaries, which is why I would rather use multihash with base64url.
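To make the proposal concrete, here is a minimal sketch of a multihash wrapped in multibase base64url, using only the Go standard library. It assumes the sha2-256 function code (0x12) from the multicodec table and the `u` prefix that the multibase table assigns to unpadded base64url. The helper name `multihashBase64URL` is hypothetical and not part of any multiformats library.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

const (
	sha2256Code     = 0x12 // multihash function code for sha2-256
	base64urlPrefix = "u"  // multibase prefix for unpadded base64url
)

// multihashBase64URL hashes data with SHA-256, wraps the digest in a
// multihash (<code><length><digest>) and multibase-encodes the result.
// Hypothetical helper for illustration only.
func multihashBase64URL(data []byte) string {
	digest := sha256.Sum256(data)
	// The code (0x12) and the length (32) are both below 0x80,
	// so each varint occupies exactly one byte.
	mh := append([]byte{sha2256Code, byte(len(digest))}, digest[:]...)
	return base64urlPrefix + base64.RawURLEncoding.EncodeToString(mh)
}

func main() {
	fmt.Println(multihashBase64URL([]byte("Hello World!")))
}
```

Because base64 maps every 3 input bytes to 4 output characters, the encoded length depends only on the input length, which is the byte-alignment property missing from base58.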

Stebalien commented 4 years ago

Is this the recommended option in the future

Yes. While IPFS and libp2p currently use base58 encoded multihashes, we're moving to CIDs. CIDs are multibase encoded "typed" content addresses: https://github.com/multiformats/cid.

awakecoding commented 4 years ago

@Stebalien thanks. For my needs, I think I'll start using multihash (not CIDs) + multibase encoding, using base64url as the default. I wanted all the benefits of multihash, but wasn't really sold on the base58 encoding.

Stebalien commented 4 years ago

Awesome!

I would like to try to briefly sell you on CIDs if you're using hashes for content addressing. By including a few extra bytes to indicate the content type/encoding, you can later choose to change the content type/encoding.

awakecoding commented 4 years ago

@Stebalien do CIDs have a way of representing "no codec"? I want to encode something similar to the output of "sha256sum" on a file, while retaining which hash type has been used, such that it is easy to switch between hash types. I originally thought IPFS would use the hash of the entire file, but instead it is using "DagProtobuf" which is a hash of some merkle tree structure (I'm not very familiar with it). I understand why using something other than the hash of the entire file is good for chunking larger files into smaller chunks, but here I want to store relatively short files (X.509 certificates).

My idea is to store X.509 certificates in a single file store, where each file name is the multihash of its contents. The PKI server would then build reference tables mapping X.509 certificate properties to the corresponding multihash value. Since hashes change over time for X.509 certificates, this system allows me to build tables for multiple hash types at the same time while storing the certificate only once. For instance, if I store files using sha256, nothing prevents me from hashing the certificate in SHA1 and creating a SHA1 to SHA256 mapping table, etc.
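A rough sketch of that layout, with an in-memory map standing in for the file store. Everything here is an assumption made for illustration: the `certStore` type, its method names, and the choice of sha2-256 (code 0x12) as the primary key with sha1 (code 0x11) as the alias.

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// encode builds a multibase base64url string ("u" prefix) from a
// single-byte multihash function code and a digest.
func encode(code byte, digest []byte) string {
	mh := append([]byte{code, byte(len(digest))}, digest...)
	return "u" + base64.RawURLEncoding.EncodeToString(mh)
}

// certStore is a hypothetical in-memory stand-in for the file store.
type certStore struct {
	blobs   map[string][]byte // sha2-256 multihash -> certificate bytes
	aliases map[string]string // other multihash (here sha1) -> primary key
}

func newCertStore() *certStore {
	return &certStore{blobs: map[string][]byte{}, aliases: map[string]string{}}
}

// put stores the certificate once under its sha2-256 multihash and
// records a sha1 alias pointing at the same entry.
func (s *certStore) put(cert []byte) (primary, alias string) {
	d256 := sha256.Sum256(cert)
	d1 := sha1.Sum(cert)
	primary = encode(0x12, d256[:])
	alias = encode(0x11, d1[:])
	s.blobs[primary] = cert
	s.aliases[alias] = primary
	return primary, alias
}

// get resolves either a primary key or an alias to the stored bytes.
func (s *certStore) get(id string) ([]byte, bool) {
	if primary, ok := s.aliases[id]; ok {
		id = primary
	}
	cert, ok := s.blobs[id]
	return cert, ok
}

func main() {
	store := newCertStore()
	primary, alias := store.put([]byte("placeholder certificate bytes"))
	fmt.Println("stored as:", primary)
	_, found := store.get(alias)
	fmt.Println("found via sha1 alias:", found)
}
```

Since each identifier carries its own hash code, the alias table can mix hash types without a separate column naming the algorithm.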

Long story short, the encoding part of CIDs is not really something I need, and most likely something I will never use, given the relatively small file sizes. I want to make it easy to recover the multihash from the complete file, while still improving on the unprefixed hashes produced by tools like sha256sum.

awakecoding commented 4 years ago

In a similar use case, it is very common for package managers or scripts that download complete files from CDNs or external sources to include an expected hash of the file for integrity checks. In most cases, the argument includes the hash type explicitly, like MD5 or SHA256. I don't think these kinds of tools really need anything beyond the hash of the complete file, but they would likely benefit from being able to support multiple hash types at the same time without some custom way of specifying which hashing algorithm was used. Instead of "md5sum" or "sha256sum" we could have a simple "multihash" field, with a few hashing algorithms accepted.
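As a sketch of what such a "multihash" field could look like on the verifying side, the code below accepts a binary multihash and picks the hash function from its first byte. It is deliberately limited: it assumes single-byte varints (so sha1 0x11, sha2-256 0x12 and sha2-512 0x13 work, but md5, whose code needs a two-byte varint, does not) and it rejects truncated digests. The function names `sum` and `verify` are made up for this example.

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"crypto/sha256"
	"crypto/sha512"
	"errors"
	"fmt"
	"hash"
)

// hashers maps multihash function codes to hash constructors.
var hashers = map[byte]func() hash.Hash{
	0x11: sha1.New,   // sha1
	0x12: sha256.New, // sha2-256
	0x13: sha512.New, // sha2-512
}

// sum returns the binary multihash (<code><length><digest>) of data.
func sum(code byte, data []byte) ([]byte, error) {
	newHash, ok := hashers[code]
	if !ok {
		return nil, fmt.Errorf("unsupported hash code 0x%x", code)
	}
	h := newHash()
	h.Write(data)
	digest := h.Sum(nil)
	return append([]byte{code, byte(len(digest))}, digest...), nil
}

// verify recomputes the digest named by the multihash and compares it.
func verify(data, multihash []byte) error {
	if len(multihash) < 2 {
		return errors.New("multihash too short")
	}
	if int(multihash[1]) != len(multihash)-2 {
		return errors.New("declared length does not match digest")
	}
	want, err := sum(multihash[0], data)
	if err != nil {
		return err
	}
	if !bytes.Equal(want, multihash) {
		return errors.New("digest mismatch")
	}
	return nil
}

func main() {
	file := []byte("downloaded file contents")
	expected, _ := sum(0x13, file) // the publisher chose sha2-512
	fmt.Println("verification error:", verify(file, expected))
}
```

The caller never names the algorithm: switching the published value from sha2-256 to sha2-512 requires no change to the verifying code.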

Stebalien commented 4 years ago

do CIDs have a way of representing "no codec"?

Yes. There's a "raw data" multicodec (https://github.com/multiformats/multicodec/blob/master/table.csv#L34). Even better, CIDs pointing to "raw data" are valid IPFS files. That is, ipfs cat /ipfs/CID_OF_RAW_DATA will work.

For example, the CID of "Hello World!" is bafkreid7qoywk77r7rj3slobqfekdvs57qwuwh5d2z3sqsw52iabe3mqne.
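That string can be reproduced without any IPFS library. The sketch below assumes the usual CIDv1 layout (version 0x01, content codec 0x55 for raw, then the sha2-256 multihash) and the multibase `b` prefix for lowercase unpadded base32. The function name `rawCIDv1` is only illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

// rawCIDv1 returns the base32 multibase string of a CIDv1 that
// addresses data as a raw block hashed with sha2-256.
func rawCIDv1(data []byte) string {
	digest := sha256.Sum256(data)
	// <cid-version><content-codec><hash-code><digest-length>,
	// each a single-byte varint, followed by the digest itself.
	cidBytes := append([]byte{0x01, 0x55, 0x12, byte(len(digest))}, digest[:]...)
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)
	return "b" + strings.ToLower(enc.EncodeToString(cidBytes))
}

func main() {
	fmt.Println(rawCIDv1([]byte("Hello World!")))
	// prints bafkreid7qoywk77r7rj3slobqfekdvs57qwuwh5d2z3sqsw52iabe3mqne
}
```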

awakecoding commented 4 years ago

@Stebalien does this mean IPFS actually supports raw file hashes by default, such that if I hash my file using sha256sum I could possibly convert it to a CIDv1 and find the exact same contents in IPFS without using the DagProtobuf? If there is a really simple conversion path between raw file hashes and CIDv1, I could see potential value. Otherwise, there still exists a relatively straightforward conversion path between multihash + multibase and CIDv1 (decode + insert a few bytes).
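For reference, the "decode + insert a few bytes" path might look like the following sketch. It assumes the input is a multibase base64url string (`u` prefix) wrapping a multihash with single-byte varints, and that the target is a raw-codec CIDv1 in base32. The name `multihashToRawCID` is hypothetical.

```go
package main

import (
	"encoding/base32"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// multihashToRawCID converts a base64url multibase multihash into a
// base32 CIDv1 string that uses the raw codec.
func multihashToRawCID(mbHash string) (string, error) {
	if !strings.HasPrefix(mbHash, "u") {
		return "", errors.New("expected multibase prefix 'u' (base64url)")
	}
	mh, err := base64.RawURLEncoding.DecodeString(mbHash[1:])
	if err != nil {
		return "", err
	}
	// Sanity check: <code><length><digest> with a matching length byte.
	if len(mh) < 2 || int(mh[1]) != len(mh)-2 {
		return "", errors.New("malformed multihash")
	}
	// Insert the CID version (0x01) and the raw codec (0x55) in front.
	cidBytes := append([]byte{0x01, 0x55}, mh...)
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)
	return "b" + strings.ToLower(enc.EncodeToString(cidBytes)), nil
}

func main() {
	// Multihash (sha2-256) of "Hello World!" in multibase base64url.
	cid, err := multihashToRawCID("uEiB_g7Flf_H8U7ktwYFIodZd_C1LH6PWdyhK3dIAEm2QaQ")
	fmt.Println(cid, err)
}
```

The reverse direction is symmetric: decode the base32 string, drop the two leading bytes, and re-encode the remaining multihash in whichever base is preferred.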

awakecoding commented 4 years ago

Just a suggestion: maybe document clearly how CIDv1 can be used for the same kind of use cases I described above. By this I mean all cases where existing tooling performs hashing on complete files, especially for tools that download files and then check the contents for integrity. Those tools will likely never go beyond full file hashing, so if it looks like CIDv1 is normally only used with more advanced cases like IPFS and DagProtobuf, it won't look like a legitimate use case to use it only for full file hashes.

Stebalien commented 4 years ago

does this mean IPFS actually supports raw file hashes by default, such that if I hash my file using sha256sum I could possibly convert it to a CIDv1 and find the exact same contents in IPFS without using the DagProtobuf?

Yes. However, if your file is over ~1MiB, bitswap will refuse to transfer it because it won't be able to incrementally validate it.

If there is a really simple conversion path between raw file hashes and CIDv1, I could see potential value.

I'm not sure if there's a pre-built tool to do this but yes. All you need to do is take the raw hash digest and prepend the right prefix:

import (
  cid "github.com/ipfs/go-cid"
  mh "github.com/multiformats/go-multihash"
)

// rawCIDBytes returns the binary CIDv1 for a raw hash digest by
// prepending the CID prefix (version, codec, hash type, digest length).
func rawCIDBytes(hashDigest []byte) []byte {
  return append(cid.Prefix{
    Version:  1,
    Codec:    cid.Raw,
    MhType:   mh.SHA2_256, // your hash type.
    MhLength: len(hashDigest),
  }.Bytes(), hashDigest...)
}

Stebalien commented 4 years ago

Just a suggestion:

I agree. Given that the main issue here is IPFS interoperability, mind filing a feature request against https://github.com/ipfs/docs?