Make base32 CIDv1 the default for go-ipfs

kyledrake commented 6 years ago

I understand there's a switch to CIDv1 soon. I think go-ipfs should use lowercased base32 (rfc4648 - no padding - highest letter) as the default multibase.

The reason this encoding is preferable: it's the one encoding that will work with subdomains (RFC1035 + RFC1123). The restrictions are: case-insensitive, a-z0-9, and at most 63 bytes.
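For what it's worth, the label restrictions above can be sketched as a small check. The helper name `isValidDnsLabel` is made up for illustration, and the sample CID is the lowercased base32 example that shows up later in this thread. The regex also allows interior hyphens, which RFC 1123 permits but base32 never produces.

```javascript
// Hypothetical helper: checks whether a string can serve as a single DNS label.
function isValidDnsLabel(label) {
  // RFC 1035 caps a label at 63 octets; letters, digits and interior hyphens
  // are allowed, and comparison is case-insensitive.
  return (
    label.length >= 1 &&
    label.length <= 63 &&
    /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/i.test(label)
  );
}

const base32Cid = "bafybeiczsscdsbs7ffqz55asqdf3smv6klcw3gofszvwlyarci47bgf354";
console.log(isValidDnsLabel(base32Cid));      // true (59 characters)
console.log(isValidDnsLabel("a".repeat(64))); // false (over the 63-byte cap)
```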

For a slight increase in length, you reap enormous benefits:

The ability to do proper security origins for the HTTP gateway with subdomains (cidv1abcde.dweb.link). This is very important if we want to handle reports with Google's safe browsing system (which is designed for origins). With the current design, all content is on the same browser origin, and a single phishing/malware report on any of the IPFS gateways (hosted by us or someone else) will make web browsers block every single thing on the origin with a giant red warning message until it's cleared up with Google (which from experience can take several days!)
Root paths are in the right place, which dramatically improves compatibility with existing web sites that tend to do a lot of this: <img src="/rootimg.jpg">
Allows us to register dweb.link (and ipfs.io, etc.) to the Public Suffix List, which will prevent the sandboxed content from reading/manipulating cookies on the parent domain (and on other cidv1 subdomains).
Opens up the ability for go-ipfs to do HTTP Host Header parsing and automatic Let's Encrypt support (if we wanted to), so anyone can set up a public IPFS gateway without additional software. Once Let's Encrypt gets their wildcard cert domains shipped (Dec 2017), this could be a fully automated process. Otherwise something like nginx would be needed (I could write an example nginx.conf that people could use for it).
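A rough sketch of the Host header parsing idea, assuming a gateway that serves each CID from its own subdomain. `cidFromHost` is a hypothetical helper, not anything go-ipfs ships, and the gateway domain is passed in by the caller.

```javascript
// Hypothetical helper: pull the CID label out of a Host header such as
// "<cid>.dweb.link" or "<cid>.dweb.link:8080".
function cidFromHost(hostHeader, gatewayDomain) {
  const host = hostHeader.toLowerCase().split(":")[0]; // drop an optional port
  const suffix = "." + gatewayDomain.toLowerCase();
  if (!host.endsWith(suffix)) return null;
  const label = host.slice(0, -suffix.length);
  // Accept only a single label directly under the gateway domain.
  return label.length > 0 && !label.includes(".") ? label : null;
}

console.log(cidFromHost("cidv1abcde.dweb.link", "dweb.link")); // "cidv1abcde"
console.log(cidFromHost("dweb.link", "dweb.link"));            // null
```

Because the label is lowercased before use, this only round-trips for a case-insensitive encoding such as base32.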

It should use lowercase base32 characters by default, so that it's consistent with subdomain usage (all the browsers will force lowercase). IIRC the RFC doesn't care if it's lowercased, I think people just default to upper case for legacy reasons.
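To make the proposal concrete, here is a minimal sketch of RFC 4648 base32 with a lowercase alphabet and no `=` padding. The function name is mine, and this is not how go-ipfs or go-multibase implements it. It also leaves out the one-character multibase prefix that a full CID string carries.

```javascript
// RFC 4648 base32 alphabet, lowercased; output carries no "=" padding.
const ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

function base32Encode(bytes) {
  let out = "";
  let buffer = 0; // bits read but not yet emitted
  let bits = 0;   // number of valid bits in buffer
  for (const byte of bytes) {
    buffer = (buffer << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      out += ALPHABET[(buffer >>> (bits - 5)) & 31];
      bits -= 5;
    }
    buffer &= (1 << bits) - 1; // keep only the leftover bits
  }
  // Left-align any trailing bits in a final 5-bit group.
  if (bits > 0) out += ALPHABET[(buffer << (5 - bits)) & 31];
  return out;
}

const input = Uint8Array.from("foobar", (c) => c.charCodeAt(0));
console.log(base32Encode(input)); // "mzxw6ytboi"
```

The sample is the RFC 4648 test vector for "foobar" (`MZXW6YTBOI======`), lowercased and with the padding dropped.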

Obviously an abstraction layer could be written that converts between base32 and something else for use with web gateways and then have a different default, but I think it would be less confusing for end users to use one default: the one that will let origins in browsers work.

This approach shouldn't be a problem for webextension plugins, but @lidel feel free to chime in.

Further reading: https://github.com/neocities/hshca

lidel commented 6 years ago

Good arguments, especially the one about GoogleSafeBrowsing's false-positives for public gateways 👍

CIDv1 format is strongly related to discussion at ipfs/in-web-browsers: Tackle identifying origins with (or without?) fs: paths.

I was unable to find a definitive, final decision on which exact encoding will be used, apart from @lgierth initially pondering "base16 or base32" and @samholmes suggesting base32 with Crockford's Encoding.

Was the decision made elsewhere? If not, this ticket provides a good opportunity to do so 🔧

daviddias commented 6 years ago

Thank you for creating this issue, @kyledrake. I agree with your proposal, we can take the opportunity that we are bringing CID to the world for the first time to get base32 as the new default.

If not handled correctly internally (i.e. using the string format instead of the binary format), it will add significant overhead, but that is just something we can change internally to make sure that we use memory efficiently.

ghost commented 6 years ago

I was unable to find a definitive, final decision on which exact encoding will be used, apart from @lgierth initially pondering "base16 or base32" and @samholmes suggesting base32 with Crockford's Encoding.

I'm strongly in favour of making base32 the general default encoding for CIDs everywhere. We need base32 for the ipfs:// URL scheme, and it'd suck if people had to deal with different CID encodings, or even have to use converter tools.

ghost commented 6 years ago

And I think we haven't had any decision on it -- we just stuck with base58 as that was the original encoding used from the beginning.

daviddias commented 6 years ago

we just stuck with base58 as that was the original encoding used from the beginning.

Yeah, that was pretty much how the decision got made. Still in time to change though.

ghost commented 6 years ago

Still in time to change though.

Well I'm all for it :):)

samholmes commented 6 years ago

Notes on Base 32 Encoding

What would it take to get a new base added to the multibase table? Specifically, what would it take to add Crockford's Encoding to the table. As of commenting, it appears RFC4648 and z-base-32 are the only base 32 encodings included in the multibase spec.

An added reason to push for Crockford's Base32 is that it meets the same criteria as Base58Check, the base58 encoding Bitcoin uses for bitcoin addresses:

// Why base-58 instead of standard base-64 encoding?
// - Don't want 0OIl characters that look the same in some fonts and
//   could be used to create visually identical looking account numbers.
// - A string with non-alphanumeric characters is not as easily accepted as an account number.
// - E-mail usually won't line-break if there's no punctuation to break at.
// - Doubleclicking selects the whole number as one word if it's all alphanumeric.

It seems to me that Crockford's Base32 naturally fits the same goals as Base58Check, with the added feature of being case-insensitive.
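For readers who haven't seen it, a small sketch of what sets Crockford's alphabet apart: it drops I, L, O and U, and on decoding it folds the look-alike letters back onto digits. The helper name `crockfordValue` is mine, and it handles one character at a time.

```javascript
// Crockford's base32 symbols: digits, then letters without I, L, O and U.
const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Map one typed character to its 5-bit value. Decoding is case-insensitive
// and folds look-alikes: O reads as 0, I and L read as 1.
function crockfordValue(ch) {
  const c = ch.toUpperCase().replace("O", "0").replace(/[IL]/, "1");
  return CROCKFORD.indexOf(c); // -1 for symbols outside the alphabet, such as U
}

console.log(crockfordValue("o")); // 0
console.log(crockfordValue("l")); // 1
console.log(crockfordValue("z")); // 31
```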

ghost commented 6 years ago

Let's use whatever base32 variant Javascript and (less important) other programming languages use as their default base32. (I assume it's Crockford's)

samholmes commented 6 years ago

Notes on URLs and URI Schemes

From what I can tell, there is no obvious direction for solving issues surrounding URI Schemes and browser origin policies coupled with them thus far. However, my rough proposal is up for further commenting.

However, my inclination is to specify an alternative format and standard from URI. Then, leave it up to implementations to bridge this new format to a purposed URI scheme. Although a hack at the implementation level, it would open up an opportunity to re-think what a web address could be. Maybe a multiresource standard should be defined and added to the multiformats basket?

samholmes commented 6 years ago

@lgierth I don't know if there is a default base32 encoding in Javascript. If you would consider Javascript's native toString Number method:

var a = []; for (var i = 0; i < 32; i++) a.push((i).toString(32))
console.log(a);
// (32) ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v"]

It appears toString uses Base 32 Encoding with Extended Hex Alphabet from RFC4648.
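A quick check of that observation, as a sketch. Keep in mind that `Number.prototype.toString(32)` formats a number positionally, so it shares the Extended Hex alphabet (lowercased) but is not a byte-string encoder: leading zero bytes vanish and precision is limited to what a Number can hold.

```javascript
// RFC 4648 "Extended Hex" base32 alphabet.
const BASE32HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

// Digits produced by toString(32) for the values 0..31.
const fromToString = Array.from({ length: 32 }, (_, i) => i.toString(32)).join("");

console.log(fromToString === BASE32HEX.toLowerCase()); // true
console.log(parseInt("v", 32));                        // 31
```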

Other than this, the Javascript community modules include many variants of base32 encodings; among them is Crockford's. So, it's safe to say that it's not an obscure encoding at the least.

kevina commented 6 years ago

@samholmes

What would it take to get a new base added to the multibase table?

I am not sold that Crockford's Base32 is better than rfc4648 (which we already use in the flatfs datastore), but I don't see any problem with adding an entry to the table and implementing it in go-multibase. Step one would be to open an issue here: https://github.com/multiformats/multibase/issues.

kyledrake commented 6 years ago

Support for crockford base32 (base32check?) seems fairly widespread:

Since there seems to be a strong preference for it, I hereby revise the proposal to use crockford base32.

kyledrake commented 6 years ago

Worth noting is that there are several different flavors of base32, including (my personal favorite) one that Nintendo games used that was designed to avoid profanity.

I'm kind of indifferent as to which version gets used. I chose RFC because it's a standard, it's been around a while, nginx-misc-module supports it, and it probably has the widest support across all programming languages. My only strong preference here is that it's a variation most programming languages already support, so we can minimize devs having to re-invent wheels.

Crockford seems to fit the bill more-or-less as well as RFC, which is my rationale for being OK with using it.

@kevina would you have very strong objections to crockford being used by default by go-ipfs with cidv1?

kevina commented 6 years ago

@kyledrake concerning some of the issues (in particular the use of cidv1abcde.dweb.link), have a look at https://github.com/ipfs/go-ipfs/issues/1678#issuecomment-157478515. The full issue is rather long but contains lots of useful context on why we currently use /ipfs/Qm.../hash

kevina commented 6 years ago

@kyledrake if we switch to using Base32 I do not have any strong objection to using crockford over the RFC one. The only reason I would choose the RFC one is because it is a standard and more likely to have an implementation available as part of the language.

What I do have a slightly stronger objection to is switching to Base 32 from Base 58, due to the increase in length. Let's see how the length progresses with the various proposed changes:

| What | Length (chars) | Increase vs CidV0 | Example |
|---|---|---|---|
| CidV0 | 46 | | `QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn` |
| CidV1 | 49 | +6.5% | `zdj7WbTaiJT1fgatdet9Ei9iDB5hdCxkbVyhyh8YTUnXMiwYi` |
| CidV1, Base32 | 59 | +28% | `BAFYBEICZSSCDSBS7FFQZ55ASQDF3SMV6KLCW3GOFSZVWLYARCI47BGF354` |
| CidV1, Blake2b-256 | 52 | +13% | `zDMZof1kvswQMT8txrmnb3JGBuna6qXCTry6hSifrkZEd6VmHbBm` |
| CidV1, Blake2b-256, Base32 | 62 | +35% | `BAFYKBZACEBUGFUTJIR6QIE7APO5SHPRY32RUWFI762UYTD5G3U2GK7TPSCNDQ` |

So if we ultimately go with CidV1 using Blake2b-256 and Base32 as the default, the length of the Cid string will increase by 35%. That is a non-trivial amount, as opposed to the (I think) original plan of switching to CidV1 using the same sha256 hash, which provides a minor increase of 3 characters or 6.5%.

However, if everyone else is okay with this length increase I am not going to block a move to using Blake2b-256 and/or Base32.
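The percentages in the comparison above are plain arithmetic on the string lengths. A small sketch that reproduces them; the object keys are just labels I picked for the four candidates.

```javascript
// Lengths in characters, copied from the comparison above.
const cidV0Length = 46;
const lengths = { cidV1: 49, cidV1Base32: 59, blake2b: 52, blake2bBase32: 62 };

// Relative growth over the 46-character CidV0 string, in percent.
function increasePercent(length) {
  return ((length - cidV0Length) / cidV0Length) * 100;
}

for (const [name, length] of Object.entries(lengths)) {
  console.log(name, "+" + increasePercent(length).toFixed(1) + "%");
}
```

To one decimal place this gives 6.5%, 28.3%, 13.0% and 34.8%, which round to the figures quoted.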

samholmes commented 6 years ago

However, if everyone else is okay with this length increase I am not going to block a move to using Blake2b-256 and/or Base32.

I'm okay with the increase in length. If I am not mistaken, the trade-off would be making it easier to use the same CID within an /ipfs/<CID> address and a URL address. 😃

ghost commented 6 years ago

I'm very comfortable trading increased length for increased portability. Not making base32 or base16 the default means that the browser UX of IPFS will suffer. Even if we skip the <hash>.dweb.link idea, we need base32 CIDs for ipfs://<hash>, and not being able to paste CIDs from go-ipfs into the browser would be a little catastrophe :(

Support for crockford base32 (base32check?) seems fairly widespread:

I'd be more interested in what stdlib-type libraries use, rather than some individual's library. Random data points: golang's encoding/base32 uses RFC 4648, and the coreutils base32 command does too.

Could someone check what other important libraries and tools use, so that we get a small survey?

ghost commented 6 years ago

On a different note, we should default to lowercase base32 for readability (and of course accept reading both uppercase and lowercase).
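Accepting both cases on input is cheap: fold to lowercase before looking symbols up. A minimal sketch of an unpadded RFC 4648 decoder along those lines; the function name is illustrative, and it ignores the multibase prefix.

```javascript
// RFC 4648 base32 alphabet, lowercased.
const ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

function base32Decode(text) {
  const bytes = [];
  let buffer = 0; // bits read but not yet emitted
  let bits = 0;   // number of valid bits in buffer
  for (const ch of text.toLowerCase()) { // accept upper- and lowercase alike
    const value = ALPHABET.indexOf(ch);
    if (value === -1) throw new Error("invalid base32 character: " + ch);
    buffer = (buffer << 5) | value;
    bits += 5;
    if (bits >= 8) {
      bytes.push((buffer >>> (bits - 8)) & 0xff);
      bits -= 8;
      buffer &= (1 << bits) - 1; // keep only the leftover bits
    }
  }
  return Uint8Array.from(bytes); // trailing bits (fewer than 8) are dropped
}

console.log(String.fromCharCode(...base32Decode("MZXW6YTBOI"))); // "foobar"
console.log(String.fromCharCode(...base32Decode("mzxw6ytboi"))); // "foobar"
```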

kevina commented 6 years ago

Any objections I have to the increased length are mild. It's just that things like an increase in length can creep up on you, and at some point a few years later we end up with keys 2-3 times the length of the original. I'm not saying it will happen, but I want to explain where my (mild) objection is coming from.

@lgierth why won't our existing base (base58btc) work if you go with ipfs://<hash>? A pointer to another issue or documentation is fine.

kevina commented 6 years ago

Also, I agree with using lowercase as the default.

Stebalien commented 6 years ago

In general, browsers assume that security origins are case-insensitive. With ipfs://hash, hash is the security origin.

ghost commented 6 years ago

As per the WHATWG URL spec, the hash in ipfs://<hash> is a domain, which needs to be a valid label according to RFC 1035.

That's why @kyledrake made hshca

kevina commented 6 years ago

@Stebalien @lgierth thanks

kevina commented 6 years ago

So, if we do make base32 the default so that we can represent them on the domain component of the URL, the question I have is: How will we reference CidV0 objects, since we can't completely eliminate them?

I created an issue to discuss this https://github.com/ipfs/go-cid/issues/34.

mib-kd743naq commented 6 years ago

I know I am super-late to the party, but is there a reason you are not considering base36 (as opposed to base32)? A fully "framed" blake2b 256 bit CIDv1 is 38 bytes long.

In base32 (due to padding specifics) this is 62 characters.
In GMP base36 this is 38 * log(256) / log(36) ~ 59 characters.

While the saving of 3 characters is not huge, it's still something... And all the implementations have a GMP engine due to needing to support base58 anyway. An upside is that the question of which base32 padding scheme is used goes away entirely.
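The length estimate above can be sketched as a formula: the number of symbols needed is at most ceil(bytes * 8 / log2(base)). The helper below is illustrative only. My reading, which is an assumption, is that the 62 quoted for base32 is 61 data characters plus the one-character multibase prefix, while the base36 figure counts data characters only.

```javascript
// Upper bound on the number of symbols needed to write byteCount bytes in a
// given base. Values with leading zero bits can come out shorter in bases
// that are not a power of two.
function encodedLength(byteCount, base) {
  return Math.ceil((byteCount * 8) / Math.log2(base));
}

console.log(encodedLength(38, 32)); // 61
console.log(encodedLength(38, 36)); // 59
```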

kevina commented 6 years ago

@mib-kd743naq base 32 is more standard. Also computing base 32 (and base 64) is far cheaper than base 36 or base 58 because non-power of two bases require expensive integer division. The tiny amount of space saved is not worth it.

samholmes commented 6 years ago

@mib-kd743naq I'd argue another reason for base32 over base36†: Confusion over look-alike characters (I, L, and O can be confused with the number symbols for one and zero). This is why base58 was chosen for Bitcoin (https://en.bitcoin.it/wiki/Base58Check_encoding) instead of something like base64, and I assume it is the reason why IPFS adopted base58 as well.

My argument for Crockford's encoding for base32 stands over older encoding standards like RFC 4648. I think that Crockford's encoding is the closest equivalent to the goals for Base58Check while having the added property of being case-insensitive. It's also nice to have digit equivalents shared between decimals and hexadecimals: 0-9 symbols are the same in all three bases, and a-f is the same in base16 and base32 (when base32 is encoded using Crockford's encoding). This means a single symbol can represent the same value in all three bases (10, 16, and 32).

Side note: Maybe in the future, humans will have the capacity to read base 32 and base 16 numbers as naturally as they read base 10.

Edit:

† Mistyped "I'd argue another reason for base32 over base64: ...", when I meant to say "I'd argue another reason for base32 over base36: ..."

kevina commented 6 years ago

@samholmes I don't think anyone here wants to use base64, @mib-kd743naq was arguing for base36.

mib-kd743naq commented 6 years ago

Fair enough. My dislike of base32 is that it is an unwieldy encoding, with a pre and post padding scheme, that has to be done using "chunk math". Consider the case of a 38-byte CID:

the first 35 bytes are processed 1:1
The result needs to be properly 0-padded (i.e. A-padded) given the CID itself often has leading 0s
the remaining 3 bytes need to be padded as well
ugh.

In comparison: with base58 all one has to do is feed the CID value to the hand-optimized GMP binding and that's it.

In any case - 32 it is.

kevina commented 6 years ago

@mib-kd743naq we are proposing base32 without any padding characters. In that light, I am not sure how base32 is an unwieldy encoding compared to say base 36 or base 58.

mib-kd743naq commented 6 years ago

@kevina this is the padding I am referring to: https://github.com/whyrusleeping/base32/blob/c30ac30633ccdabefe87eb12465113f06f1bab75/base32.go#L121-L164

kevina commented 6 years ago

@mib-kd743naq base32 (when padding characters are not used) doesn't have to be implemented that way. I don't know the full details of a base 36 or base 58 implementation, but I believe a base 32 implementation could be written in a manner similar to base 36 or base 58.

samholmes commented 6 years ago

@kevina you missed my point. Let me clarify. One of the reasons why bitcoin used base 58 instead of something like base 64, or base 62, or any base higher than 58 was for human-readability.

kevina commented 6 years ago

@samholmes I think we are in agreement. I am not trying to advocate base64, I was using it as an example, base32 is the best all around compromise.

Kubuxu commented 6 years ago

Also, it was a very important aspect for Bitcoin, as you are visually comparing addresses when making a transfer. That is not an issue in the case of IPFS.

samholmes commented 6 years ago

@kubuxu Visual comparison is important for IPFS. Address spoofing is not an attack vector that should exist in IPFS.

ghost commented 6 years ago

Hosts in URLs are case-insensitive so base64 and base58 are out (as mentioned somewhere above)

ghost commented 6 years ago

I'm gonna make a call here because we need to move forward.

Let's go with @kyledrake's original proposal:

I think go-ipfs should use lowercased base32 (rfc4648 - no padding - highest letter) as the default multibase

If anyone has important reasons not to go with this proposal, please call a veto.

@whyrusleeping @Kubuxu @Stebalien is this still something we can include in v0.4.11?

Stebalien commented 6 years ago

The problem here is the CID spec and how go-ipfs implements the datastore.

1. According to the spec, all CIDv0s must be base58.
2. The datastore stores values by CID instead of by multihash, so a CIDv1 can't retrieve a block associated with a CIDv0 even if the multihashes match.

Note: this currently "just works" because we ignore (1) in go and allow non-base58 CIDv0s.
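To illustrate the relationship between the two CID versions, a sketch under stated assumptions: a CIDv0 is a bare sha2-256 multihash, and the binary CIDv1 for the same block prefixes it with a version byte (0x01) and a codec byte (0x70, dag-pb). The `Map` keyed by multihash is a hypothetical store showing how a lookup could match across versions. It is not how the go-ipfs datastore works, which, as noted above, keys by CID.

```javascript
// Assumption: sha2-256 multihash (0x12, 0x20, 32 digest bytes) and the dag-pb
// codec (0x70), so version and codec each fit in one byte.
function cidV1FromV0(multihash) {
  return Uint8Array.from([0x01, 0x70, ...multihash]);
}

// Strip the CIDv1 version and codec bytes; a CIDv0 is already a bare multihash.
function multihashOf(cidBytes) {
  return cidBytes[0] === 0x01 ? cidBytes.slice(2) : cidBytes;
}

const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");

// Hypothetical block store keyed by multihash rather than by full CID.
const blocks = new Map();
const cidV0 = Uint8Array.from([0x12, 0x20, ...new Array(32).fill(0xab)]);
blocks.set(toHex(multihashOf(cidV0)), "block data");

const cidV1 = cidV1FromV0(cidV0);
console.log(blocks.get(toHex(multihashOf(cidV1)))); // "block data"
```

The digest here is a dummy value (32 bytes of 0xab), not a real hash.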

Stebalien commented 6 years ago

@lgierth see https://github.com/ipfs/go-ipfs/issues/4143#issuecomment-325583021

ghost commented 6 years ago

@Stebalien ah right! Let's defer this to 0.4.12 then

olizilla commented 5 years ago

The success of the ipfs-in-web-browsers effort is becoming increasingly dependent on being able to use the CID as the authority in ipfs:// style addresses. The browsers normalise the authority section to lower-case before we can intercept them in a web-extension; non-base32 encoded CIDs get mangled before we get to them.
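The mangling is easy to reproduce with the WHATWG `URL` parser that Node and browsers expose. This sketch uses an https: URL because the parser lowercases hosts for special schemes like http and https; how a particular browser treats ipfs:// is not something this snippet demonstrates. The CIDv0 is the example from earlier in the thread.

```javascript
// A base58 CIDv0 placed in the host position of an https: URL.
const cidV0 = "QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn";
const host = new URL(`https://${cidV0}.dweb.link/`).hostname;

// The parser lowercases the host, so the case information that base58
// depends on is gone by the time application code sees it.
console.log(host === `${cidV0}.dweb.link`);               // false
console.log(host === `${cidV0.toLowerCase()}.dweb.link`); // true
```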

Where should I be pushing to help make this a thing @Stebalien @lgierth @kyledrake ?

olizilla commented 5 years ago

The libdweb conversation gives some context here https://github.com/mozilla/libdweb/issues/2#issuecomment-395599043

kevina commented 5 years ago

As I see it, the biggest hold up is the handling of CidV0 in base 32. Things will be easier if we just allow it, but others don't seem to want to, see https://github.com/ipfs/go-cid/issues/34. If we go this route the actual code will probably take a week or less to do.

If we want to use this opportunity to force the use of CidV1, as others seem to want, things become more complicated.

kevina commented 5 years ago

I'm not sure, but maybe this is something that we can work on in the upcoming developers meeting. See https://github.com/ipfs/developer-meetings/pull/16.

lidel commented 5 years ago

AFAIK we could introduce an intermediate step of keeping default CID at v0 until backend issues are resolved, but changing CIDv1 encoding to Base32.

That way if someone decides to opt-in to CIDv1 (--cid-version) they will use future-proof base32. This will help us with moving forward with creating tests and introducing support in user-facing libs.

~~Y/n?~~

Update: ipfs add --cid-base=base32 is tracked in https://github.com/ipfs/go-ipfs/issues/5233

daviddias commented 5 years ago

That way if someone decides to opt-in to CIDv1 (--cid-version) they will use future-proof base32. This will help us with moving forward with creating tests and introducing support in user-facing libs.

👍 👍

kyledrake commented 5 years ago

https://github.com/ipfs/ipfs/issues/337 is now the parent issue for this migration, please go there to follow progress.

lidel commented 3 years ago

CIDv1 has been around for a while, and CIDv0 causes us more and more trouble:

people naively parsing them as Multihash instead of a CID – don't want to point fingers, but it's a known problem
CIDv0 getting b0rked by browsers like Firefox

I feel we are approaching the point where it's less painful to make the breaking change than to constantly debug those bugs and ask people to use --cid-version 1.

@Stebalien @aschmahmann do you see any hard blockers for flipping ipfs add to use --cid-version 1 by default?

Stebalien commented 3 years ago

Given the fact that users can just "flip it back", no. We wanted to take the opportunity to flip other switches, but that's not going to happen any time soon.
