PROPOSAL: Recommend that ILP Addresses are only encoded in Base64

adrianhopebailie commented 8 years ago

The only requirement for ILP Addresses is that they are a binary string that can be used for prefix based matching at connectors.

Therefore RFC 3 does not need to specify an encoding but can recommend one that allows the binary string to be encoded into a human-readable form.

There is value in making this simple because it reduces the attack surface exposed by unusual addresses.

PROPOSAL: RFC 3 should say the following:

ILP Addresses are an octet string
It is not recommended for implementations to attempt to encode the address in any encoding but Base64URL
Human-readable addresses should be defined using only the Base64Url (RFC 4648) character set and then decoded. (This implies that they should then also be defined in multiples of 4 characters to avoid needing padding bits).
The - character (index 62) can be used as a delimiter in this case.
It would make sense for root addresses to use 4 or 8 chars so that subsequent segments can start with a - and then use 3, 7, 11, 15 etc. chars

Examples:

Bob's USD account at Wells Fargo in the USA
usa-bank-wfa-usd-bob = 0xba 0xc6 0xbe 0x6d 0xa9 0xe4 0xfb 0x07 0xda 0xfa 0xeb 0x1d 0xf9 0xba 0x1b

Adrian's ZAR account at Standard Bank in South Africa
rsa-bank-std-zar-adrian0 = 0xae 0xc6 0xbe 0x6d 0xa9 0xe4 0xfa 0xcb 0x5d 0xfb 0x36 0xab 0xf9 0xa7 0x6b 0x89 0xa9 0xf4

A Bitcoin account
bitcoin--145b3dEskk1a7U...
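The hex values in these examples can be reproduced with Python's standard `base64` module. This is only an illustrative sketch: the helper name `address_to_hex` is hypothetical, and it assumes the address length is a multiple of 4 so that no padding is needed.

```python
import base64

def address_to_hex(address: str) -> str:
    """Decode a base64url address (length must be a multiple of 4) to hex."""
    if len(address) % 4 != 0:
        raise ValueError("address length must be a multiple of 4")
    return base64.urlsafe_b64decode(address).hex()

print(address_to_hex("usa-bank-wfa-usd-bob"))
print(address_to_hex("rsa-bank-std-zar-adrian0"))
```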

emschwartz commented 8 years ago

Even though we could have addresses be any binary string, I think this would be an unnecessary barrier to adoption. I think it would be much more developer friendly to have readable addresses rather than 0xae 0xc6 0xbe 0x6d 0xa9 0xe4 0xfa 0xcb 0x5d 0xfb 0x36 0xab 0xf9 0xa7 0x6b 0x89 0xa9 0xf4

adrianhopebailie commented 8 years ago

They are readable when you encode that binary as a base64url encoded string

justmoon commented 8 years ago

For reference, here is the work on crypto-condition base64 encoding which inspired this: https://tonicdev.com/justmoon/cc-uris

In the CC case, we decided not to go that route.

The biggest issue I see with the encoding as proposed by @adrianhopebailie is the multiple-of-4 weirdness, e.g. you give the example: rsa-bank-std-zar-adrian0 which encodes to aec6be6da9e4facb5dfb36abf9a76b89a9f4. Neat! So you might think that you can also encode rsa-bank-std-zar-adrian, right? Well, yes, that would be aec6be6da9e4facb5dfb36abf9a76b89a9. Only problem is that that decodes to rsa-bank-std-zar-adriak. In other words you need to be very careful to use a length that is a multiple of four in base64, otherwise there is no bijection.
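The round trip described above can be checked with a short Python sketch. The helper names `decode` and `encode` are hypothetical, and the sketch relies on the default (non-strict) behaviour of Python's `base64` module, which silently discards the leftover bits of an incomplete final group.

```python
import base64

def decode(address: str) -> bytes:
    # Add the "=" padding that the base64 module expects for short final groups.
    return base64.urlsafe_b64decode(address + "=" * (-len(address) % 4))

def encode(raw: bytes) -> str:
    # Encode and drop the "=" padding again.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

raw = decode("rsa-bank-std-zar-adrian")  # 23 characters, not a multiple of 4
print(raw.hex())
print(encode(raw))  # the final character no longer matches the input
```

The last two bits of the final `n` do not fit into a whole octet, so they are lost on decoding and come back as zeros on encoding.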

What if the addresses were actually written with that in mind, e.g. with a space every four characters?

rsa- bank -std -zar -adr ian0

This makes it a lot easier to see why -adrian would be an invalid address - the last grouping would be incomplete. Also has an interesting parallel with IBAN numbers which are grouped in fours. Here are some IBAN numbers as ILP addresses:

IBAN DE44 5001 0517 5407 3249 31-- => 20100d0c4e38e74d35d39d7be78d3bdf6e3ddf5fbe
IBAN SA03 8000 0000 6080 1016 7519 => 20100d480d37f34d34d34d34eb4f34d74d7aef9d7d

Now, identifiers with spaces are kinda difficult to handle, so maybe a period . instead of a space would be better.

IBAN.DE44.5001.0517.5407.3249.31-- => 20100d0c4e38e74d35d39d7be78d3bdf6e3ddf5fbe
IBAN.SA03.8000.0000.6080.1016.7519 => 20100d480d37f34d34d34d34eb4f34d74d7aef9d7d

Pros:

ILP addresses would look somewhat distinctive
Full bijection with octet strings (any octet string is an ILP address and vice versa)*
Less confusion with domain names (starting to look more like an IP address)

Cons:

More complicated
Prefixes that don't align with four character boundaries wouldn't be prefixes in the binary (although they would be ranges)
Awkward having to align things on four character boundaries
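One possible reading of the prefix point, sketched in Python: a text prefix that does not end on a four-character boundary covers a number of bits that is not a whole number of octets, so the nearest octet prefix also matches addresses that do not share the text prefix. The addresses and the `decode` helper below are made up for illustration.

```python
import base64

def decode(address: str) -> bytes:
    # Pad to a multiple of 4; leftover bits of a short final group are dropped.
    return base64.urlsafe_b64decode(address + "=" * (-len(address) % 4))

prefix = "rsa-ba"    # 6 characters = 36 bits, not a whole number of octets
other = "rsa-bZnk"   # does not start with the text prefix

print(other.startswith(prefix))                  # comparison on the text form
print(decode(other).startswith(decode(prefix)))  # comparison on the octets
```

Here the text comparison fails while the octet comparison succeeds, because only the first two bits of the sixth character end up in the fourth octet.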

</crazy-theory-time>

For now I still like the idea that addresses are just a simple subset of IA5/ASCII and encoded as such. Yes, it's less efficient, but it's easy enough to understand. Still this is interesting and worth exploring.

adrianhopebailie commented 8 years ago

For now I still like the idea that addresses are just a simple subset of IA5/ASCII and encoded as such. Yes, it's less efficient, but it's easy enough to understand. Still this is interesting and worth exploring.

One could look at this proposal in that light, i.e. the allowed subset of IA5/ASCII is the same character table defined by base64url ([A-Za-z0-9_-]+).

What we need to stress though is that ILP Addresses are OCTET STRINGS. The encoding to Base64Url is just a convenience for human-readability.

We must define how our subset of IA5/ASCII is decoded to binary and how binary addresses are encoded to this human-readable form. We can either invent a new encoding or re-use one like base64url.

It actually doesn't matter that rsa-bank-std-zar-adrian and rsa-bank-std-zar-adriak are converted to the same binary form because any processing of the address should be done on the binary form not the human-readable form.

Obviously, it would be ideal if entities generating addresses did so in a way that this was avoided using complete sets of 4 base64url chars but it's not essential.

An alternative is to live with ASCII (or a subset of that) and use the 8th bit in each octet as a check bit/parity bit? It's less efficient but may be easier to live with.

adrianhopebailie commented 8 years ago

Using space or . as a separator is quite a nice touch for readability.

The encoding/decoding rules could simply state that any non-base64url chars are stripped out before decoding and that after encoding a space/. is inserted between every group of 4 chars to improve readability.
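A minimal Python sketch of such rules, assuming the separator is purely cosmetic. The function names `decode_grouped` and `encode_grouped` are hypothetical and not part of any ILP library.

```python
import base64
import re

def decode_grouped(text: str) -> bytes:
    # Strip everything outside the base64url alphabet (spaces, periods, ...).
    compact = re.sub(r"[^A-Za-z0-9_-]", "", text)
    return base64.urlsafe_b64decode(compact + "=" * (-len(compact) % 4))

def encode_grouped(raw: bytes, sep: str = ".") -> str:
    # Encode without padding, then insert the separator after every 4 chars.
    compact = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return sep.join(compact[i:i + 4] for i in range(0, len(compact), 4))

raw = decode_grouped("IBAN DE44 5001 0517 5407 3249 31--")
print(raw.hex())
print(encode_grouped(raw))
```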

Another reason I like this is because it's easy to use a UUID and tag it onto a well known address (which is already a 4 char multiple). So ledgers can take time to define their address as being "well formed" by putting - at the end as required, e.g. rsa-bank-stand--, and then simply create UUIDs for account identifiers and tack those on to address accounts:

rsa-bank-stand--123e4567-e89b-12d3-a456-426655440000

emschwartz commented 8 years ago

What we need to stress though is that ILP Addresses are OCTET STRINGS. The encoding to Base64Url is just a convenience for human-readability.

This worries me a lot. If there is a human-readable form people will use it, whether we tell them to or not.

Since it's so core and we don't know yet all the ways ILP will be used I would much rather have a format that has a sensible human and machine readable form that doesn't invite obvious problems converting between the two.

👍 for subset of ASCII or something along those lines

adrianhopebailie commented 8 years ago

This worries me a lot. If there is a human-readable form people will use it, whether we tell them to or not

I think we've gone off base a bit.

Any string is the human-readable form of some bytes. The encoding just tells you how to convert between the bytes and the characters.

We want people to use the human-readable form and we need to define which characters they can use. Then we must also define which encoding is used when that string is converted to bytes and back to a string.

We have to define the encoding so that different systems that transfer messages over the wire can understand the bytes they receive and turn them into the right string.

If someone wants to use strings internally in their system they can do so, but doing so means first decoding the ILP header and using the right encoding to get the address in its string form. After that it doesn't matter.

If they want to be efficient and have no need to display the addresses to humans they can just leave the bytes raw as they come off the wire.

If we only allow a subset of ASCII then the question is why?

My proposal is to define ILP Addresses as simply:

Any string using only the base64url character set where the total number of chars is always a multiple of 4. Regex = /^(?:[A-Za-z0-9_-]{4})+$/.
The string should be base64url encoded/decoded when converting between binary and string forms
(optional) For readability a . MAY be inserted between groups of 4 chars and is stripped out when decoding.
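The proposed definition can be sketched as a small validator in Python. The name `is_valid` is hypothetical, and the sketch assumes the optional `.` separators are removed before the regex from the proposal is applied.

```python
import re

# Regex from the proposal: one or more complete groups of 4 base64url chars.
ADDRESS_RE = re.compile(r"(?:[A-Za-z0-9_-]{4})+")

def is_valid(address: str) -> bool:
    # Periods are treated as optional readability separators and removed first.
    return ADDRESS_RE.fullmatch(address.replace(".", "")) is not None

print(is_valid("rsa-bank-std-zar-adrian0"))
print(is_valid("rsa-bank-std-zar-adrian"))
print(is_valid("rsa-bank-stand--123e4567-e89b-12d3-a456-426655440000"))
```

Under this definition the 24-character address and the ledger-plus-UUID address are accepted, while the 23-character address is rejected.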

(This is basically what IP addresses are. 4 bytes, encoded as integers and separated by . for readability)

emschwartz commented 8 years ago

Why bother having the requirement that the number of characters is a multiple of 4? Why shouldn't ilpdemo.red be a valid address?

The current schema is just /^[a-zA-Z0-9._~-]+$/. Every system should be capable of understanding ilpdemo.red as 0x696c7064656d6f2e726564 and vice versa.
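For comparison, a sketch of the plain ASCII approach in Python, using the schema quoted above. The helper name `to_octets` is hypothetical. Because each character maps to exactly one octet, prefix matching gives the same answer on the text and on the octets.

```python
import re

# Schema quoted above: unreserved URI characters only.
ADDRESS_RE = re.compile(r"[a-zA-Z0-9._~-]+")

def to_octets(address: str) -> bytes:
    if ADDRESS_RE.fullmatch(address) is None:
        raise ValueError(f"invalid address: {address!r}")
    return address.encode("ascii")

octets = to_octets("ilpdemo.red")
print(octets.hex())
print(octets.decode("ascii"))
# Prefix matching on the octets mirrors prefix matching on the text.
print(octets.startswith(to_octets("ilpdemo.")))
```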

ASCII seems more sensible to me than base64url. I'd rather not explain for the rest of my life why we made such an unusual decision.

adrianhopebailie commented 8 years ago

Why bother having the requirement that the number of characters is a multiple of 4? Why shouldn't ilpdemo.red be a valid address?

That's basically what the decision boils down to. That is the compromise you make for the efficiency of base64url.

Personally I don't think forcing sets of 4 is such a bad thing. You can use - and _ to delimit strings as you like. Also it has the nice property that UUIDs are a multiple of 4 chars so you can easily append one to a ledger address to create an account address or invoice address.

ASCII seems more sensible to me than base64url. I'd rather not explain for the rest of my life why we made such an unusual decision.

Instead we have to explain why we limit the character set then? Why are we wasting bits?

Sidenote: Some of the other encoding schemes like base58 explicitly leave out chars that could be confused like 0 and o 😄

emschwartz commented 8 years ago

To me this seems like a minimal efficiency gain that comes at the expense of a not insignificant barrier to human understanding and thus adoption.

As Wikipedia says, "Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format".

The question is whether the addresses are first text or binary. Since they are so core to ILP I would strongly advocate for them being human (developer) understandable first.

adrianhopebailie commented 8 years ago

The question is whether the addresses are first text or binary. Since they are so core to ILP I would strongly advocate for them being human (developer) understandable first.

👍 - So first decision is what chars we allow. Seems like we like the base64url set but we also like . too.

So let's say it's [a-zA-Z0-9._~-] as @emschwartz suggests (although I don't like ~; I'm not sure it adds any value).

Then we need to still explicitly define the encoding and I hear you saying ASCII. The compromise being wasted bits given our limited char set.

That begs the question, why are we limiting the charset at all? In fact, why not just use UTF-8 so that people can use international characters?

emschwartz commented 8 years ago

That character set is the unreserved characters in URIs:

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

From RFC-3986.

why not just use UTF-8 so that people can use international characters?

That was part of the discussion, and there are decent arguments for that. @justmoon was the one who was looking into this more, but I believe two of the arguments against using UTF-8 are the ease of confusing homographs and that the standard is constantly changing.

adrianhopebailie commented 8 years ago

That was part of the discussion, and there are decent arguments for that.

I recall that. I guess we need to pull all of this together to justify our design decisions. Given that addressing is so core we should take the time to do that in an RFC or something.

The homograph confusion only really matters if addresses are verified/transcribed by humans. One could argue that they never should be but then you could also say there's no good reason to support human-readable addresses at all which leaves us with something like IPv6.

adrianhopebailie commented 8 years ago

p.s. The final option is to define our own encoding 😄

Everybody's doing it!

justmoon commented 8 years ago

So first decision is what chars we allow.

Which breaks down into:

Which IA5/ASCII characters do we allow?

Adding a ton of allowed special characters doesn't add much value and constrains the contexts where ILP addresses can be used. So we decided on a pretty common, conservative set - the URL-safe (unreserved) characters.

Do we allow UTF-8 characters?

There are essentially two options here:

Allow any binary and just render it as UTF-8
Allow only valid UTF-8 and possibly only a subset of codepoints

The former means that any UTF-8 is allowed and it's up to implementations to limit what they are willing to render. In addition, they have to figure out how to signal that they've not rendered something while still allowing you to refer to that thing. Imagine a DoS attack where you can't block the originating IP because it contains some glitchy character and you can't properly copy it.

This option seems like a generally very b̧̢̪̜̥͓̭̻͇͖̔͑̽͊̇̂̇̎́͒a̛̞̟̹̖͓͓̠̪̲̭̓̃̓̋͆͑͛̀̚d̨̛͈͖̩̣̬̟̜̬̏̓͒̽̚͘͠͠͝ͅ ̧̥̗̮͉̥͓̪̬͂̒́͂̇̑̀̿̊͠ͅȋ̧̡̞̱̫͙͉̻̣̣̓̆̈́̂̓͑̾͝͝d̢͇͚͈̲͇̙͕̻͌̓̋̊̐́̌͘͜͝͠e̢̢̨͉͖͎̱̪̩̗̾̍̊̑̄̐̂̎̚͠ą̢̧̳̭̬̳̝̼̖̆̉͋́́̀͊͂͑͠...

The latter means we need to create correct UTF-8 implementations for languages that don't have them natively (like JavaScript). To restrict characters to some sensible subset, we would have to go over the entire Unicode set, select acceptable codepoints and implement validators for every Interledger implementation. It would take us years and we'd have to update it every couple of months when new Unicode versions are released.

That leaves us with [a-zA-Z0-9_~.-] encoded as plain IA5/ASCII.
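A small Python sketch of what this restriction buys in terms of homographs. The example strings are made up; the second one replaces the Latin "a" with the visually similar Cyrillic letter U+0430.

```python
import re

# Character set from the comment above.
ADDRESS_RE = re.compile(r"[a-zA-Z0-9_~.-]+")

latin = "bank"
lookalike = "b\u0430nk"  # second letter is CYRILLIC SMALL LETTER A (U+0430)

print(latin == lookalike)                           # the strings differ
print(ADDRESS_RE.fullmatch(latin) is not None)      # accepted
print(ADDRESS_RE.fullmatch(lookalike) is not None)  # rejected
```

With a fixed ASCII subset the lookalike is rejected by a single regex, with no need for a Unicode codepoint allow-list.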

You can look at it and say that we're being culturally biased and not very international, which is fair. But practically speaking, supporting all character sets is too complex, and if we're going to support only one, the Latin character set is the most international. When I browse Chinese Github there are a lot of Latin characters, but not a lot of Cyrillic or Arabic ones. Latin characters are also printed on most keyboards worldwide, even the ones that also support another language.

adrianhopebailie commented 8 years ago

You can look at it and say that we're being culturally biased and not very international, which is fair.

Welcome to the world of global standardization 😄

I think there is a good enough case for keeping this simple. My concern is avoiding the disaster that was retrofitting IRI to the world of URIs. I think we can justify our design decision though because ILP addresses are not expected to be exposed to end users.

One caveat to that is defining ILP addresses for accounts in countries or at institutions where the natural label for the account would be in Cyrillic, Arabic, Chinese or some other non-Latin set. We should try to get input from that audience to hear how practical this really is.

Using a limited set that is enforced by the encoding is another way to justify the decision but I accept that the complexity this adds by requiring addresses to be expressed in multiples of 4 chars is a problem.

emschwartz commented 7 years ago

I think we're sticking with ASCII for now so I think we can close this issue

adrianhopebailie commented 7 years ago

+1 - Where have we defined the address format for reference?

Also, I think the right definition for us to use is IA5 (this is also an ASN.1 type) as opposed to ASCII because it's an international standard.

justmoon commented 7 years ago

@adrianhopebailie I wrote an ASN.1 definition of the address format here:

https://github.com/interledger/rfcs/blob/docs/st-0003-asn-formatting/asn1/InterledgerTypes.asn#L19-L28

interledger / rfcs

PROPOSAL: Recommend that ILP Addresses are only encoded in Base64 #59
