kstenerud / safe-encoding

Binary-to-text encoding that is safe to pass through modern text processors

Human-oriented encoding #5

Open DonaldTsang opened 5 years ago

DonaldTsang commented 5 years ago

Since there is a list of characters that cannot be used in Windows, URLs, SGML, or JSON (" # & ' / : < ? \ > | * . are all unusable)... maybe it is time to invent a human-oriented base64 encoding that is less confusing with respect to uppercase and lowercase (e.g. C and c, K and k, O and o, P and p, S and s, V and v, W and w, X and x, Z and z). BinHex isn't perfect. Criteria: given the most common office fonts, and when rewritten by hand, the alphabet should not cause confusion.

DonaldTsang commented 5 years ago

Adaptation of other low-radix encodings:

https://github.com/TJOLeary/base55gen
https://github.com/2Lomi/Base53
https://github.com/ShaunTS/Base53
https://github.com/search?q=base52
https://github.com/xavierfaucon/BravoIds (base 62/50)
https://github.com/anthonycr/Base47
https://github.com/jacoblyles/base43js
https://github.com/liuvbill/iOS-Swift-Base43
https://github.com/sveljko/base41
https://github.com/wd5gnr/Base40Packing
https://github.com/Jameskmonger/base37-string
https://github.com/search?q=base36

kstenerud commented 5 years ago

Human input is tricky due to human vision, human handwriting, and human psychology. Humans will mess up the case, because they had caps lock on, or didn't think it would matter (because uppercase and lowercase are semantically the same character). Humans are also bad at inputting symbols, especially since they're in different locations on different international keyboards (and some don't appear at all!). As well, humans consistently mess up the following (especially when writing down by hand):

That leaves us with a viable character set of:

Char Char Char Char
0 8 g r
1 9 h s
2 a j t
3 b k v
4 c m w
5 d n x
6 e p y
7 f q z

where o can be substituted for 0, l and i can be substituted for 1, u can be substituted for v, and uppercase can be substituted for lowercase.

This is how safe32 is implemented: https://github.com/kstenerud/safe-encoding/blob/master/safe32-specification.md#alphabet

DonaldTsang commented 5 years ago

@kstenerud my idea is to use special characters to fill in all 64 "holes" with minimal human-legibility issues. I do agree that it is tricky, but special characters are known to be highly identifiable.

kstenerud commented 5 years ago

There are 32 symbols in the lower 7-bit (ASCII) set:

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

Of these:

This leaves a set of + = @ ^ for a total of 36 characters.

Even if you could somehow squeeze in more characters, consider that most human input codes are at most 20 characters:

Small gains for the increased human input complexity (especially considering phone input, where some of the characters are hidden 2 modifiers deep).
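For concreteness, here is the arithmetic behind the "small gains" point: the number of whole payload bytes recoverable from a 20-character code at a few radixes (a quick back-of-the-envelope check, not code from the repo):

```python
# Whole payload bytes that fit in a 20-character code at various radixes.
from math import floor, log2

for base in (32, 36, 64):
    bits = 20 * log2(base)          # total information in 20 characters
    print(base, floor(bits / 8))    # whole bytes that fit
# base 32 -> 12 bytes, base 36 -> 12 bytes, base 64 -> 15 bytes
```

Note that base 36 carries no more whole bytes than base 32 at this length, which is the crux of the argument.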

DonaldTsang commented 5 years ago

I would argue that $ is so distinct from S that it is a counter-example: a distinguishable character. And the point about brackets is mostly true, except for curly brackets, which are more distinctive. Other than that you are spot on. I don't get why tildes are eliminated, considering they are common on standard IBM/US keyboards. Also, if a set of characters is mutually confusable, pick one from the set to expand the character space, e.g. comma, dash, and semicolon.

kstenerud commented 5 years ago

Here are all of the special characters:

! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

Here they are again, grouped together by confusability:

1 l i ! ( ) / | \ [ ] { }
' `
: ;
- _
, .

Non-confusable:

" # $ % & * + < = > ? @ ^ ~

These are the disallowed characters:

" # & ' / : < ? = \ > | * .

Removing the disallowed characters and the set confused with "1" leaves us with:

$ % + @ ^ ~
' ` (where ' is disallowed)
: ; (where : is disallowed)
- _
, . (where . is disallowed)

This leaves 7 "clean" characters and 3 "dirty" ones, for a total of 42.

Of these:

This will already be super confusing to grandpa trying to punch in his activation code.

All for 1 extra byte in a 20 char code (13 instead of 12). Not worth the effort and the tech support pain.
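Checking that arithmetic (a hypothetical helper, just to make the "13 instead of 12" figure concrete):

```python
# Whole bytes recoverable from a 20-character code at base 42 vs base 32.
from math import floor, log2

bytes_in_20 = lambda base: floor(20 * log2(base) / 8)
print(bytes_in_20(42), bytes_in_20(32))  # -> 13 12
```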

kstenerud commented 5 years ago

Actually, I just noticed that without my reading glasses, tilde ~ is indistinguishable from dash -.

Also, ' ` . , will be confusable if written on paper.

DonaldTsang commented 5 years ago

No, I would make a list of alphabetic characters that are not confusing.

A might look like a 4, but it is different enough. a is definitely unique.
B might look like an 8, but it is different enough. b looks a lot like 6.
C and c are similar.
D may look like o O 0, but is unique enough. d is unique.
E and e, F and f all look very distinct. G is unique.
g may look like 9 or q, so we will have to decide which one to pick.
H and h are unique.
I looks like 1, BUT i is unique enough (different enough from ;).
J and j are unique.
K and k are similar, so we will have to decide which one to pick.
L is unique, BUT l looks like 1 and I.
M and m, N and n are unique.
O and o look like 0, so we will have to decide which one to pick.
P and p are similar, so we will have to decide which one to pick.
Q is unique, but q looks like g and 9.
R and r are unique.
S and s are similar, but distinct enough from 5.
T is unique; t is distinct enough from +.
U and u, V and v, W and w, X and x, Y and y are all pair-wise similar.
Z and z are similar, but distinct from 2.

`A a B C D d E e F f G H h i J j K L M m N n P Q R r S T t U V W X Y Z 0 1 2 3 4 5 6 7 8 9 $ % + @ ^ ~ \ ; - ,` is my first proposal, 55 characters?

DonaldTsang commented 5 years ago

Did another study

from math import log

# For each payload size j (in bytes) and each candidate base i (48..56),
# take the fractional part of the exact character count 8*j / log2(i).
# Values close to 1.0 mean base i packs j bytes with very little waste;
# print each new maximum above 0.8 along with the whole-character count.
for j in range(1, 65):
  a = 0.8
  for i in range(48, 57):
    b = 8*j*log(2)/log(i) % 1
    if b > a:
      a = b
      print(j, i, a, 8*j*log(2)/log(i)//1)

2 48 0.8648357080166615 2.0
5 53 0.9833372019217972 6.0
7 49 0.9738012390246205 9.0
9 48 0.8917606860749778 12.0
10 53 0.9666744038435944 13.0
12 51 0.9239777013326815 16.0
13 55 0.9888478062880175 17.0
14 49 0.9476024780492409 19.0
15 53 0.9500116057653898 20.0
16 48 0.9186856641332923 22.0
17 51 0.9756350768879649 23.0
18 55 0.9076354240911009 24.0
19 50 0.931940660604802 26.0
20 53 0.9333488076871888 27.0
21 49 0.9214037170738649 29.0
21 56 0.9288328774084817 28.0
22 52 0.8748111904239941 30.0
23 48 0.9456106421916104 32.0
23 54 0.9728230883449598 31.0
24 51 0.8479554026653631 33.0
25 48 0.8104463502082737 35.0
25 53 0.9166860096089877 34.0
26 50 0.8542345881960429 36.0
26 55 0.977695612576035 35.0
27 52 0.891813733702179 37.0
28 49 0.8952049560984818 39.0
28 54 0.9234368032025557 38.0
29 51 0.89961277822065 40.0
29 56 0.9493406402307585 39.0
30 48 0.9725356202499214 42.0
31 50 0.941587393618363 43.0
32 48 0.8373713282665847 45.0
32 52 0.9088162769803532 44.0
33 54 0.8740505180601517 45.0
34 51 0.9512701537759298 47.0
35 49 0.8690061951230987 49.0
35 53 0.8833604134525714 48.0
36 55 0.8152708481822017 49.0
37 48 0.9994605983082394 52.0
38 50 0.8638813212096039 53.0
39 48 0.8642963063249027 55.0
39 55 0.9665434188640489 53.0
40 49 0.9931499372835475 56.0
41 51 0.8235904795533244 57.0
41 54 0.9950324618323165 56.0
42 49 0.8428074341477299 59.0
42 52 0.9428213635367158 58.0
43 50 0.9512341266319169 60.0
44 55 0.8853310366671394 60.0
45 53 0.8500348172961694 62.0
45 56 0.990356165875312 61.0
46 48 0.8912212843832208 65.0
46 54 0.9456461766899196 63.0
47 49 0.9669511763081715 66.0
49 49 0.8166086731723539 69.0
50 50 0.873528054223172 70.0
50 56 0.8781735176392402 68.0
51 51 0.9269052306638912 71.0
52 52 0.9768264500930712 72.0
53 48 0.9181462624415389 75.0
54 49 0.9407524153327955 76.0
55 50 0.9608808596454708 77.0
56 51 0.978562606219171 78.0
57 52 0.9938289933712525 79.0
58 56 0.898681280461517 79.0
59 52 0.800630010682525 82.0
60 48 0.9450712404998427 85.0
61 49 0.9145536543574195 86.0
62 48 0.8099069485165131 88.0
62 50 0.8831747872367259 87.0
63 51 0.8508829319965656 88.0
63 53 0.9900487442146328 87.0
64 52 0.8176325539607063 89.0
64 54 0.9678555501772763 88.0
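To interpret one row of the table above: `13 55 0.9888... 17.0` means that 13 bytes in base 55 needs just under 18 characters (17 whole characters plus a 0.9888 fraction), i.e. 18 characters with almost no waste. A quick check (the helper name is mine):

```python
# Characters needed to represent n_bytes of data in a given base.
from math import ceil, log

def chars_for(n_bytes, base):
    return ceil(n_bytes * 8 * log(2) / log(base))

print(chars_for(13, 55))  # -> 18
```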
kstenerud commented 5 years ago

In the early days of the web, many human input systems (such as captchas) were case sensitive. None of them are anymore, because analytics showed a significant dropoff in the funnel for case sensitive input systems vs case insensitive.

Even if the characters are visually different, uppercase and lowercase letters are semantically the same. Many humans will input the entire sequence in one case (all upper or all lower), regardless of what the original looked like.

The point of human input systems is to cater to their psychological and sensory biases so as to minimize erroneous or cumbersome input. Symbols are bad, mixed case is worse, and similar looking letters are right out.

DonaldTsang commented 5 years ago

@kstenerud I would say that if someone is able to type out special characters, then they can handle upper case as well. We are in the era of Web 3.0, and human perception of characters has sharpened for the technological age. What I hope to achieve is a more compact "writable" binary-to-text format (more compact than safe32) that is good for comparing and typing, without exceeding the human limit of character recognition. Side note: if Unicode were allowed I would use unique CJKV characters, given the chance, but then it would not be ASCII, it would not be web safe, and it would take up more byte space than safe32/64/80/85. The same goes for mixing Cyrillic, Katakana+Hiragana, Hindi, and Arabic scripts, etc. ASCII takes up fewer bytes than Unicode. Also, such a system would reduce performance due to the need to manipulate Unicode.

kstenerud commented 5 years ago

We'll just have to disagree on that point, then. I understand your desire for a more compact human-writable encoding, but the savings are minimal. In a 20 character sequence (arguably the most you'd want a human to input), the difference between base 64 and 32 is 15 encoded bytes vs 12 encoded bytes. 8 bytes is already 64 bits, enough to assign a unique ID to every grain of sand on every beach and more. Adding a checksum/CRC would be 1-2 bytes, with the rest not being very useful. So for almost everyone, 10 bytes of data is plenty, which is 14 chars in base64, and 16 chars (aaaa-bbbb-cccc-dddd) in base32.
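The character counts in that last sentence follow directly from the per-character information content (again a back-of-the-envelope helper, not repo code):

```python
# Characters needed to carry n_bytes of payload at a given radix.
from math import ceil, log2

def chars_needed(n_bytes, base):
    return ceil(n_bytes * 8 / log2(base))

print(chars_needed(10, 64), chars_needed(10, 32))  # -> 14 16
```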

kstenerud commented 5 years ago

And yes, even though unicode should in theory be safe, there are still too many legacy shift-jis, euc, big5 etc systems around.

DonaldTsang commented 5 years ago

@kstenerud agree to disagree then in regards to "human orientation".

Also, theoretically speaking, how much can base55 (the maximum base) save when compared to base32? If my calculations are correct, for a 512-bit hash base55 would require 86 characters while base32 requires 102 characters. For 256-bit hashes it's 45 vs 52 characters. I could be wrong though. These two rows sparked my interest as well:

 5 53 0.9833372019217972  6.0
13 55 0.9888478062880175 17.0

Is 128-bit arithmetic common in C libraries?

P.S. Shift-JIS and Big5 have already been deprecated by many Asian governments, so that might not be a good argument. The "wasted Unicode bytes" and "lowered performance due to inconsistent character byte lengths" are good arguments in their own right.

kstenerud commented 5 years ago

I calculate 45 chars for 256 bit and 87 for 512 bit, using a 128-bit integer size (18 chars to 13 bytes). To keep it inside 64-bit ints, you'd need 7 chars to 5 bytes, which gives 45 chars again for 256 bit, and 90 chars for 512 bit.
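The 64-bit-friendly variant can be sketched like this (function and parameter names are mine; the chunk sizes follow from 55**7 < 2**64 while 55**7 > 2**40, so 7 characters carry 5 bytes):

```python
# Characters needed when encoding in fixed-size chunks, with a shorter
# final chunk for any leftover bytes.
from math import ceil, log2

def chunked_chars(n_bytes, base, chunk_bytes, chunk_chars):
    full, rem = divmod(n_bytes, chunk_bytes)
    chars = full * chunk_chars
    if rem:
        chars += ceil(rem * 8 / log2(base))
    return chars

# base55 with 5-byte / 7-char chunks (fits in a 64-bit integer)
print(chunked_chars(32, 55, 5, 7))   # 256-bit hash -> 45
print(chunked_chars(64, 55, 5, 7))   # 512-bit hash -> 90
```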

A 128-bit integer is not standard C (it's a GNU extension), and it's slow.

Also, for unicode solutions, look here: https://github.com/qntm/base2048

DonaldTsang commented 5 years ago

Firstly, if 45 characters is constant between the two cases, maybe there is a substantial gain to base55 and other low bases (compared to 52 characters in base32, a 15% gain). I do hope it can be used as a basic encoding for SHA-256/512 hashes (from a human-oriented perspective), among other things.

Secondly, are there any other 128-bit integer (or even 256/512-bit) arithmetic libraries that are fast? I don't think GMP is known to be fast; there have to be other libraries that can do that. Would coding in pure C make it faster?

P.S. For Unicode solutions, I have seen "better" solutions https://github.com/qntm/base32768 https://github.com/rinick/base2e15 https://github.com/grandchild/base32k (off-topic)

kstenerud commented 5 years ago

I don't know of any 128 bit int libraries (haven't really looked). Mostly I'm just waiting for 128 bit to be incorporated into the C standard.

DonaldTsang commented 5 years ago

https://github.com/ridiculousfish/libdivide might have a clue on possibly faster division speeds (can't guarantee that though)

ecki commented 1 year ago

BTW, we have major problems with any encoding containing $, as users routinely enter it unescaped on the command line.