Closed Jorropo closed 1 year ago
/cc @mriise
ok, but do you have an actual usecase for it? I think my objection is expanding the list of bases just because we can. I can see this one being very difficult to use in practice, particularly because of the difficulty escaping such characters, it'd probably want to be isolated to a narrow usecase I imagine.
fine if there's an application for this, otherwise this feels a little like noise?
Why 94 over 85? Some of these chars will need escaping in many string literals.
Lex85 A base85 encoding with stable lexicographic sorting to the raw bytes
We want to put bytes into ascii strings with more density then b64. Our target environment for Lex85 is CSV, TSV, JSON, and programming language literal strings. If we look at how many bytes a char if encoded in json is their are 93 chars that are encoded as a single byte.
In [0]: import json
In [1]: {i:chr(i) for i in range(256) if len(json.dumps(chr(i)))<=3} # "c"
Out[1]:
32: ' ', 33: '!', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(',
41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0',
49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8',
57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@',
65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H',
73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P',
81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X',
89: 'Y', 90: 'Z', 91: '[', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a',
98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i',
106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q',
114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y',
122: 'z', 123: '{', 124: '|', 125: '}', 126: '~'
we need to pick 85 chars for our alphabet
32: ' '
since it is ambiguous with tab33: '!'
to reserve it as less then all lex85 chars34: '"'
since it is the json string container39: "'"
since it is a string container44: ','
since it is a delimiter46: '.'
since it is a delimiter47: '/'
since it is a delimiter92: '\'
since it is the json string escaping96: '`'
since it is a string container126: '~'
to reserve it as less then all lex85 charsgiving us the alphabet
"#$%&()*+-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}"
Note the alphabet is in lexicographic sort order this is important since it allows for the encoded strings and decoded bytes to sort in the same order.
--------------------------------
| C1 | C2 | C3 | C4 | C5 |
|--------------------------------|
| i unsigned 32 bit integer |
|--------------------------------|
| byte1 | byte2 | byte3 | byte4 |
--------------------------------
i = (C1 * 85 ** 4) + (C2 * 85 ** 3) + (C3 * 85 ** 2) + (C4 * 85)+ C5
i = (b1 << 24) + (b2 << 16) + (b3 << 8) + (b4 << 0)
if the number of chars is not modulo 5 then pad with 84 if the number of bytes is not modulo 4 then pad with 0 for the conversions but then drop the pad from the output
base85_alphabet = "#$%&()*+-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}"
integer2character85 = base85_alphabet
character2integer85 = {character: index for index, character in enumerate(base85_alphabet)}
def encode85(buffer):
pad = (-(len(buffer) % -4))
buffer = buffer + b'\x00'*pad
encoded = [''] * ((len(buffer)//4)*5)
for i in range(0, len(buffer)//4):
integer = ((buffer[4 * i + 0] << 24)|
(buffer[4 * i + 1] << 16)|
(buffer[4 * i + 2] << 8 )|
(buffer[4 * i + 3] << 0 ))
encoded[5 * i + 0] = integer2character85[integer // (85**4) % 85]
encoded[5 * i + 1] = integer2character85[integer // (85**3) % 85]
encoded[5 * i + 2] = integer2character85[integer // (85**2) % 85]
encoded[5 * i + 3] = integer2character85[integer // (85**1) % 85]
encoded[5 * i + 4] = integer2character85[integer // (85**0) % 85]
if integer >> 32:
return OverflowError(f"{integer} > uint32_max")
return ''.join(encoded)[:len(encoded)-pad]
def decode85(string):
pad = (-(len(string) % -5))
string = string + '~'*pad
buffer = bytearray(len(string) // 5 * 4)
for i in range(0, len(string) // 5):
integer = (character2integer85[string[i * 5 + 0]] * (85 ** 4) +
character2integer85[string[i * 5 + 1]] * (85 ** 3) +
character2integer85[string[i * 5 + 2]] * (85 ** 2) +
character2integer85[string[i * 5 + 3]] * (85 ** 1) +
character2integer85[string[i * 5 + 4]] * (85 ** 0))
buffer[i * 4 + 0] = integer >> 24 & 0xff
buffer[i * 4 + 1] = integer >> 16 & 0xff
buffer[i * 4 + 2] = integer >> 8 & 0xff
buffer[i * 4 + 3] = integer >> 0 & 0xff
return bytes(buffer[:len(buffer)-pad])
I don't care about this feel free to reopen a new issue yourself if you do.
All of ascii in order from
!
to~
(inclusive) following a classical radix strategy, no padding.Key:
~
This trade maximum density while still being usable and exclusively ascii. Sadly this alphabet is secable (following unicode's rules it has characters which do word breaks) but it's the kind of tradeoffs you have to make when you want compactness, it's probably also slow to encode and decode, again tradeoffs have to be made.
No padded version because we lack a free character to do that with, also if you care about compactness you probably don't care about padding.
If no one has anything to object about it, I'll open a PR for a spec and a go implementation in the future.