multiformats / multibase

Self identifying base encodings
291 stars 74 forks source link

Base94: all of ascii except control and whitespace #102

Closed Jorropo closed 1 year ago

Jorropo commented 2 years ago

All of ascii in order from ! to ~ (inclusive) following a classical radix strategy, no padding.

Key: ~

This trade maximum density while still being usable and exclusively ascii. Sadly this alphabet is secable (following unicode's rules it has characters which do word breaks) but it's the kind of tradeoffs you have to make when you want compactness, it's probably also slow to encode and decode, again tradeoffs have to be made.

No padded version because we lack a free character to do that with, also if you care about compactness you probably don't care about padding.

If no one has anything to object about it, I'll open a PR for a spec and a go implementation in the future.

Jorropo commented 2 years ago

/cc @mriise

rvagg commented 2 years ago

ok, but do you have an actual usecase for it? I think my objection is expanding the list of bases just because we can. I can see this one being very difficult to use in practice, particularly because of the difficulty escaping such characters, it'd probably want to be isolated to a narrow usecase I imagine.

fine if there's an application for this, otherwise this feels a little like noise?

AaronGoldman commented 1 year ago

Why 94 over 85? Some of these chars will need escaping in many string literals.

Lex85 A base85 encoding with stable lexicographic sorting to the raw bytes

We want to put bytes into ascii strings with more density then b64. Our target environment for Lex85 is CSV, TSV, JSON, and programming language literal strings. If we look at how many bytes a char if encoded in json is their are 93 chars that are encoded as a single byte.

In [0]: import json
In [1]: {i:chr(i) for i in range(256) if len(json.dumps(chr(i)))<=3}  # "c"
Out[1]:
 32: ' ',  33: '!',  35: '#',  36: '$',  37: '%',  38: '&',  39: "'",  40: '(',
 41: ')',  42: '*',  43: '+',  44: ',',  45: '-',  46: '.',  47: '/',  48: '0',
 49: '1',  50: '2',  51: '3',  52: '4',  53: '5',  54: '6',  55: '7',  56: '8',
 57: '9',  58: ':',  59: ';',  60: '<',  61: '=',  62: '>',  63: '?',  64: '@',
 65: 'A',  66: 'B',  67: 'C',  68: 'D',  69: 'E',  70: 'F',  71: 'G',  72: 'H',
 73: 'I',  74: 'J',  75: 'K',  76: 'L',  77: 'M',  78: 'N',  79: 'O',  80: 'P',
 81: 'Q',  82: 'R',  83: 'S',  84: 'T',  85: 'U',  86: 'V',  87: 'W',  88: 'X',
 89: 'Y',  90: 'Z',  91: '[',  93: ']',  94: '^',  95: '_',  96: '`',  97: 'a',
 98: 'b',  99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i',
106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q',
114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y',
122: 'z', 123: '{', 124: '|', 125: '}', 126: '~'

we need to pick 85 chars for our alphabet

giving us the alphabet "#$%&()*+-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}"

Note the alphabet is in lexicographic sort order this is important since it allows for the encoded strings and decoded bytes to sort in the same order.

 -------------------------------- 
| C1    | C2   | C3  | C4  | C5  |
|--------------------------------|
| i unsigned 32 bit integer      |
|--------------------------------|
| byte1 | byte2 | byte3 | byte4  |
 --------------------------------

i = (C1 * 85 ** 4) + (C2 * 85 ** 3) + (C3 * 85 ** 2) + (C4 * 85)+ C5
i = (b1 << 24) + (b2 << 16) + (b3 << 8) + (b4 << 0)

if the number of chars is not modulo 5 then pad with 84 if the number of bytes is not modulo 4 then pad with 0 for the conversions but then drop the pad from the output

base85_alphabet = "#$%&()*+-0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}" 
integer2character85 = base85_alphabet
character2integer85 = {character: index for index, character in enumerate(base85_alphabet)}

def encode85(buffer):
    pad = (-(len(buffer) % -4))
    buffer = buffer + b'\x00'*pad
    encoded = [''] * ((len(buffer)//4)*5)
    for i in range(0, len(buffer)//4):
        integer = ((buffer[4 * i + 0] << 24)|
                   (buffer[4 * i + 1] << 16)|
                   (buffer[4 * i + 2] << 8 )|
                   (buffer[4 * i + 3] << 0 ))
        encoded[5 * i + 0] = integer2character85[integer // (85**4) % 85]
        encoded[5 * i + 1] = integer2character85[integer // (85**3) % 85]
        encoded[5 * i + 2] = integer2character85[integer // (85**2) % 85]
        encoded[5 * i + 3] = integer2character85[integer // (85**1) % 85]
        encoded[5 * i + 4] = integer2character85[integer // (85**0) % 85]
        if integer >> 32:
            return OverflowError(f"{integer} > uint32_max")
    return ''.join(encoded)[:len(encoded)-pad]

def decode85(string):
    pad = (-(len(string) % -5))
    string = string + '~'*pad
    buffer = bytearray(len(string) // 5 * 4)
    for i in range(0, len(string) // 5):
        integer = (character2integer85[string[i * 5 + 0]] * (85 ** 4) +  
                   character2integer85[string[i * 5 + 1]] * (85 ** 3) +
                   character2integer85[string[i * 5 + 2]] * (85 ** 2) +
                   character2integer85[string[i * 5 + 3]] * (85 ** 1) +
                   character2integer85[string[i * 5 + 4]] * (85 ** 0))
        buffer[i * 4 + 0] = integer >> 24 & 0xff
        buffer[i * 4 + 1] = integer >> 16 & 0xff
        buffer[i * 4 + 2] = integer >>  8 & 0xff
        buffer[i * 4 + 3] = integer >>  0 & 0xff
    return bytes(buffer[:len(buffer)-pad])
Jorropo commented 1 year ago

I don't care about this feel free to reopen a new issue yourself if you do.