saxbophone / basest-python

Arbitrary base binary-to-text encoder (any base to any base), in Python.
https://pypi.org/project/basest/
Mozilla Public License 2.0

Encoder/Decoder corruption for some larger output bases #19

Closed saxbophone closed 6 years ago

saxbophone commented 8 years ago

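I encountered an issue when decoding symbols that were encoded from base 128 to base 255. My hunch is that this happens because the encoding ratio is not exact (9 symbols of base 128 do not cover exactly the same range of values as 8 symbols of base 255) and because the output base is larger than the input base.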

Currently, the decoder always converts padding symbols to MAX (the highest-valued symbol) just before decoding, as base-85 does. I think this might only work when the input base is larger than the output base, so a different approach may be needed when the output base is larger.
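To make the base-85 style padding concrete, here is a minimal sketch written from scratch. It is not basest's implementation: the helper names (`to_int`, `to_digits`, `encode_chunk`, `decode_chunk`, `max_padding_is_safe`) are hypothetical, and it assumes the Ascii85 convention of dropping one output symbol for every padded input symbol. Under that assumption the decoder overshoots the true value by less than `out_base ** pad`, which leaves the real input symbols intact only while that bound does not exceed `in_base ** pad`.

```python
def to_int(digits, base):
    """Interpret a big-endian list of digits as an integer."""
    n = 0
    for d in digits:
        n = n * base + d
    return n


def to_digits(n, base, width):
    """Convert an integer to exactly `width` big-endian digits."""
    out = []
    for _ in range(width):
        n, d = divmod(n, base)
        out.append(d)
    return out[::-1]


def encode_chunk(data, in_base=256, out_base=85, in_ratio=4, out_ratio=5):
    """Encode one full or partial chunk, Ascii85 style (assumed convention)."""
    pad = in_ratio - len(data)
    n = to_int(list(data) + [0] * pad, in_base)  # pad the input with zeros
    # Drop one output symbol per padded input symbol.
    return to_digits(n, out_base, out_ratio)[:out_ratio - pad]


def decode_chunk(digits, in_base=256, out_base=85, in_ratio=4, out_ratio=5):
    """Decode one chunk, filling the dropped symbols with MAX first."""
    pad = out_ratio - len(digits)
    n = to_int(list(digits) + [out_base - 1] * pad, out_base)
    return to_digits(n, in_base, in_ratio)[:in_ratio - pad]


def max_padding_is_safe(in_base, out_base, pad):
    """True if the MAX-padding overshoot cannot reach the real symbols."""
    return out_base ** pad <= in_base ** pad


# Round-trips for every partial length of a base-256 to base-85 chunk.
for sample in ([97], [97, 98], [97, 98, 99], [97, 98, 99, 100]):
    assert decode_chunk(encode_chunk(sample)) == sample
```

With these definitions, `max_padding_is_safe(256, 85, 3)` holds while `max_padding_is_safe(128, 255, 4)` does not, which is consistent with the hunch that the approach depends on the input base being the larger one.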

Code for Encoder class:

from basest.encoders import Encoder

class StrictAsciiSquashEncoder(Encoder):
    input_base = 128
    output_base = 255
    input_ratio = 9
    output_ratio = 8
    # The Strict ASCII Set
    input_symbol_table = [chr(c) for c in range(128)]
    # Bytes 0 to 254
    output_symbol_table = [chr(c) for c in range(255)]
    padding_symbol = '\xff'

Sample decoding errors:

>>> sa = StrictAsciiSquashEncoder()
>>> 
>>> ''.join(sa.encode('slartybartfast'))
'z_\x92d$\xceW\xce\xf6\x11t\x0b\xff\xff\xff'
>>> sa.decode(''.join(sa.encode([s for s in 'slartybartfast'])))
['s', 'l', 'a', 'r', 't', 'y', 'b', 'a', 'r', 't', 'f', 'a', 's', 'v']

saxbophone commented 7 years ago

I've just thought of a potential work-around for this:

  1. If the input data length is not a multiple of the input ratio, the input is padded with 0 or MAX symbols until it is.
  2. The padded input is then converted as normal.
  3. Padding symbols are appended to the output, one for each 0 or MAX symbol that was added to the input.
  4. At decode time, these padding symbols are counted to determine how much the original input was padded, and hence how many symbols to strip from the end of the decoded output.
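
The four steps above can be sketched roughly as follows. This is an illustration only, not code from basest: the function names and the `PAD` marker are hypothetical, symbols are represented as plain integers, and the input is padded with 0 rather than MAX. Note that the encoded output is no longer a whole number of output chunks, since the padding markers are appended on top.

```python
PAD = None  # hypothetical padding marker, distinct from every digit value


def _to_int(digits, base):
    """Interpret a big-endian list of digits as an integer."""
    n = 0
    for d in digits:
        n = n * base + d
    return n


def _to_digits(n, base, width):
    """Convert an integer to exactly `width` big-endian digits."""
    out = []
    for _ in range(width):
        n, d = divmod(n, base)
        out.append(d)
    return out[::-1]


def encode(data, in_base, out_base, in_ratio, out_ratio):
    # Step 1: pad the input with zeros up to a multiple of in_ratio.
    pad = -len(data) % in_ratio
    padded = list(data) + [0] * pad
    # Step 2: convert each chunk as normal.
    out = []
    for i in range(0, len(padded), in_ratio):
        n = _to_int(padded[i:i + in_ratio], in_base)
        out.extend(_to_digits(n, out_base, out_ratio))
    # Step 3: append one padding marker per padded input symbol.
    return out + [PAD] * pad


def decode(data, in_base, out_base, in_ratio, out_ratio):
    data = list(data)
    # Step 4: count and remove the trailing padding markers.
    pad = 0
    while data and data[-1] is PAD:
        data.pop()
        pad += 1
    out = []
    for i in range(0, len(data), out_ratio):
        n = _to_int(data[i:i + out_ratio], out_base)
        out.extend(_to_digits(n, in_base, in_ratio))
    # Strip the symbols that only existed because of input padding.
    return out[:len(out) - pad]


message = [ord(c) for c in 'slartybartfast']
encoded = encode(message, 128, 255, 9, 8)
assert decode(encoded, 128, 255, 9, 8) == message
```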

saxbophone commented 6 years ago

Closing this, as I don't think it's possible to use padding that is compatible with base64, Ascii85, etc. when the output base is larger than the input base. If a user wants to encode to a larger output base than the input base, they will have to pass in only exact chunk-sized pieces of data (a whole multiple of the input ratio).
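
One way to see why base64-style padding runs into trouble here, offered as a sketch rather than a proof about basest itself: in base64 and Ascii85, every partial input length maps to a distinct output length, so the number of padding symbols identifies how much input there was. With a 9:8 ratio there are nine possible input lengths per chunk but only eight possible output lengths, so two of them must collide. The helper names below are hypothetical and the calculation only counts the minimum number of output symbols needed for each input length.

```python
def min_output_symbols(n_input, in_base, out_base):
    """Smallest m such that out_base ** m >= in_base ** n_input."""
    m = 0
    while out_base ** m < in_base ** n_input:
        m += 1
    return m


def output_lengths(in_base, out_base, in_ratio):
    """Map each input length 1..in_ratio to its minimum output length."""
    return {
        n: min_output_symbols(n, in_base, out_base)
        for n in range(1, in_ratio + 1)
    }


def lengths_are_unambiguous(in_base, out_base, in_ratio):
    """True if no two input lengths share the same output length."""
    lengths = output_lengths(in_base, out_base, in_ratio)
    return len(set(lengths.values())) == len(lengths)


print(output_lengths(256, 85, 4))   # Ascii85-style chunk
print(output_lengths(128, 255, 9))  # the encoder from this issue
```

For the 128 to 255 case, input lengths 8 and 9 both need 8 output symbols, so a decoder that only sees an unpadded 8-symbol chunk cannot tell a full chunk from one that was short by a symbol.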