TryteString.as_bytes() does not work as expected

th0br0 commented 7 years ago

Expected result (or similar):

>>> iota.TryteString(b'ABABZZDD99').as_bytes()
(30, [55,178,248,107,12,0])

Actual result:

>>> iota.TryteString(b'ABABZZDD99').as_bytes()
Traceback (most recent call last):
  File "iota.lib.py/iota/codecs.py", line 147, in decode
    + (self.index[second] * len(self.index))
ValueError: byte must be in range(0, 256)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iota.lib.py/iota/types.py", line 421, in as_bytes
    return decode(self._trytes, 'trytes', errors)
  File "iota.lib.py/iota/codecs.py", line 165, in decode
    'input': input,
iota.codecs.TrytesDecodeError: 'trytes' codec can't decode trytes ZZ at position 4-5: ordinal not in range(255)

todofixthis commented 7 years ago

Hey @th0br0. This is actually expected behaviour — the tryte sequence ZZ represents value -28, which cannot be converted into a single byte.

>>> from iota.types import int_from_trits, TryteString

>>> int_from_trits(TryteString('ZZ').as_trits())
-28

>>> bytearray([-28])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: byte must be in range(0, 256)
byte must be in range(0, 256)

In order to convert this sequence into a byte string, you must instruct PyOTA to handle invalid sequences differently (this is very similar to how string.decode('utf-8') handles invalid bytes):

>>> from iota import TryteString

>>> TryteString(b'ABABZZDD99').as_bytes('strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/phx/Documents/pyota/iota/types.py", line 498, in as_bytes
    return decode(self._trytes, 'trytes', errors)
  File "/Users/phx/Documents/pyota/iota/codecs.py", line 165, in decode
    'input': input,
iota.codecs.TrytesDecodeError: 'trytes' codec can't decode trytes ZZ at position 4-5: ordinal not in range(255)
'trytes' codec can't decode trytes ZZ at position 4-5: ordinal not in range(255)

>>> TryteString(b'ABABZZDD99').as_bytes('replace')
b'77?p\x00'

>>> TryteString(b'ABABZZDD99').as_bytes('ignore')
b'77p\x00'

th0br0 commented 7 years ago

I beg to differ. I'm not well-versed enough in Python3 codecs to pinpoint exactly where things need to change, but: as_bytes() should split the trits into chunks of up to 5 trits each and then essentially do:

value = 0
RADIX = 3
chunk = [-1,1,1,1,-1]
for c in chunk:
  value = value * RADIX + c

the resulting value is signed (-69) can then be converted into an unsigned byte via value & 0xFF == 187. If the final chunk has less than 5 elements, expect them to be 0 (but multiply by the RADIX anyway). See https://github.com/iotaledger/iota.lib.java/blob/master/src/main/java/jota/utils/Converter.java#L56 for the Java impl.

Decomposition can happen either via a lookup table computed at library load, e.g. https://github.com/iotaledger/iota.lib.java/blob/master/src/main/java/jota/utils/Converter.java#L37, or have a look at the code at the bottom. It's equal to what the Java code does.

Note that you need to store the number of encoded trits for decomposition of multiple bytes to work.

This is what both the Java & Rust libraries do, I don't think that the JS library currently implements byte conversion. The current Python library can't even do iota.TryteString(b'Z').as_bytes() ?!

RADIX = 3
MAX_TRIT_VALUE = 1
MIN_TRIT_VALUE = -1

def increment(trits):
    for i in range(5):
        trits[i] += 1
        if trits[i] > MAX_TRIT_VALUE:
            trits[i] = MIN_TRIT_VALUE
        else:
            break
    return trits

if __name__ == '__main__':
    trits = [0] * 5
    for i in range(243):
        value = 0
        # to_bytes
        for t in trits[::-1]:
            value = value * RADIX + t

        print("%s=%d"% (trits, value & 0xFF))

        # to_trits
        # there's 243 different value combinations
        # this is just easier than having to worry about signedness when converting back...
        value += 121

        chunk = []
        for _ in range(5):
            rem = value % RADIX
            chunk.append(int(rem))
            value = (value - rem) / RADIX
        chunk = [x - 1 for x in chunk]

        assert(trits == chunk)

        trits = increment(trits)

todofixthis commented 7 years ago

Thanks for the thorough explanation! I see now what I was missing before.

TryteString.as_bytes() was modelled after the JS lib's fromTrytes function. But, it's not a proper "bytes from trytes" implementation, as you've noted.

I like the idea of implementing a proper trytes -> bytes conversion process (and I'm excited to finally see some maths showing how it works!); maybe we can rename TryteString.as_bytes() to as_ascii_encoded or similar, and then as_bytes can do the proper trytes -> bytes conversion.

Given everything that's going on with PyOTA right now, I think this will have to wait until v2.1 (v2.0 has the new hash algorithm and multisig — there's already enough backwards-incompatible changes for one release!).

In the meantime, I think you can get the result you're looking for by first calling TryteString.as_trits() and then using the code you posted to convert to byte values.

todofixthis commented 6 years ago

Hey @th0br0 I'm starting to work on this now. Would you be willing to help me come up with some test cases for this functionality?

There's a few scenarios that I'm interested in:

At least two "happy path" scenarios (one long sequence, one short sequence), both directions (trits <-> bytes).
Negative values.
Zero/empty values:
- Should trits [0] be converted to bytes b'' or b'\0'?
- What about [0, 0, 0, ...]?
- What about []?
Edge cases for 5-trits-per-byte (from what I understand, it's not exactly 5 trits to 1 byte — are there any sequences of trits or bytes that we can use to demonstrate this?).
Others?

@alon-e @paulhandy Would love to get your thoughts on this as well. What are some unit tests we can use for trits <-> bytes conversion?

mlouielu commented 6 years ago

Hi all, I'm currently working on Python reading rocksdb data, and it is relative to this topic.

Problem

I think TryteString naming and lack of documentation only confusing Python users.

TryteString should act like str, not bytes:

the name TryteString should just act like the str object in Python, not bytes, otherwise, it should be named as TryteBytes.

TryteString's method from_string, as_string, from_bytes, as_bytes confuse:

from_string will encode str with utf-8, but iota.TryteString(str) only accept tryte alphabet, not act like from_string. from_bytes take normal bytes, but not the bytes from trits.

TryteString misused bytes:

In Python, bytes is kind of representation of codepoint, that is, '你好' in UTF-8 will be 'b'\xe4\xbd\xa0\xe5\xa5\xbd', but in TryteString, we use bytes as a limited check for the input in the trytes alphabet range, and make it confuse.

For examle, TryteString('HELLO') == TryteString(b'HELLO').

For TryteString, its bytes should be TryteString convert to bytes, e.g. TryteString('EXAMPLE').as_bytes() to b'\xb4T\xa1\xa0\x01'

How to fixed it

Deprecate the usage of TryteString.

TryteString internal representation as str (trytes), and correct the behavior of the method. This will be a big change since all around the library was using TryteString with bytes.

Good

TryteString('EXAMPLE')                                   # input: trytes: EXAMPLE, output "EXAMPLE"
TryteString('OBGCKBWBZBVBOB').as_string() # input: trytes, output "EXAMPLE"
TryteString.from_string('EXAMPLE')             # input: str: EXAMPLE, output "OBGCKBWBZBVBOB"
TryteString.from_string('你好')                      # input: str: 你好, output: "LH9GYEMHCF9G"
TryteString.from_bytes(b'\xb4T\xa1\xa0\x01')  # input: tryte-bytes, output "EXAMPLE"
bytes(TryteString('EXAMPLE'))                      # input: trytes, output b'EXAMPLE'

Bad

TryteString('你好')                  # input should be trytes-string (in trytes-alphabet)
TryteString('@@##HELLO')  # input should be trytes-string (in trytes-alphabet)
TryteString(b'EXAMPLE')       # input should not be bytes
TryteString.from_string(b'EXAMPLE')  input should be str type
TryteString.from_bytes(b'EXAMPLE')   # input should be tryte-bytes (trytes-string binary format)

Rename TryteString to TryteBytes, and correct from_bytes, as_bytes behavior.

todofixthis commented 6 years ago

Hey @mlouielu thanks for the info and the suggestions!

I see what you're getting at with TryteBytes. I don't think we need to go that far, but there is definitely room for improvement around the way that TryteString handles conversion to/from str and bytes values.

Let's figure out the trits <-> bytes conversion first, and then we can move on to trytes <-> characters.

The most important thing we need right now to fix this issue is test cases. Could you help with https://github.com/iotaledger/iota.lib.py/issues/62#issuecomment-336674870 ?

todofixthis commented 6 years ago

Thanks @mlouielu for the test cases!! I will look at them this weekend, and I'll let you know if I have any questions.

vbakke commented 6 years ago

Hi @todofixthis

Your latest commit regarding trytes was naming only. Is that correct? Or did you change some of the behaviour as well?

There is a lot of concurrent work on the byte <-> tryte conversion. Both in #62 and #90 by you and @th0br0, and @mlouielu in the rocksdb.

I work on encrypting IOTA seeds (somewhat similar to BIP38), and @dashengSun (with feedback from @paulhand) work on Unicode to trytes in the iota.lib.js and @pRizz has published pRizz/Tryte-UTF8-JSON-Codec, @tuupola has tuupola/trytes (for PHP) and @knarz has knarz/trytes (for Rust).

Maybe we should coordinate a little, share the good ideas, and pin point difficulties before they get published and confuse the rest of the community? Because tryte <-> byte conversion is not trivial.

Square pegs and round holes The English saying about fitting a square peg in a round hole is very describing. There will be trouble with the function that converts back to the original type.

We probably should not pollute issue #62 with this discussion. Please suggest a better place. If we lack better options, I just now created vbakke/trytes where we can discuss this freely.

todofixthis commented 6 years ago

Hey @vbakke About 4 months ago, I started work on this issue. I thought we'd have it wrapped up in a couple of weeks, so I made some changes to the codebase to make room for the new codec and mark the existing codec as deprecated.

However, at this point, I'm still not sure that we have a bytes <-> trytes conversion process that can handle all edge cases (e.g., bytes in range [244, 255], certain cases where encode(decode(value)) != value, etc.), so I haven't made much progress on this issue.

vbakke commented 6 years ago

Hi @todofixthis ,

I've been spending most of my spare time during xmas looking at this as well. I just now published my latest thoughts on the issue, as well as working example code (JavaScript, though).

I think it covers all corner cases, and includes Unicode text to tryte strings, including a BOM mark to indicate the encoding, if used.

Feel free to have a look at my latest write up of the Readme.md. I don't think there is a universal two-way byte <->tryte conversion.

todofixthis commented 6 years ago

Thanks @vbakke! I'll have a look at your notes later this week. Quick glance shows that they are very thorough; looking forward to going through them!

vbakke commented 6 years ago

FYI I just added an online version of encoding native strings, and native trytes at https://vbakke.github.io/trytes/

iotaledger / iota.py

TryteString.as_bytes() does not work as expected #62

Problem

How to fixed it