String Field does not handle non-UTF8 values correctly

cheukwing commented 8 months ago

The String field (and potentially other fields) does not handle the encoding of non-UTF8 values correctly, leading to incorrect lengths.

For example, take the value hüllo, and encoding EBCDIC 1047.

In UTF-8 hexadecimal bytes, this is 0x68 0xC3 0xBC 0x6C 0x6C 0x6F, where ü is two bytes 0xC3 0xBC. In EBCDIC 1047, this is 0x88 0xDC 0x93 0x93 0x96, where ü is one byte 0xDC.

In the String field, padding is done before the data is encoded, and assumes the length of the data in UTF8 will be the same as the length of the data in the encoded format.

This means that when we specify a fixed length of 10 with padding, we get 0x40 0x40 0x40 0x40 0x88 0xDC 0x93 0x93 0x96, which is only length 9. When we specify a variable LL length, we get 0xF0 0xF6 0x88 0xDC 0x93 0x93 0x96, which encoded that it has a length of 6 0xF0 0xF6, yet the actual data only has a length of 5.

spec := &Spec{
    Length:      10,
    Description: "Field",
    Enc:         encoding.EBCDIC1047,
    Pref:        prefix.EBCDIC1047.Fixed,
    Pad:         padding.Left(' '),
}
str := NewString(spec)
str.SetValue("hüllo")

packed, _ := str.Pack()
fmt.Printf("%X\n", packed) // 4040404088DC939396

lengthedSpec := &Spec{
    Length:      10,
    Description: "Field",
    Enc:         encoding.EBCDIC1047,
    Pref:        prefix.EBCDIC1047.LL,
}
str = NewString(lengthedSpec)
str.SetValue("hüllo")
packed, _ = str.Pack()
fmt.Printf("%X\n", packed) // F0F688DC939396

alovak commented 7 months ago

Hey @cheukwing! Thanks for reporting the issue. I'm not sure how to solve this issue, but PR your created #298 reverts this PR https://github.com/moov-io/iso8583/pull/227 which fixed the issue, when encoded BCD length was not equal to the original length (for 2 input bytes / ASCII chars you get 1 output byte / hex). Following that PR we assume that the values we work with (when we specify the field Length) are ASCII chars, int64 or bytes.

I'm not sure how we can support UTF8 strings. Here is how we can do it by working with runes instead of bytes: https://github.com/moov-io/iso8583/pull/299. I took your branch and test in it.

adamdecaf commented 7 months ago

To support utf8 strings wouldn't we need to encode []rune? []byte wouldn't work because a rune can be more than one byte.

alovak commented 7 months ago

@adamdecaf I found that we can use utf8.RuneCount(data) to address this. I also think that we should switch to runes only in String and padding package (it's uses runes already, but not everywhere).

In the https://github.com/moov-io/iso8583/pull/299 I did exactly this: switched to runes without modifying the logic. All tests that were created by @cheukwing in his PR pass.

alovak commented 7 months ago

@cheukwing the utf8 support is tricky. I'm still not sure how we should address it.

While #299 addresses your need, I added one more test to show the controversy of the change:

https://github.com/moov-io/iso8583/pull/299/files#diff-628a377b7ef394e8fa0b877233eee6e2fbc4617798fe045b7cc8f31935c92f51R63

In the test, the Spec encoding is Binary. The test fails, as the value we pack and unpacked value do not match 🤷‍♂️ and the packed length is not 10 as expected by the spec.

    spec := &Spec{
        Length:      10,
        Description: "Field",
        Enc:         encoding.Binary, 
        Pref:        prefix.Binary.Fixed,
        Pad:         padding.Left(' '),
    }
    str := NewStringValue("hüllo")
    str.SetSpec(spec)
    packed, err := str.Pack()
    require.NoError(t, err)

    assert.Len(t, packed, 10) // fails

    str2 := NewString(spec)
    _, err = str2.Unpack(packed)
    require.NoError(t, err)

    assert.Equal(t, "hüllo", str2.Value()) // fails

The #299 PR works only because of custom encoding that encodes utf8 char into single byte char. It's not possible to have a fixed length for the utf8 string as when you read data from the network you read N bytes, not N utf8 chars.

Maybe the solution here is to create custom field? We can use runes but I think that may be confusing in the future. Please, share your thoughts about it all.

cheukwing commented 7 months ago

The issue is not so much about supporting UTF-8, but about (fully) supporting encodings which are not subsets of UTF-8 (e.g. EBCDIC1047). Currently, because Go strings are UTF-8, the encoded length/padding of EBCDIC1047 strings which include characters which are two bytes in UTF-8 will be incorrect. We do not necessarily need to use runes everywhere to support this.

I tried your new test on my original PR, and it passed, although I understand that adding this padding logic in the field is not ideal. The problem with the current implementation is that we encode the data before we pad it, so we do not know how many padding characters we should add if the number of runes differs from the encoded length.

adamdecaf commented 7 months ago

Sounds like we need to read/write the raw bytes to fully support EBCDIC1047 and translate them into Go runes.

moov-io / iso8583

String Field does not handle non-UTF8 values correctly #297