Can't combine output types for strings with different encodings

hallipr commented 4 years ago

I need to parse null-termintated strings with conditional encoding utf16 or utf8 The strings are stored as Length + Value, with length negated for utf16.

If I use separate types with type switching wide_string (utf16) and narrow_string (utf8), when comparing strings, I'm forced to continually cast them to the same type. I'm not sure I can do this when I compile to C#.

I've tried several approaches that avoid type switching and casting:

ternary operators in encoding

  encoded_string:
    seq:
      - id: length
        type: s4
      - id: value
        type: strz
        size: is_wide ? length * -2 : length
        encoding: is_wide ? utf16 : utf8
    instances:
      is_wide:
        value: length < 0

error: The encoding label provided ... is invalid.

Call stack: undefined RangeError: Failed to construct 'TextDecoder': The encoding label provided ('is_wide ? utf8 : utf16') is invalid.

byte array with ternary to_s

  encoded_string:
    seq:
      - id: length
        type: s4
      - id: bytes
        size: is_wide ? length * -2 : length
    instances:
      is_wide:
        value: length < 0
      value:
        value: is_wide ? bytes.to_s(utf16) : bytes.to_s(utf8)

error: don't know how to call method 'identifier(to_s)'

RuntimeException: don't know how to call method 'identifier(to_s)' of object type 'BytesLimitType(IfExp(Name(identifier(is_wide)),BinOp(Name(identifier(length)),Mult,UnaryOp(Minus,IntNum(2))),Name(identifier(length))),None,false,None,None)'

I was surprised that field "bytes" was BytesLimitType and not a raw byte array.

conditional fields and a ternary instance value

  encoded_string:
    seq:
      - id: length
        type: s4
      - id: narrow
        type: strz
        size: length
        encoding: utf8
        if: is_wide == false
      - id: wide
        type: strz
        size: length * -2
        encoding: utf16
        if: is_wide
    instances:
      is_wide:
        value: length < 0
      value:
        value: (is_wide ? wide : narrow)

error: can't combine output types

    /types/encoded_string/instances/value/value: can't combine output types: StrFromBytesType(BytesLimitType(BinOp(Name(identifier(length)),Mult,UnaryOp(Minus,IntNum(2))),Some(0),false,None,None),utf16) vs StrFromBytesType(BytesLimitType(Name(identifier(length)),Some(0),false,None,None),utf8)

This works but is hacky: Conditional fields and ternary instance, forcing string type using .reverse.reverse or .substring(0, length).

  encoded_string:
    seq:
      - id: length
        type: s4
      - id: narrow
        type: strz
        size: length
        encoding: utf8
        if: is_wide == false
      - id: wide
        type: strz
        size: length * -2
        encoding: utf16
        if: is_wide
    instances:
      is_wide:
        value: length < 0
      value:
        value: is_wide ? wide.substring(0, -length - 1) : narrow.substring(0, length - 1)

generalmimon commented 4 years ago

ternary operators in encoding

Unfortunately, key encoding supports only literal constant values, it doesn't support KS expressions. So you really have to create two separate attributes and switch them using if.

And note that even if it supported expressions, any encoding identifier would be a string, and every string literal in KS has to be enclosed in quotes:

encoding: 'is_wide ? "utf16" : "utf8"'
encoding: '"utf-8"' # ' for YAML, " for KS expression

byte array with ternary to_s

don't know how to call method 'identifier(to_s)' of object type 'BytesLimitType(...)'

This bug has been fixed in 0.9 in commit https://github.com/kaitai-io/kaitai_struct_compiler/commit/375a140bb96adeb4c6031e478fad4e03eeabfefb. Make sure you have the latest development 0.9 KS compiler installed (https://kaitai.io/#download), or make your life easier and use the devel Web IDE, which has always the latest KSC.

The other thing is that this expression

        value: 'bytes.to_s(utf8)'

is incorrect, because the string utf8 must be enclosed in quotes to be a string literal ("utf8"). Referencing utf8 would access the value of an attribute with id: utf8, which can be a seq field or for example a value instance:

instances:
  utf8:
    value: 'true ? "utf8" : "ascii"'

If any attribute called utf8 (referenced in the expression bytes.to_s(utf8)) doesn't exist, the compiler in the devel Web IDE should throw an error like unable to access 'utf8' in test::encoded_string context, but that unfortunately doesn't happen for some reason. Let's add this to the growing list of compiler bugs :-|

I was surprised that field "bytes" was BytesLimitType and not a raw byte array.

BytesLimitType actually is a raw byte array. This is just an internal compiler data type of a seq or instances field parsed from stream defined using the size key. If the byte array would be delimited using terminator byte, it would be called BytesTerminatedType and so on.

The DataType class inside the compiler internals stores all information needed to be able to parse an attribute, i.e. it is a more convenient representation of parsing-relevant keys in the attribute definition. For example, the BytesLimitType case class looks like this:

  case class BytesLimitType(
    size: Ast.expr,
    terminator: Option[Int],
    include: Boolean,
    padRight: Option[Int],
    override val process: Option[ProcessExpr]
  ) extends BytesType

But if everything works correctly, this type diversity should not be a problem when you're e.g. working with BytesLimitType and CalcBytesType (i.e. created from a byte literal like [0xca, 0xfe]) in the same value instance or when you're calling some methods on them, because they all derive from an abstract class BytesType, and this type is used anywhere where it doesn't matter which specific byte array it is.

conditional fields and a ternary instance value

This is an instance of issue https://github.com/kaitai-io/kaitai_struct/issues/318, and again, the solution is to download the latest development (unstable) 0.9 version of KSC or use the devel Web IDE.

dgelessus commented 4 years ago

This isn't directly related to the original question, but you should be aware that strz currently doesn't work correctly in combination with "wide" encodings like UTF-16 - see #187. In your case it should be easy to work around this bug though - because you have the exact length of the string, you could use str instead of strz and remove the zero terminator by reducing the length by 1 character, or using substring after reading the string.

hallipr commented 4 years ago

Thanks. I'll do that

On Sat, Sep 19, 2020, 2:27 PM dgelessus notifications@github.com wrote:

This isn't directly related to the original question, but you should be aware that strz currently doesn't work correctly in combination with "wide" encodings like UTF-16 - see #187 https://github.com/kaitai-io/kaitai_struct/issues/187. In your case it should be easy to work around this bug though - because you have the exact length of the string, you could use str instead of strz and remove the zero terminator by reducing the length by 1 character, or using substring after reading the string.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/kaitai-io/kaitai_struct/issues/810#issuecomment-695357762, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ3K4T75UB5BVYKYXZVADLSGUO43ANCNFSM4RSRVBYQ .

webbnh commented 4 years ago

@hallipr, I'd just like to test your expectations, here. UTF-8 and UTF-16 strings really are different data types: you cannot trivially cast from one to the other -- they have to be converted. So, even if you could get a single field to contain either one, I suspect that it wouldn't be long before the downstream code ran into problems. So, it strikes me that working hard to get the KSY to paper over the differences isn't actually going to save you anything.

dgelessus commented 4 years ago

@webbnh You're right that the string representation in the raw data is different, but immediately after the data is read, the generated parser will convert it into the target language's native string representation. So even though the raw data may have different encodings, the value stored in the field will always have the same type (e. g. String in Java, str in Python), so when using the parsed field it no longer matters what encoding the string data had originally.

webbnh commented 4 years ago

@dgelessus, how does that work out in C++? std::string is not the same as std::wstring there, and each of those is different from std::u16string. (You cannot easily convert wide characters to narrow characters, because some of them won't fit! ;-) And, if you convert narrow characters to wide characters, then the downstream code has to know about that in order to avoid passing them to things which expect ASCII.)

dgelessus commented 4 years ago

It's a bit more confusing when KS targets C++, because it represents both byte arrays and "true strings" (KS str/strz) as std::string. When a KS parser reads a raw byte array, the read data is directly read into a std::string and stored in the field. When it reads a "true string", it first reads the raw data into a std::string, then passes it through the runtime function kaitai::kstream::bytes_to_str, which returns another std::string containing a converted version of the raw data, and that converted string is then stored in the field.

bytes_to_str currently has two possible implementations that can be selected through macros: KS_STR_ENCODING_ICONV and KS_STR_ENCODING_NONE. The exact details are documented in the KS C++/STL notes, but in short: KS_STR_ENCODING_ICONV causes all "true strings" to be converted to a common encoding (by default UTF-8) using iconv, and KS_STR_ENCODING_NONE does no conversion at all (which effectively makes all "true strings" behave like byte arrays). The latter is obviously unsafe when working with formats that use more than one encoding, as the documentation points out. As far as I can tell, iconv is widely available on Unix systems, and there are implementations for Windows too, so it should be possible to use KS_STR_ENCODING_ICONV implementation in almost all cases, so that you can work with strings without having to worry about encodings.

KOLANICH commented 4 years ago

Probably utf-16le strings should be wstrings in C++. BTW, how about std::span?

webbnh commented 4 years ago

KS_STR_ENCODING_ICONV causes all "true strings" to be converted to a common encoding (by default UTF-8)

~It's unclear whether that will serve @hallipr -- it depends on whether his downstream code is prepared to handle UTF-8 instead of ASCII....~ Nevermind -- @hallipr's issue is between UTF-8 and UTF-16 -- the latter (AFAIK) should be readily convertible to the former, and I agree that it could be handy if KS could do that for him.

BTW, how about std::span?

That's only available as of C++20; I, for one, am using C++11....

hallipr commented 4 years ago

The downstream code works on "expected values" and opaque strings.

For expected values, i.e. Switch statements with constant cases, they're currently all ascii, but there isn't any code expecting ascii

On Mon, Sep 21, 2020, 2:25 PM webbnh notifications@github.com wrote:

KS_STR_ENCODING_ICONV causes all "true strings" to be converted to a common encoding (by default UTF-8)

It's unclear whether that will serve @hallipr https://github.com/hallipr -- it depends on whether his downstream code is prepared to handle UTF-8 instead of ASCII....

BTW, how about std::span?

That's only available as of C++20; I, for one, am using C++11....

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaitai-io/kaitai_struct/issues/810#issuecomment-696386235, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ3K4V3AYTGHP57LQE2AGDSG7AEJANCNFSM4RSRVBYQ .

hakanai commented 3 years ago

I feel like this may be my issue also? I wanted to write this:

  text:
    doc: |
      A text string. Can be UTF-8 or UTF-16LE depending on the setting
      in the header.
    seq:
      - id: length
        type: u4
      - id: value
        type: str
        size: length
        encoding:
          switch-on: _root.header.encoding
          cases:
            0: UTF-16LE
            1: UTF-8

But I ended up having to hoist it up to the type:

  text:
    doc: |
      A text string. Can be UTF-8 or UTF-16LE depending on the setting
      in the header.
    seq:
      - id: value
        type:
          switch-on: _root.header.encoding
          cases:
            0: text_utf16
            1: text_utf8

  text_utf16:
    seq:
      - id: length
        type: u4
      - id: value
        type: str
        size: length
        encoding: UTF-16LE

  text_utf8:
    seq:
      - id: length
        type: u4
      - id: value
        type: str
        size: length
        encoding: UTF-8

kaitai-io / kaitai_struct