Open hallipr opened 4 years ago
ternary operators in encoding
Unfortunately, the encoding key supports only literal constant values; it doesn't support KS expressions. So you really have to create two separate attributes and switch between them using if.
And note that even if it supported expressions, any encoding identifier would be a string, and every string literal in KS has to be enclosed in quotes:
encoding: 'is_wide ? "utf16" : "utf8"'
encoding: '"utf-8"' # ' for YAML, " for KS expression
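A minimal sketch of the two-attribute workaround described above. It assumes a hypothetical layout in which a signed 4-byte length is negative for UTF-16LE text and counts UTF-16 code units (hence the multiplication by 2); the ids encoded_string_example, narrow, wide and value are made up for illustration. As noted elsewhere in this thread, the ternary value instance may be rejected by 0.8 with "can't combine output types".

```yaml
meta:
  id: encoded_string_example   # hypothetical type id
  endian: le
seq:
  - id: length
    type: s4
  # Parsed only for non-negative lengths (assumed to mean UTF-8)
  - id: narrow
    type: str
    size: length
    encoding: UTF-8            # literal constant, as the encoding key requires
    if: 'length >= 0'
  # Parsed only for negative lengths (assumed to mean UTF-16LE);
  # assumption: the length counts 2-byte code units
  - id: wide
    type: str
    size: '-length * 2'
    encoding: UTF-16LE
    if: 'length < 0'
instances:
  # Single string-typed accessor, whichever field was actually parsed
  value:
    value: 'length < 0 ? wide : narrow'
```

Both fields are strings in the target language, so code that reads value should not need to know which encoding the raw data used.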
byte array with ternary to_s
don't know how to call method 'identifier(to_s)' of object type 'BytesLimitType(...)'
This bug has been fixed in 0.9 in commit https://github.com/kaitai-io/kaitai_struct_compiler/commit/375a140bb96adeb4c6031e478fad4e03eeabfefb. Make sure you have the latest development 0.9 KS compiler installed (https://kaitai.io/#download), or make your life easier and use the devel Web IDE, which always has the latest KSC.
The other thing is that this expression
value: 'bytes.to_s(utf8)'
is incorrect, because the string utf8 must be enclosed in quotes to be a string literal ("utf8"). Referencing utf8 would access the value of an attribute with id: utf8, which can be a seq field or, for example, a value instance:
instances:
  utf8:
    value: 'true ? "utf8" : "ascii"'
If no attribute called utf8 (referenced in the expression bytes.to_s(utf8)) exists, the compiler in the devel Web IDE should throw an error like unable to access 'utf8' in test::encoded_string context, but that unfortunately doesn't happen for some reason. Let's add this to the growing list of compiler bugs :-|
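For comparison, a small sketch of the corrected expression with the encoding passed as a quoted string literal. The surrounding spec (the type id to_s_example and the len field) is hypothetical, and the "UTF-8" spelling is borrowed from the encoding keys used elsewhere in this thread.

```yaml
meta:
  id: to_s_example             # hypothetical type id
seq:
  - id: len
    type: u1
  - id: bytes
    size: len                  # no type given, so this is a raw byte array
instances:
  value:
    # The single quotes delimit the YAML scalar; the double quotes make
    # "UTF-8" a KS string literal rather than a reference to an attribute.
    value: 'bytes.to_s("UTF-8")'
```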
I was surprised that field "bytes" was BytesLimitType and not a raw byte array.
BytesLimitType actually is a raw byte array. This is just the internal compiler data type of a seq or instances field that is parsed from the stream and defined using the size key. If the byte array were delimited using a terminator byte, it would be called BytesTerminatedType, and so on.
The DataType class inside the compiler internals stores all information needed to be able to parse an attribute, i.e. it is a more convenient representation of parsing-relevant keys in the attribute definition. For example, the BytesLimitType case class looks like this:
case class BytesLimitType(
  size: Ast.expr,
  terminator: Option[Int],
  include: Boolean,
  padRight: Option[Int],
  override val process: Option[ProcessExpr]
) extends BytesType
But if everything works correctly, this type diversity should not be a problem when you're e.g. working with BytesLimitType and CalcBytesType (i.e. created from a byte literal like [0xca, 0xfe]) in the same value instance or when you're calling some methods on them, because they all derive from an abstract class BytesType, and this type is used anywhere where it doesn't matter which specific byte array it is.
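To make the "same value instance" case concrete, here is a hedged sketch that mixes a size-delimited field with a byte literal in one ternary. All ids are hypothetical, and whether it compiles depends on the compiler version handling this type combination correctly, as discussed above.

```yaml
meta:
  id: bytes_mix_example        # hypothetical type id
seq:
  - id: has_magic
    type: u1
  # Internally a BytesLimitType, because it is defined using size
  - id: magic
    size: 2
    if: 'has_magic != 0'
instances:
  # Falls back to a byte literal (internally a CalcBytesType) when the
  # field is absent; both branches are byte arrays.
  effective_magic:
    value: 'has_magic != 0 ? magic : [0xca, 0xfe]'
```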
conditional fields and a ternary instance value
This is an instance of issue https://github.com/kaitai-io/kaitai_struct/issues/318, and again, the solution is to download the latest development (unstable) 0.9 version of KSC or use the devel Web IDE.
This isn't directly related to the original question, but you should be aware that strz currently doesn't work correctly in combination with "wide" encodings like UTF-16 - see #187. In your case it should be easy to work around this bug though - because you have the exact length of the string, you could use str instead of strz and remove the zero terminator by reducing the length by 1 character, or using substring after reading the string.
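Both variants of this workaround might look roughly like the sketch below. It is only an illustration under assumptions: num_chars is a hypothetical count of UTF-16 code units that includes the terminator and is at least 1, and all ids are made up.

```yaml
meta:
  id: wide_string_example      # hypothetical type id
  endian: le
seq:
  - id: trimmed_by_size
    type: wide_by_size
  - id: trimmed_by_substring
    type: wide_by_substring
types:
  # Variant 1: read one character less, then consume the terminator separately
  wide_by_size:
    seq:
      - id: num_chars
        type: u4
      - id: value
        type: str
        size: '(num_chars - 1) * 2'
        encoding: UTF-16LE
      - id: zero_terminator
        size: 2                # the 2-byte terminator left in the stream
  # Variant 2: read everything, then cut off the last character
  wide_by_substring:
    seq:
      - id: num_chars
        type: u4
      - id: raw
        type: str
        size: 'num_chars * 2'
        encoding: UTF-16LE
    instances:
      value:
        value: 'raw.substring(0, raw.length - 1)'
```

The first variant keeps the terminator out of the string entirely; the second keeps the seq simpler at the cost of an extra instance.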
Thanks. I'll do that
@hallipr, I'd just like to test your expectations, here. UTF-8 and UTF-16 strings really are different data types: you cannot trivially cast from one to the other -- they have to be converted. So, even if you could get a single field to contain either one, I suspect that it wouldn't be long before the downstream code ran into problems. So, it strikes me that working hard to get the KSY to paper over the differences isn't actually going to save you anything.
@webbnh You're right that the string representation in the raw data is different, but immediately after the data is read, the generated parser will convert it into the target language's native string representation. So even though the raw data may have different encodings, the value stored in the field will always have the same type (e.g. String in Java, str in Python), so when using the parsed field it no longer matters what encoding the string data had originally.
@dgelessus, how does that work out in C++? std::string is not the same as std::wstring there, and each of those is different from std::u16string. (You cannot easily convert wide characters to narrow characters, because some of them won't fit! ;-) And, if you convert narrow characters to wide characters, then the downstream code has to know about that in order to avoid passing them to things which expect ASCII.)
It's a bit more confusing when KS targets C++, because it represents both byte arrays and "true strings" (KS str/strz) as std::string. When a KS parser reads a raw byte array, the read data is directly read into a std::string and stored in the field. When it reads a "true string", it first reads the raw data into a std::string, then passes it through the runtime function kaitai::kstream::bytes_to_str, which returns another std::string containing a converted version of the raw data, and that converted string is then stored in the field.
bytes_to_str currently has two possible implementations that can be selected through macros: KS_STR_ENCODING_ICONV and KS_STR_ENCODING_NONE. The exact details are documented in the KS C++/STL notes, but in short: KS_STR_ENCODING_ICONV causes all "true strings" to be converted to a common encoding (by default UTF-8) using iconv, and KS_STR_ENCODING_NONE does no conversion at all (which effectively makes all "true strings" behave like byte arrays). The latter is obviously unsafe when working with formats that use more than one encoding, as the documentation points out. As far as I can tell, iconv is widely available on Unix systems, and there are implementations for Windows too, so it should be possible to use the KS_STR_ENCODING_ICONV implementation in almost all cases and work with strings without having to worry about encodings.
Probably utf-16le strings should be wstrings in C++. BTW, how about std::span?
KS_STR_ENCODING_ICONV causes all "true strings" to be converted to a common encoding (by default UTF-8)
~It's unclear whether that will serve @hallipr -- it depends on whether his downstream code is prepared to handle UTF-8 instead of ASCII....~ Nevermind -- @hallipr's issue is between UTF-8 and UTF-16 -- the latter (AFAIK) should be readily convertible to the former, and I agree that it could be handy if KS could do that for him.
BTW, how about std::span?
That's only available as of C++20; I, for one, am using C++11....
The downstream code works on "expected values" and opaque strings.
The expected values, i.e. switch statements with constant cases, are currently all ASCII, but there isn't any code expecting ASCII.
I feel like this may be my issue also? I wanted to write this:
text:
  doc: |
    A text string. Can be UTF-8 or UTF-16LE depending on the setting
    in the header.
  seq:
    - id: length
      type: u4
    - id: value
      type: str
      size: length
      encoding:
        switch-on: _root.header.encoding
        cases:
          0: UTF-16LE
          1: UTF-8
But I ended up having to hoist it up to the type:
text:
  doc: |
    A text string. Can be UTF-8 or UTF-16LE depending on the setting
    in the header.
  seq:
    - id: value
      type:
        switch-on: _root.header.encoding
        cases:
          0: text_utf16
          1: text_utf8
text_utf16:
  seq:
    - id: length
      type: u4
    - id: value
      type: str
      size: length
      encoding: UTF-16LE
text_utf8:
  seq:
    - id: length
      type: u4
    - id: value
      type: str
      size: length
      encoding: UTF-8
I need to parse null-terminated strings with a conditional encoding of utf16 or utf8. The strings are stored as Length + Value, with the length negated for utf16.
If I use separate types with type switching wide_string (utf16) and narrow_string (utf8), when comparing strings, I'm forced to continually cast them to the same type. I'm not sure I can do this when I compile to C#.
I've tried several approaches that avoid type switching and casting:
ternary operators in encoding
error: The encoding label provided ... is invalid.
byte array with ternary to_s
error: don't know how to call method 'identifier(to_s)'
I was surprised that field "bytes" was BytesLimitType and not a raw byte array.
conditional fields and a ternary instance value
error: can't combine output types
This works but is hacky: Conditional fields and ternary instance, forcing string type using .reverse.reverse or .substring(0, length).