Encoding value casing (eg. ASCII)

kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby

https://kaitai.io

4k stars 196 forks source link

Encoding value casing (eg. ASCII) #391

Closed koczkatamas closed 6 years ago

koczkatamas commented 6 years ago

Currently the JS runtime only allows the encoding value to be ASCII, but not ascii:

https://github.com/kaitai-io/kaitai_struct_javascript_runtime/blob/8fedb4d2f278f55d0d8fcf24a9bbf1c1ea2b150f/KaitaiStream.js#L511-L513

I checked the formats repo and 9 file formats used lowercase ascii at 23 occurrences.

Should we fix the formats or should the runtimes accept lowercase ascii too?

(I presume we should fix the runtimes, but maybe add a KSY styling rule which casing should be used generally?)

KOLANICH commented 6 years ago

I vote for matching against lower case strings, but using toLowerCase on encoding string first.

GreyCat commented 6 years ago

Totally agree, runtimes should treat these strings in case-insensitive way (actually, I thought all of them behave that way already — and they almost do so), and we should declare some canonical way to spell them in the style guide. I'd go with lowercasing everything, as it's simple and

Also, I'd declare a canonical way to spell widespread encoding names, i.e.:

utf8 vs utf-8
utf-16-le vs utf16-le vs utf-16le vs utf16le
iso8859-1 vs iso-8859-1 vs iso88591

And, while we're there, some encoding names should be definitely banned, for example, utf16 and ucs2 (as it lacks information on endianness and relies of current platform's native endianness) and I'm somewhat reluctant about ucs2le and ucs2be (as it's kind of hard to find true UCS2 encoding parser, not UTF16 one).

KOLANICH commented 6 years ago

And, while we're there, some encoding names should be definitely banned, for example, utf16 and ucs2 (as it lacks information on endianness and relies of current platform's native endianness) and I'm somewhat reluctant about ucs2le and ucs2be (as it's kind of hard to find true UCS2 encoding parser, not UTF16 one).

I'm against it. 1 If we specify endianness in meta I guess we mean that text encodings would also have that endianness. 2 platform endianness is not that bad and may be useful somewhere, for example if some system service, for example kernel, produces binary messages in machine endianness and we need to read these messages from the languages other than C/C++ (which has the natural way - just to include the header used to build that service and use the type defined in it).

arekbulski commented 6 years ago

Also, I'd declare a canonical way to spell widespread encoding names

On Python _ and - are interchangable, like in: utf_8 utf-8 utf8. See python docs.

some encoding names should be definitely banned, for example, utf16

If the issue is that this encoding has no endianness, then I should also point out the same python docs. UTF32 and UTF16 I think as well, have BOM markers within the stream that define the endianness. So the encodings are valid and endianness is known at runtime (well, at parsing time).

If we specify endianness in meta I guess we mean that text encodings would also have that endianness.

It would be better not to couple those, not without some guarantees that is the case.

koczkatamas commented 6 years ago

Okay, I will fix the JS runtime then and check other runtimes if they have the same issue (I don't expect them).

Quick search gave me the following "canonical" names for encodings (if I am not mistaken): UTF-8, UTF-16LE, ISO-8859-1. So I presume we could use their lowercase variants: utf-8, utf-16le and iso-8859-1, etc.

((About modifying the existing formats in the repo: I think it won't be necessary now to do it, but it will be enough later when we will(?) enforce the style guide for formats in the repo.))

GreyCat commented 6 years ago

There is actually #116 where I've started doing massive encodings table for it a long time ago: https://docs.google.com/spreadsheets/d/1l87kGi9_U4Xrgaw2CGaTc9-_f5UEf1nf-68Dk_e1_iA/edit?usp=sharing

So, yeah, I believe we should stick to "canonical" spellings as per IANA "preferred MIME name", or just IANA's "name", if it's there.

koczkatamas commented 6 years ago

US-ASCII looks pretty weird for me, I would keep ascii as an alias (compatibility reasons and simply it's easier to write down). But the others look okay.

GreyCat commented 6 years ago

There's no universal agreement in the world for all these names, so the very point of that table is to find some common ground and the foundation. While it's probably a very long term goal, what I propose now is to:

add a "white list" of encodings that are well-known, canonically named and are guaranteed to work in all languages
add a "grey list" of encodings which are known to be aliases and non-canonical spellings; for these, ksc would map them into canonical names in all languages, and issue a warning to the user
add a "black list" of encodings that should not be allowed; using these should break the build and clear error should be given to the user

koczkatamas commented 6 years ago

Fixed the issue in the JS runtime (https://github.com/kaitai-io/kaitai_struct_javascript_runtime/commit/dd658ee3f3a90ec562c13846dc914b56d1fed617) and released new npm+Github version (0.8.0-SNAPSHOT.11).

Checked the other runtimes and I did not find similar issue in them (usually they just pass the encoding value to the platforms native text decoder; in Lua it already lowercases the encoding before it compares to "ascii").

From my side this issue can be closed, but we can continue to use it for what @GreyCat just described in the previous comment.

GreyCat commented 6 years ago

Let's close this one ;) I'll add relevant issues separately.