koczkatamas closed this issue 6 years ago
I vote for matching against lowercase strings, but calling `toLowerCase` on the encoding string first.
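A minimal sketch of what such case-insensitive matching could look like (the helper name is hypothetical, not the actual KaitaiStream API):

```javascript
// Hypothetical helper: normalize the user-supplied encoding name once,
// then compare against lowercase canonical spellings.
function isAsciiEncoding(encoding) {
  switch (encoding.toLowerCase()) {
    case "ascii":
    case "us-ascii":
      return true;
    default:
      return false;
  }
}

console.log(isAsciiEncoding("ASCII")); // true
console.log(isAsciiEncoding("ascii")); // true
console.log(isAsciiEncoding("utf-8")); // false
```

This way the runtime accepts any casing on input while the style guide can still mandate a single lowercase spelling in `.ksy` files.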
Totally agree, runtimes should treat these strings in a case-insensitive way (actually, I thought all of them behaved that way already, and they almost do), and we should declare some canonical way to spell them in the style guide. I'd go with lowercasing everything, as it's simple.
Also, I'd declare a canonical way to spell widespread encoding names, i.e.:

- `utf8` vs `utf-8`
- `utf-16-le` vs `utf16-le` vs `utf-16le` vs `utf16le`
- `iso8859-1` vs `iso-8859-1` vs `iso88591`
And, while we're there, some encoding names should definitely be banned, for example `utf16` and `ucs2` (as they lack information on endianness and rely on the current platform's native endianness), and I'm somewhat reluctant about `ucs2le` and `ucs2be` (as it's kind of hard to find a true UCS2 encoding parser, as opposed to a UTF16 one).
I'm against it.

1. If we specify endianness in `meta`, I guess we mean that text encodings would also have that endianness.
2. Platform endianness is not that bad and may be useful somewhere, for example if some system service, such as the kernel, produces binary messages in machine endianness and we need to read these messages from languages other than C/C++ (which have the natural way: just include the header used to build that service and use the type defined in it).
> Also, I'd declare a canonical way to spell widespread encoding names

In Python, `_` and `-` are interchangeable, as in `utf_8`, `utf-8`, `utf8`. See the Python docs.
> some encoding names should be definitely banned, for example, utf16

If the issue is that this encoding has no endianness, then I should also point out the same Python docs. UTF-32, and I think UTF-16 as well, can have a BOM marker within the stream that defines the endianness. So the encodings are valid and the endianness is known at runtime (well, at parsing time).
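For reference, resolving the endianness from a BOM at parse time could look roughly like this (a sketch only; the function name and the no-BOM fallback policy are assumptions, not actual runtime behavior):

```javascript
// Sketch: peek at the first two bytes of a UTF-16 stream and pick a
// concrete encoding based on the byte-order mark. What to do when no
// BOM is present is itself a policy decision; LE is assumed here.
function resolveUtf16(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "utf-16le";
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "utf-16be";
  }
  return "utf-16le"; // assumed fallback
}
```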
> If we specify endianness in meta I guess we mean that text encodings would also have that endianness.

It would be better not to couple those, not without some guarantee that this is the case.
Okay, I will fix the JS runtime then and check whether the other runtimes have the same issue (I don't expect them to).
A quick search gave me the following "canonical" names for encodings (if I am not mistaken): `UTF-8`, `UTF-16LE`, `ISO-8859-1`. So I presume we could use their lowercase variants: `utf-8`, `utf-16le`, `iso-8859-1`, etc.
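One way to enforce such canonical spellings in a runtime would be a small normalization step. The alias table and function below are hypothetical (they are not part of any Kaitai runtime); the Python-style `_`/`-` interchangeability mentioned earlier is folded in as well:

```javascript
// Hypothetical normalizer: lowercase, treat "_" like "-", then map known
// aliases to one canonical lowercase spelling per encoding.
const ENCODING_ALIASES = new Map([
  ["utf8", "utf-8"],
  ["utf16le", "utf-16le"],
  ["utf-16-le", "utf-16le"],
  ["iso8859-1", "iso-8859-1"],
  ["iso88591", "iso-8859-1"],
  ["us-ascii", "ascii"], // keeping "ascii" as the short form is an assumption
]);

function canonicalEncoding(name) {
  const lowered = name.toLowerCase().replace(/_/g, "-");
  return ENCODING_ALIASES.get(lowered) || lowered;
}
```

A normalizer like this would let runtimes accept the common variants while comparisons internally happen against a single spelling.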
((About modifying the existing formats in the repo: I think it won't be necessary to do it now; it will be enough to do it later, when we will(?) enforce the style guide for formats in the repo.))
There is actually #116 where I've started doing massive encodings table for it a long time ago: https://docs.google.com/spreadsheets/d/1l87kGi9_U4Xrgaw2CGaTc9-_f5UEf1nf-68Dk_e1_iA/edit?usp=sharing
So, yeah, I believe we should stick to "canonical" spellings as per IANA "preferred MIME name", or just IANA's "name", if it's there.
`US-ASCII` looks pretty weird to me; I would keep `ascii` as an alias (for compatibility reasons, and simply because it's easier to write down). But the others look okay.
There's no universal agreement in the world on all these names, so the very point of that table is to find some common ground and a foundation. While it's probably a very long-term goal, what I propose now is to:
Fixed the issue in the JS runtime (https://github.com/kaitai-io/kaitai_struct_javascript_runtime/commit/dd658ee3f3a90ec562c13846dc914b56d1fed617) and released a new npm+GitHub version (`0.8.0-SNAPSHOT.11`).
Checked the other runtimes and I did not find a similar issue in them (usually they just pass the encoding value to the platform's native text decoder; the Lua runtime already lowercases the encoding before comparing it to "ascii").
From my side this issue can be closed, but we can continue to use it for what @GreyCat just described in the previous comment.
Let's close this one ;) I'll add relevant issues separately.
Currently the JS runtime only allows the `encoding` value to be `ASCII`, but not `ascii`: https://github.com/kaitai-io/kaitai_struct_javascript_runtime/blob/8fedb4d2f278f55d0d8fcf24a9bbf1c1ea2b150f/KaitaiStream.js#L511-L513

I checked the formats repo, and 9 file formats used lowercase `ascii`, at 23 occurrences. Should we fix the formats, or should the runtimes accept lowercase `ascii` too?

(I presume we should fix the runtimes, but maybe also add a KSY styling rule on which casing should be used generally?)