kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.03k stars 197 forks source link

Unconsistent requirements for enum keys when they represented as numbers and as strings #1132

Open Mingun opened 1 month ago

Mingun commented 1 month ago

Kaitais struct allows to define enum key either as number, or as string which value is number. https://github.com/kaitai-io/kaitai_struct_compiler/blob/c23ec2ca88d84042edba76f70c1f003d062b7585/shared/src/main/scala/io/kaitai/struct/format/EnumSpec.scala#L33-L35 Numbers are extracted using ParseUtils.asLong method... https://github.com/kaitai-io/kaitai_struct_compiler/blob/c23ec2ca88d84042edba76f70c1f003d062b7585/shared/src/main/scala/io/kaitai/struct/format/ParseUtils.scala#L186-L194 ...which uses Utils.strToLong which uses simple regular expressions to detect decimal and hexadecimal numbers: https://github.com/kaitai-io/kaitai_struct_compiler/blob/c23ec2ca88d84042edba76f70c1f003d062b7585/shared/src/main/scala/io/kaitai/struct/Utils.scala#L29-L34

^(-?[0-9]+)$
^0x([0-9a-fA-F]+)$

That means that you cannot:

All of that creates inconsistence between literal numbers and literal strings in YAML markup:

meta:
  id: enums
enums:
  enum:
    1_1: dec_int
    # '1_1': dec_string # error: unable to parse `1_1` as int
    0x1_1: hex_int
    # '0x1_1': hex_string # error: unable to parse `0x1_1` as int
    # 0X1: hex_uppercase # error: unable to parse `0X1` as int

For consistency it would be better reuse either YAML int parser, or int parser from expression language.

Also note, that both YAML parser and regular expression allows leading zeroes for decimal numbers, but expression language not: https://github.com/kaitai-io/kaitai_struct_compiler/blob/c23ec2ca88d84042edba76f70c1f003d062b7585/shared/src/main/scala/io/kaitai/struct/exprlang/Lexical.scala#L86 This inconsistency probably also should be fixed.

448 may be related (it is possible to solve both problems at once).

generalmimon commented 1 month ago

@Mingun Maybe if you actually read the comment inside the asLong method implementation, you would understand what is the only reason it uses Utils.strToLong and you wouldn't have to open issues just to show how confused you are all the time - ParseUtils.scala:180-194:

  def asLong(src: Any, path: List[String]): Long = {
    src match {
      case n: Long =>
        n
      case n: Int =>
        n
      case str: String =>
        // Generally should not happen, but when the data comes from JavaScript,
        // all object keys are forced to be strings.
        try {
          Utils.strToLong(str)
        } catch {
          case ex: MatchError =>
            throw KSYParseError.withText(s"unable to parse `$str` as int", path)
        }

OK, let me clarify. We never wanted to allow string keys in enums, but since the parsed YAML document that we get from an external YAML parser in JavaScript will be a JavaScript object, and JavaScript objects only allow strings and symbols as keys, enum integer IDs will be stringified, and there's nothing we can do about that - the compiler must be ready for that. But since the YAML parser will actually interpret them and then convert them to string, we know that we can only expect strings as returned by the Number.prototype.toString() method. Which means that the compiler will only ever encounter a decimal number without any underscores _ as visual separators, never hex number or octal number (these have been all converted to the canonical decimal number by the YAML parser). So we could actually simplify Utils.strToLong(str) to str.toLong (which only accepts decimal numbers, see Long.parseLong(String s)), and the functionality that is actually supposed to work would still work.

Yes, this has the side effect that it technically allows you to use actual strings as keys (which prevents the YAML parser from interpreting it as an integer). So what? How does this limit anyone when writing .ksy specs? I don't think it does, because the chance that someone would accidentally wrap enum integer IDs in quotes like '2': apple seems to be very close to zero.

Mingun commented 1 month ago

I don't understand what JavaScript has to do with it. Isn't SnakeYAML used for parsing? String on input -> compiler parses using it's own understanding of YAML rules?

As you known, I'm writing an alternative KS compiler. I tries to be as much compatible with existing implementation as I can. That means writing a lot of tests of corner cases, because compatible cannot mean "compatible here and there, but not here". I want to create drop-in replacement (in "compatible" mode). When I try to implement some things in a most straightforward way I find, that original compiler behaves differently. That's need explanation, because I do not want just write code without comments, since this vicious practice has already borne fruit in this project.

Which means that the compiler will only ever encounter a decimal number without any underscores _ as visual separators, never hex number or octal number (these have been all converted to the canonical decimal number by the YAML parser).

This does not explaining why then parser tries to extract HEX numbers from strings or even uses regexps to ensure integers. Wasn't it easier to call Long.valueOf right away? It is, because you immediately came to this same conclusion.

Yes, this has the side effect that it technically allows you to use actual strings as keys (which prevents the YAML parser from interpreting it as an integer). So what? How does this limit anyone when writing .ksy specs? I don't think it does, because the chance that someone would accidentally wrap enum integer IDs in quotes like '2': apple seems to be very close to zero.

The probability is higher that you think: https://github.com/kaitai-io/kaitai_struct_formats/blob/e533644618670a3b94be687253b0bf11a051640c/scientific/nt_mdt/nt_mdt.ksy#L650-L664

Actually, I started investigating this issue, because my compiler was stricter in that case and could not parse those keys.


So what the expected behavior? What I should test in my tests? Usage of string keys should be discouraged, need to fix all .ksy where they are used?

generalmimon commented 1 month ago

I don't understand what JavaScript has to do with it. Isn't SnakeYAML used for parsing?

SnakeYAML is a Java library, so it only runs on the JVM platform. Theferore, only the JVM build of KSC can use it. With the JavaScript build, you have to use a JavaScript library to do the YAML parsing.

String on input -> compiler parses using it's own understanding of YAML rules?

This is only true for the part of KSC code that is JVM-specific (i.e. the jvm/src/... directory in KSC, more specifically, jvm/src/main/scala/io/kaitai/struct/formats/JavaKSYParser.scala:76 in this case). The JavaScript-specific part of KSC, on the other hand (js/src/main/scala/io/kaitai/struct/format/JavaScriptKSYParser.scala:18), doesn't handle the YAML parsing at all and expects the caller to handle the YAML parsing using any library they want.

Until recently, the Web IDE used (and the stable version at https://ide.kaitai.io/ still does) an abandoned shitty YAML parsing library https\://github.com/jeremyfa/yaml.js (I don't even link to it, don't go there). It had a number of very noticeable and annoying bugs (I explained all of which I knew about in https://github.com/kaitai-io/kaitai_struct_webide/issues/165), so https://ide.kaitai.io/devel/ already uses https://github.com/nodeca/js-yaml, which works much better (perhaps even better than SnakeYAML - I really like its error messages).

That's need explanation, because I do not want just write code without comments, since this vicious practice has already borne fruit in this project.

Well, I've seen a number of times that if there are comments in the code that explain exactly what you're confused about, you just skip them or don't notice them. Then even the best comments in the world are useless.

This does not explaining why then parser tries to extract HEX numbers from strings or even uses regexps to ensure integers. Wasn't it easier to call Long.valueOf right away? It is, because you immediately came to this same conclusion.

It happens from time to time that the code you see isn't written in the best and simplest possible way. This often happens when something is refactored or an existing pattern is used and no thought is given to whether something simpler could be used. It happens to the best of us.

Usage of string keys should be discouraged, need to fix all .ksy where they are used?

Yes.

generalmimon commented 1 month ago

The probability is higher that you think: https://github.com/kaitai-io/kaitai_struct_formats/blob/e533644618670a3b94be687253b0bf11a051640c/scientific/nt_mdt/nt_mdt.ksy#L650-L664

Of course it had to be @KOLANICH, who else :) But it's 7 years ago to be fair.

Mingun commented 1 month ago

Is it needed to keep special handling for JS at all? It's 2024, Map is widely available:

This feature is well established and works across many devices and browser versions. It’s been available across browsers since July 2015.

Probably modern YAML parsing library already uses it for storing mappings?

Mingun commented 1 month ago

For the reference: there is also one .ksy with underscores in numbers, which should be prohibited in YAML 1.2 (which is js-yaml implements), but it allows them right now:

https://github.com/kaitai-io/kaitai_struct_formats/blob/e533644618670a3b94be687253b0bf11a051640c/game/renderware_binary_stream.ksy#L419-L434

generalmimon commented 1 month ago

Probably modern YAML parsing library already uses it for storing mappings?

I know https://github.com/eemeli/yaml offers to read into a Map, but not by default. https://github.com/nodeca/js-yaml is in my opinion better for our use case than https://github.com/eemeli/yaml (I explained why in detail in https://github.com/kaitai-io/kaitai_struct_webide/issues/165#issuecomment-1925843709), but it only reads into plain objects. But in my view, this whole "problem" that we're discussing here is a trifle that hardly matters. We'll be able to fix it easily with https://github.com/kaitai-io/kaitai_struct/issues/229, but right now there's no point in jumping through hoops so that the compiler handles this very niche case consistently. It just isn't worth it.

For the reference: there is also one .ksy with underscores in numbers, which should be prohibited in YAML 1.2 (which is js-yaml implements), but it allows them right now:

Yeah, and I like this. I also discuss this in https://github.com/kaitai-io/kaitai_struct_webide/issues/165#issuecomment-1925843709. I don't think either 100% YAML 1.1 compliant mode or 100% YAML 1.2 compliant mode makes sense, and I'm pretty happy with the compromise that https://github.com/nodeca/js-yaml offers.

Also, https://bitbucket.org/snakeyaml/snakeyaml mostly implements YAML 1.1, but not where it doesn't make sense (for example, it doesn't parse y / n as true / false, but "y" / "n", which is a deviation from the specification, but very reasonable in practice).


To be honest, I believe the final solution to the annoying differences of YAML 1.1 and YAML 1.2 will be fully adopting the YAML 1.2 failsafe schema, as I described in https://github.com/kaitai-io/kaitai_struct_webide/issues/165#issuecomment-1932229231. The failsafe schema means that the YAML parsers will only provide mappings, sequences and strings, and will never try to interpret certain strings as boolean / integer / float / null regardless of whether they are quoted or not, so we'll be able to process everything consistently in KSC ourselves. But this kind of discussion belongs more to https://github.com/kaitai-io/kaitai_struct/issues/229 again.

This will finally fix some edge cases that we're currently not able to fix properly, but I don't think it will make a huge difference to the end user. I believe that the behavior of https://ide.kaitai.io/devel/ using js-yaml and the command-line KSC using SnakeYAML is already pretty well compatible right now, except for some very marginal cases (such as different yes / on handling, off the top of my head).