kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.03k stars 197 forks source link

Expression language <string>.to_i with trailing garbage target differences #1023

Open wader opened 1 year ago

wader commented 1 year ago
meta:
  id: to_i_trailing_garbage
instances:
  a:
    value: '"10abc".to_i'

a will be 10 for ruby and js but for example golang will use strconv.ParseInt("10abc", 10, 0) which will fail, see https://go.dev/play/p/47brswY7C1x

GreyCat commented 1 year ago

I assume we'd want to always be strict and raise a conversion error in case there's any characters we can't comprehend? Any objections, anyone?

GreyCat commented 1 year ago

New test expr_to_i_trailing was added, and some effort was put into bringing languages into converged on this behavior.

As of now, the situation is as follows:

GreyCat commented 1 year ago

This actually opens a huge can of worms in relation to (1) what do we perceive as "normal" integer to be parsed, (2) how far do we want to go enforcing strictness.

Format

Even for a thing as simple as integer, open questions are:

For reference, here's how languages handle it:

What JS Python Ruby
+123 123 123 123
0123 123 123 83
000123 123 123 83
0x123 291 Err 291
0o123 83 Err 83
-0123 -123 -123 -83
-000123 -123 -123 -83
-0x123 Err Err -291
-0o123 Err Err -83
123.456 123.456 Err Err
-123.4e5 -12340000 Err Err

Strictness enforcement

Once we'll decide on the format, it looks like quite a few languages will be performing conversion differently. If we really strive for strictness, this means we'll be doing extra checks manually (e.g. with a regexp). Any extra check if slow. Do we want to enforce it anyway, or provide an option to turn off strict checks?

generalmimon commented 1 year ago

@GreyCat:

  • JS: Number(s)

This is not a good way to parse an integer from string in JavaScript (for example, it doesn't distinguish floats from integers). Currently we use Number.parseInt(), see JavaScriptTranslator.scala:74-75:

  override def strToInt(s: expr, base: expr): String =
    s"Number.parseInt(${translate(s)}, ${translate(base)})"

which is probably the "best" built-in way in JavaScript. It also allows to choose whether we want to recognize other radixes than 10 by specifying or not specifying the radix parameter. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt#parameters:

radix (Optional)
An integer between 2 and 36 that represents the radix (the base in mathematical numeral systems) of the string. It is converted to a [32-bit integer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number#fixed-width_number_conversion); if it's nonzero and outside the range of [2, 36] after conversion, the function will always return NaN. **If 0 or not provided, the radix will be inferred based on string's value. Be careful — this does not always default to 10! The [description below](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt#description) explains in more detail what happens when radix is not provided.**

For example:

console.log(parseInt('-0x100')); // -256
console.log(parseInt('-0x100', 10)); // -0

parseInt() doesn't solve all our problems (for example, it specifically allows trailing garbage), but it's better than Number().

GreyCat commented 1 year ago

Neither of them are ideal, and parseInt() has its own set of problems — e.g. literally lack of any bulletproof way to distinguish between good or failed conversion except for checking for 0.

I guess what this boils down to, is that for JS we'll need to introduce pre-conversion check (likely using regexp), which is expensive.

wader commented 1 year ago

Nice to see some progress on this, here is a go test program.

Also thinking what range is safe/fesable to support? int64? but that might be a problem with js that is 53 bit int safe i think?

$ cat main.go
package main

import (
    "strconv"
)

func main() {
    ss := []string{
        "+123",
        "0123",
        "000123",
        "0x123",
        "0o123",
        "-0123",
        "-000123",
        "-0x123",
        "-0o123",
        "123.456",
        "-123.4e5",
    }
    for _, s := range ss {
        print(s, " ")
        if n, err := strconv.ParseInt(s, 0, 64); err != nil {
            print("Err")
        } else {
            print(strconv.Itoa(int(n)))
        }
        print("\n")
    }
}
$ go run main.go
+123 123
0123 83
000123 83
0x123 291
0o123 83
-0123 -83
-000123 -83
-0x123 -291
-0o123 -83
123.456 Err
-123.4e5 Err
generalmimon commented 1 year ago

@GreyCat My personal answers to the questions in https://github.com/kaitai-io/kaitai_struct/issues/1023#issuecomment-1651349025 would be:

  • Do we want to allow leading +?

No strong opinions. Given that the target languages generally seem to allow it, perhaps we can allow it too, but we should probably first look how some real formats that would benefit from the to_i method are specified.

  • Do we want to allow leading 0? Multiple leading 0?

Let me answer this later in this comment.

  • Do we want to support leading 0 as a sign that it's octal and interpret number as octal then?

Absolutely not, never. There must be a reason why even languages that originally did that stopped doing it. Take JavaScript as an example - as you see in the parseInt() documentation, this function originally did this, but later it was changed to parse 0-prefixed numbers as decimal (which I don't think is best either, but at least you can see that treating 0-prefixed numbers as octal by default seems to be perceived as a legacy thing now) - see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt#Browser_compatibility:

  Chrome Edge Firefox Opera Safari Chrome Android Firefox for Android Opera Android Safari on iOS Samsung Internet WebView Android Deno Node.js
parseInt 1 12 1 3 1 18 4 10.1 1 1.0 4.4 1.0 0.10.0
Parses leading-zero strings as decimal, not octal 23 12 21 15 6 25 21 14 6 1.5 4.4 1.0 0.10.0

Also BTW in JavaScript, 0-prefixed numeric literals in code are recognized as octal in the default "sloppy mode", but it's recommended to opt into strict mode that specifically forbids using 0-prefixed integer literals in the code, and I think there were good reasons for this decision.

  • Do we want to allow leading 0x and treat it as hex?
  • Do we want to allow leading 0o and treat it as oct?
  • Do we want to allow leading 0b and treat it as binary?

I think we shouldn't forget that the to_i method which we're talking about here will be used in Kaitai Struct specs for parsing machine-readable formats where the number fields have a well-defined format, so I don't think it makes sense to make to_i too permissive by default when the real formats are not.

I suggest taking inspiration from Python's int() function, which by default accepts only decimal numbers, and any complex 0x/0o/0b prefix handling is opt-in, not opt-out, which I think is a very good decision. As you can see from the signature:

class int(x=0) class int(x, base=10)

... the default base is 10 if you don't specify otherwise. This means that int('0x7f') is the same as int('0x7f', base=10), so only decimal base will be recognized, and this is also reflected in the error message (note the "with base 10"):

>>> int('0x7f')
ValueError: invalid literal for int() with base 10: '0x7f'

If you want to accept integers in 2, 8, 10, or 16 bases according to their prefix (0b, 0o, none or 0x), you can opt in to this behavior by specifying base=0, see https://docs.python.org/3/library/functions.html#int:

For base 0, the string is interpreted in a similar way to an integer literal in code, in that the actual base is 2, 8, 10, or 16 as determined by the prefix.

So I also suggest to make .to_i without any arguments in Kaitai Struct to accept strictly decimal numbers (I think it's best to be strict by default), so it should work the same as .to_i(10). Some formats require different number base than decimal - for example, tar uses octal numbers, so this should normally be specified as .to_i(8).

Only if we recognize some serious use cases in context of Kaitai Struct for the permissive method that would automatically detect the number base based on the 0b/0o/0x prefix (frankly I haven't seen a format with textual number fields without specifying that it's strictly decimal / strictly octal, so I'm not sure we even need this, but OTOH I haven't seen that many formats storing numbers as text at all), it should be opt-in - for example, we can take the Python's way and use .to_i(0) for that.

Back to this question:

  • Do we want to allow leading 0? Multiple leading 0?

For maximum compatibility and to reduce confusion, I'd suggest to handle this the Python way - accept leading 0s (arbitrary number of them) when the number base is fixed and known in advance (either base=10, or any other valid base other than base=0), but disallow them in the permissive mode with base=0 (because in that case it's ambiguous whether it should be treated as decimal or octal, so rather don't try to guess) - see https://docs.python.org/3/library/functions.html#int again:

Base 0 also disallows leading zeros: int('010', 0) is not legal, while int('010') and int('010', 8) are.


  • Do we want to allow floating point format and then just round/truncate/floor/ceil/etc to integer?

Absolutely not. If anyone wants that, they should explicitly parse the string to a float and then convert it to an integer themselves using the method they want.

generalmimon commented 1 year ago

@wader:

Also thinking what range is safe/fesable to support? int64? but that might be a problem with js that is 53 bit int safe i think?

I'd say this is a problem for another issue, we can't hope to solve the problem with integer ranges here. We can't fix everything at once - let's focus on the accepted number format of the to_i method in this issue. For starters, KSC internally coerces almost every integer to CalcIntType just after doing any simple operation returning an integer (e.g. adding two integers together), which is often translated to a 32-bit integer in statically typed languages. So even if we decided on what range to_i should work with, it wouldn't mean much in real .ksy files, because it would likely be coerced to / used as a 32-bit integer soon after.

And adding support for 64-bit integers in JavaScript is another completely different issue tracked in https://github.com/kaitai-io/kaitai_struct/issues/183 that we can't hope to solve here.

generalmimon commented 1 year ago

@GreyCat:

Format

Even for a thing as simple as integer, open questions are:

Some other questions:

generalmimon commented 1 year ago

Also let me comment on this:

If we really strive for strictness, this means we'll be doing extra checks manually (e.g. with a regexp). Any extra check if slow. Do we want to enforce it anyway, or provide an option to turn off strict checks?

Sounds a lot like premature optimization to me. One language that never hesitates to compromise correctness for allegedly better performance is C (array bounds checks are slow and nobody wants them anyway, right?!), and I would strongly dissuade from shifting Kaitai Struct in that direction.

In my opinion, we have to put correctness first, and only once we identify a serious problem in performance (which usually only exists somewhere we don't expect, and the only way to find the actual pain points is to measure some real application of Kaitai Struct), we can think about fixing it, but that must not compromise correctness.

So yeah, I do think any additional extra checks we need to do to achieve correctness and ensure consistent behavior among languages are totally worth it. Of course there are always more ways to implement something, so if we find out that in some languages, checking with regexes is unbearably slow, for example, and there are faster ways to do it, we can use that faster way, as long as correctness is maintained.