Expression language <string>.to_i with trailing garbage target differences

meta:
  id: to_i_trailing_garbage
instances:
  a:
    value: '"10abc".to_i'

a will be 10 for ruby and js but for example golang will use strconv.ParseInt("10abc", 10, 0) which will fail, see https://go.dev/play/p/47brswY7C1x

I assume we'd want to always be strict and raise a conversion error in case there's any characters we can't comprehend? Any objections, anyone?

New test expr_to_i_trailing was added, and some effort was put into bringing languages into converged on this behavior.

As of now, the situation is as follows:

C++98/C++11: pass
Java: mixed (spec for Java 7 is broken as @generalmimon mentioned)
Python: pass
Ruby: failed (should use Integer(x) conversion instead of x.to_i to actually raise the error)
everything else: spec not yet implemented, need KST generator upgrade

This actually opens a huge can of worms in relation to (1) what do we perceive as "normal" integer to be parsed, (2) how far do we want to go enforcing strictness.

Format

Even for a thing as simple as integer, open questions are:

Do we want to allow leading +?
Do we want to allow leading 0? Multiple leading 0?
Do we want to support leading 0 as a sign that it's octal and interpret number as octal then?
- If number is unrepresentable as octal (e.g. 0999) do we treat it as decimal or treat it as invalid?
Do we want to allow leading 0x and treat it as hex?
Do we want to allow leading 0o and treat it as oct?
Do we want to allow leading 0b and treat it as binary?
What about the same — octal, hex, binary notation — with - at the beginning?
Do we want to allow floating point format and then just round/truncate/floor/ceil/etc to integer?

For reference, here's how languages handle it:

JS: Number(s)
Python: int(s)
Ruby: Integer(s)

What	JS	Python	Ruby
`+123`	123	123	123
`0123`	123	123	83
`000123`	123	123	83
`0x123`	291	Err	291
`0o123`	83	Err	83
`-0123`	-123	-123	-83
`-000123`	-123	-123	-83
`-0x123`	Err	Err	-291
`-0o123`	Err	Err	-83
`123.456`	123.456	Err	Err
`-123.4e5`	-12340000	Err	Err

Strictness enforcement

Once we'll decide on the format, it looks like quite a few languages will be performing conversion differently. If we really strive for strictness, this means we'll be doing extra checks manually (e.g. with a regexp). Any extra check if slow. Do we want to enforce it anyway, or provide an option to turn off strict checks?

@GreyCat:

JS: Number(s)

This is not a good way to parse an integer from string in JavaScript (for example, it doesn't distinguish floats from integers). Currently we use Number.parseInt(), see JavaScriptTranslator.scala:74-75:

  override def strToInt(s: expr, base: expr): String =
    s"Number.parseInt(${translate(s)}, ${translate(base)})"

which is probably the "best" built-in way in JavaScript. It also allows to choose whether we want to recognize other radixes than 10 by specifying or not specifying the radix parameter. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt#parameters:

radix (Optional)

An integer between 2 and 36 that represents the radix (the base in mathematical numeral systems) of the string. It is converted to a [32-bit integer](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number#fixed-width_number_conversion); if it's nonzero and outside the range of [2, 36] after conversion, the function will always return NaN. **If 0 or not provided, the radix will be inferred based on string's value. Be careful — this does not always default to 10! The [description below](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt#description) explains in more detail what happens when radix is not provided.**

For example:

console.log(parseInt('-0x100')); // -256
console.log(parseInt('-0x100', 10)); // -0

parseInt() doesn't solve all our problems (for example, it specifically allows trailing garbage), but it's better than Number().

Neither of them are ideal, and parseInt() has its own set of problems — e.g. literally lack of any bulletproof way to distinguish between good or failed conversion except for checking for 0.

I guess what this boils down to, is that for JS we'll need to introduce pre-conversion check (likely using regexp), which is expensive.

Nice to see some progress on this, here is a go test program.

Also thinking what range is safe/fesable to support? int64? but that might be a problem with js that is 53 bit int safe i think?

$ cat main.go
package main

import (
    "strconv"
)

func main() {
    ss := []string{
        "+123",
        "0123",
        "000123",
        "0x123",
        "0o123",
        "-0123",
        "-000123",
        "-0x123",
        "-0o123",
        "123.456",
        "-123.4e5",
    }
    for _, s := range ss {
        print(s, " ")
        if n, err := strconv.ParseInt(s, 0, 64); err != nil {
            print("Err")
        } else {
            print(strconv.Itoa(int(n)))
        }
        print("\n")
    }
}
$ go run main.go
+123 123
0123 83
000123 83
0x123 291
0o123 83
-0123 -83
-000123 -83
-0x123 -291
-0o123 -83
123.456 Err
-123.4e5 Err

@GreyCat My personal answers to the questions in https://github.com/kaitai-io/kaitai_struct/issues/1023#issuecomment-1651349025 would be:

Do we want to allow leading +?

No strong opinions. Given that the target languages generally seem to allow it, perhaps we can allow it too, but we should probably first look how some real formats that would benefit from the to_i method are specified.

Do we want to allow leading 0? Multiple leading 0?

Let me answer this later in this comment.

Do we want to support leading 0 as a sign that it's octal and interpret number as octal then?

Absolutely not, never. There must be a reason why even languages that originally did that stopped doing it. Take JavaScript as an example - as you see in the parseInt() documentation, this function originally did this, but later it was changed to parse 0-prefixed numbers as decimal (which I don't think is best either, but at least you can see that treating 0-prefixed numbers as octal by default seems to be perceived as a legacy thing now) - see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt#Browser_compatibility:

	Chrome	Edge	Firefox	Opera	Safari	Chrome Android	Firefox for Android	Opera Android	Safari on iOS	Samsung Internet	WebView Android	Deno	Node.js
`parseInt`	1	12	1	3	1	18	4	10.1	1	1.0	4.4	1.0	0.10.0
Parses leading-zero strings as decimal, not octal	23	12	21	15	6	25	21	14	6	1.5	4.4	1.0	0.10.0

Also BTW in JavaScript, 0-prefixed numeric literals in code are recognized as octal in the default "sloppy mode", but it's recommended to opt into strict mode that specifically forbids using 0-prefixed integer literals in the code, and I think there were good reasons for this decision.

Do we want to allow leading 0x and treat it as hex?

Do we want to allow leading 0o and treat it as oct?

Do we want to allow leading 0b and treat it as binary?

I think we shouldn't forget that the to_i method which we're talking about here will be used in Kaitai Struct specs for parsing machine-readable formats where the number fields have a well-defined format, so I don't think it makes sense to make to_i too permissive by default when the real formats are not.

I suggest taking inspiration from Python's int() function, which by default accepts only decimal numbers, and any complex 0x/0o/0b prefix handling is opt-in, not opt-out, which I think is a very good decision. As you can see from the signature:

class int(x=0) class int(x, base=10)

... the default base is 10 if you don't specify otherwise. This means that int('0x7f') is the same as int('0x7f', base=10), so only decimal base will be recognized, and this is also reflected in the error message (note the "with base 10"):

>>> int('0x7f')
ValueError: invalid literal for int() with base 10: '0x7f'

If you want to accept integers in 2, 8, 10, or 16 bases according to their prefix (0b, 0o, none or 0x), you can opt in to this behavior by specifying base=0, see https://docs.python.org/3/library/functions.html#int:

For base 0, the string is interpreted in a similar way to an integer literal in code, in that the actual base is 2, 8, 10, or 16 as determined by the prefix.

So I also suggest to make .to_i without any arguments in Kaitai Struct to accept strictly decimal numbers (I think it's best to be strict by default), so it should work the same as .to_i(10). Some formats require different number base than decimal - for example, tar uses octal numbers, so this should normally be specified as .to_i(8).

Only if we recognize some serious use cases in context of Kaitai Struct for the permissive method that would automatically detect the number base based on the 0b/0o/0x prefix (frankly I haven't seen a format with textual number fields without specifying that it's strictly decimal / strictly octal, so I'm not sure we even need this, but OTOH I haven't seen that many formats storing numbers as text at all), it should be opt-in - for example, we can take the Python's way and use .to_i(0) for that.

Back to this question:

Do we want to allow leading 0? Multiple leading 0?

For maximum compatibility and to reduce confusion, I'd suggest to handle this the Python way - accept leading 0s (arbitrary number of them) when the number base is fixed and known in advance (either base=10, or any other valid base other than base=0), but disallow them in the permissive mode with base=0 (because in that case it's ambiguous whether it should be treated as decimal or octal, so rather don't try to guess) - see https://docs.python.org/3/library/functions.html#int again:

Base 0 also disallows leading zeros: int('010', 0) is not legal, while int('010') and int('010', 8) are.

Do we want to allow floating point format and then just round/truncate/floor/ceil/etc to integer?

Absolutely not. If anyone wants that, they should explicitly parse the string to a float and then convert it to an integer themselves using the method they want.

@wader:

Also thinking what range is safe/fesable to support? int64? but that might be a problem with js that is 53 bit int safe i think?

I'd say this is a problem for another issue, we can't hope to solve the problem with integer ranges here. We can't fix everything at once - let's focus on the accepted number format of the to_i method in this issue. For starters, KSC internally coerces almost every integer to CalcIntType just after doing any simple operation returning an integer (e.g. adding two integers together), which is often translated to a 32-bit integer in statically typed languages. So even if we decided on what range to_i should work with, it wouldn't mean much in real .ksy files, because it would likely be coerced to / used as a 32-bit integer soon after.

And adding support for 64-bit integers in JavaScript is another completely different issue tracked in https://github.com/kaitai-io/kaitai_struct/issues/183 that we can't hope to solve here.

@GreyCat:

Format

Even for a thing as simple as integer, open questions are:

Some other questions:

Do we want to allow single underscores interspersed between digits?

The int() function in Python accepts them:
```
>>> int('10_000')
10000
```
https://docs.python.org/3/library/functions.html#int:

Optionally, the string can be preceded by + or - (with no space in between), have leading zeros, be surrounded by whitespace, and have single underscores interspersed between digits.

But for example the parseInt() function in JavaScript does not - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt#Using_parseint:
parseInt doesn't work with numeric separators:
```
parseInt("123_456"); // 123
```
Personally, I don't think we should allow them, just saying that we should probably test that the to_i method rejects them with an error in all target languages.
Do we want to allow leading/trailing whitespace?

I think many languages would allow it by default, but some don't - for example, see from_str_radix in Rust:
The string is expected to be an optional + or - sign followed by digits. Leading and trailing whitespace represent an error. Digits are a subset of these characters, depending on radix:
- 0-9
- a-z
- A-Z
We don't support Rust yet (although we will eventually) - that's really just an example of how one language handles it.

I'd prefer to follow Rust here, because the default behavior is strict and doesn't have any hidden surprises, and only if you need more flexibility, you preprocess the string yourself using trim - e.g. str.trim().parse::<i32>() seems to be a very common pattern in Rust, and this is completely fine because everyone can see what it does and what inputs it accepts. Anyone can easily choose how strict they want to be without touching the actual .parse::<i32>() part - you can choose if you only want to trim spaces, or also newlines, tabs, Unicode whitespace, etc., just by replacing .trim() with something else.

The only problem with this approach is that we don't actually have string manipulation functions like trim in Kaitai Struct yet (we only have substring), so the decision to disallow leading/trailing whitespace in to_i would temporarily make it difficult/impossible to properly describe certain formats which have space-padded number text fields (and such definitely exist, for example see the ar format with this number type https://github.com/kaitai-io/kaitai_struct_formats/pull/126/files#diff-ba1c0e0c6d7fbff314f1adef982f86d33cd9fb2d5b880c8d676c1419b3fac30c submitted by @dgelessus).

But still, arguably the most flexible way would be to just add functions like trim to Kaitai Struct to enable such use cases, rather than making to_i directly tolerate whitespace.
Do we want to tolerate redundant 0b/0o/0x prefixes even when the number base is specified (i.e. allow '0b1011'.to_i(2), '0o755'.to_i(8) or '0x7f'.to_i(16) in addition to just '1011'.to_i(2), '755'.to_i(8) and '7f'.to_i(16))?

Again, some languages would allow that - e.g. Python's int() supports all the prefixes ("Base-2, -8, and -16 strings can be optionally prefixed with 0b/0B, 0o/0O, or 0x/0X, as with integer literals in code."), JavaScript's parseInt() would only tolerate 0x/0X like this ("If the radix is 16, parseInt() allows the string to be optionally prefixed by 0x or 0X after the optional sign character (+/-)."). On the other hand, the from_str_radix only accepts valid digits for the given number base and has no concept of prefixes.

My suggestion would be to follow Rust here, i.e. keep it simple and require all characters (after the optional +/- sign) to be valid digits for the given base (which implicitly means that any prefixes are unrecognized and rejected because of invalid digits).

If anyone really wants to accept e.g. a 0x prefix for a integer field that's known to be hex (which I'm not sure any actual format requires), they could do that relatively easily with the substring method that we already have - for example, whereas s.trim.to_i(16) would not accept a 0x prefix, one could do conceptually something like this: (s.trim.substring(0, 2) == '0x' ? s.trim.substring(2, s.trim.length) : s.trim).to_i(16) (this doesn't account for the possible presence of the +/- sign, but you get the idea)

Also let me comment on this:

If we really strive for strictness, this means we'll be doing extra checks manually (e.g. with a regexp). Any extra check if slow. Do we want to enforce it anyway, or provide an option to turn off strict checks?

Sounds a lot like premature optimization to me. One language that never hesitates to compromise correctness for allegedly better performance is C (array bounds checks are slow and nobody wants them anyway, right?!), and I would strongly dissuade from shifting Kaitai Struct in that direction.

In my opinion, we have to put correctness first, and only once we identify a serious problem in performance (which usually only exists somewhere we don't expect, and the only way to find the actual pain points is to measure some real application of Kaitai Struct), we can think about fixing it, but that must not compromise correctness.

So yeah, I do think any additional extra checks we need to do to achieve correctness and ensure consistent behavior among languages are totally worth it. Of course there are always more ways to implement something, so if we find out that in some languages, checking with regexes is unbearably slow, for example, and there are faster ways to do it, we can use that faster way, as long as correctness is maintained.

kaitai-io / kaitai_struct