kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Expression language string method substring behavior #1021

Open wader opened 1 year ago

wader commented 1 year ago

Hi, i noticed this difference while testing things:

meta:
  id: substring_diff
instances:
  a:
    value: '"hello".substring(4,1)'

ksdump (ruby):

$ docker run -v "$PWD:/share" -it --entrypoint=ksdump kaitai/ksv -f json /dev/zero substring_diff.ksy | tail +5
{
  "a": ""
}

webide (javascript):

{
  "a": "ell"
}

Haven't looked deeper but i guess the behavior comes from how the JavaScript string substring method works.

Is there a preferred kaitai behavior?
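To make the guess about JavaScript concrete, here is a small Python sketch of the two behaviors. `js_style_substring` is a hypothetical helper that emulates how JavaScript's `String.prototype.substring` treats integer arguments (clamp to the string bounds, then swap if start > end). It is only an illustration, not code from any Kaitai runtime.

```python
def js_style_substring(s: str, start: int, end: int) -> str:
    """Emulate JavaScript's substring() for integer arguments (illustrative only)."""
    # JavaScript clamps both indices into the range [0, len(s)].
    start = max(0, min(start, len(s)))
    end = max(0, min(end, len(s)))
    # JavaScript swaps the two arguments when start > end.
    if start > end:
        start, end = end, start
    return s[start:end]


# With swapping, (4, 1) is treated as (1, 4).
print(js_style_substring("hello", 4, 1))  # ell

# A plain slice with from > to yields an empty string instead.
print(repr("hello"[4:1]))  # ''
```

So the "ell" result is what you get from swapping the arguments, while a slice-like implementation gives the empty string, matching the ksdump output above.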

generalmimon commented 1 year ago

@wader:

instances:
  a:
    value: '"hello".substring(4,1)'

This is pretty much undefined behavior right now. As you can see in https://doc.kaitai.io/user_guide.html#str-methods, the str.substring() method expects two arguments - from and to:

| Method name | Return type | Description |
| --- | --- | --- |
| substring(from, to) | String | Extracts a portion of a string between character at offset from and character at offset to - 1 (including from, excluding to) |

And it's implicitly assumed that from <= to (from == to gives you an empty string ""). The from > to case was unfortunately not thought of, so it's not very surprising to me that there are differences across target languages, because each language defines its own behavior in this case and KS doesn't make any attempt to standardize this so far.

But I agree with unifying this. The idea of KS is indeed that all parsers generated from a .ksy spec should behave the same in all cases. To achieve that, it's sometimes necessary to overcome the differences between the languages, sometimes by providing a custom implementation of certain operations in the runtime library (actually, this is one of the main goals of the runtime library: to provide a standard API regardless of the language specifics).

For substring(from, to) in the case of from > to, I think it makes sense to return an empty string "" (as in the from == to case).
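A minimal sketch of what such a unified rule could look like, written in Python. The name `ks_substring` is hypothetical and not part of any existing Kaitai Struct runtime. It only covers the from > to case discussed here.

```python
def ks_substring(s: str, from_: int, to: int) -> str:
    """Hypothetical runtime helper: substring(from, to) with from > to defined as ""."""
    # from == to already yields "" per the user guide; treat from > to the same way.
    if from_ >= to:
        return ""
    # Negative or out-of-range offsets are not discussed in this thread
    # and are deliberately left unhandled in this sketch.
    return s[from_:to]


print(repr(ks_substring("hello", 4, 1)))  # ''
print(repr(ks_substring("hello", 1, 4)))  # 'ell'
```

Each target's runtime would need an equivalent guard so that generated parsers agree regardless of what the language's native substring operation does.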

This issue is quite similar in nature to https://github.com/kaitai-io/kaitai_struct/issues/746 - integer division also behaves differently across targets when the result is negative.
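As a general illustration of why integer division can differ across languages (not taken from the linked issue), some languages round the quotient toward negative infinity while others truncate toward zero. Python's `//` does the former; `trunc_div` below is a hypothetical helper emulating the latter.

```python
def floor_div(a: int, b: int) -> int:
    """Quotient rounded toward negative infinity (Python's native // behavior)."""
    return a // b


def trunc_div(a: int, b: int) -> int:
    """Quotient truncated toward zero, as in C or Java integer division."""
    q = abs(a) // abs(b)
    # The result is negative only when the operand signs differ.
    return q if (a < 0) == (b < 0) else -q


print(floor_div(-7, 2))  # -4
print(trunc_div(-7, 2))  # -3
print(floor_div(7, 2), trunc_div(7, 2))  # 3 3
```

The two only disagree when the exact result is negative and not a whole number, which is why such differences are easy to miss.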

generalmimon commented 1 year ago

@wader Unrelated: GitHub has quite good syntax highlighting for code blocks, but you need to specify the language. For your comment here (https://github.com/kaitai-io/kaitai_struct/issues/1021#issue-1664302708), it would be ```ksy (it has an entry in github/linguist, so it's recognized by GitHub out of the box and the .ksy files on GitHub are also automatically highlighted as YAML thanks to that), ```console and ```json.

wader commented 1 year ago

> And it's implicitly assumed that from <= to (from == to gives you an empty string ""). The from > to case was unfortunately not thought of, so it's not very surprising to me that there are differences across target languages, because each language defines its own behavior in this case and KS doesn't make any attempt to standardize this so far.
>
> But I agree with unifying this. The idea of KS is indeed that all parsers generated from a .ksy spec should behave the same in all cases. To achieve that, it's sometimes necessary to overcome the differences between the languages, sometimes by providing a custom implementation of certain operations in the runtime library (actually, this is one of the main goals of the runtime library: to provide a standard API regardless of the language specifics).
>
> For substring(from, to) in the case of from > to, I think it makes sense to return an empty string "" (as in the from == to case).
>
> This issue is quite similar in nature to #746 - integer division also behaves differently across targets when the result is negative.

👍 yeah i think KS would benefit from having as few undefined behaviors as possible. I'm not sure how people usually use kaitai, but maybe most generate to one language so don't notice differences much?

I also found a difference for <string>.to_i when there is trailing garbage. If i remember correctly js just ignores it, but go and maybe some others fail. Should I create a new issue for that?
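A rough Python sketch of the two behaviors described here, assuming "ignores" means parsing the leading digits and dropping the rest (as JavaScript's `parseInt` does). Both helper names are hypothetical and neither is taken from a Kaitai runtime.

```python
import re


def strict_to_i(s: str) -> int:
    """Fail on trailing garbage: int() raises ValueError for input like "123abc"."""
    return int(s, 10)


def lenient_to_i(s: str) -> int:
    """Parse the leading integer and ignore whatever follows it."""
    m = re.match(r"\s*[+-]?\d+", s)
    if m is None:
        raise ValueError(f"no leading integer in {s!r}")
    return int(m.group(0))


print(lenient_to_i("123abc"))  # 123

try:
    strict_to_i("123abc")
except ValueError as e:
    print("strict parse failed:", e)
```

Whichever behavior is chosen, the same input would have to either parse or fail consistently in every target.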

wader commented 1 year ago

> @wader Unrelated: GitHub has quite good syntax highlighting for code blocks, but you need to specify the language. For your comment here (#1021 (comment)), it would be ```ksy (it has an entry in github/linguist, so it's recognized by GitHub out of the box and the .ksy files on GitHub are also automatically highlighted as YAML thanks to that), ```console and ```json.

Aha, didn't know there was ksy support, nice. Yep, i try to use highlighting but forget sometimes. i actually added jq support to github linguist some time ago :)

def woho: 1+2;

generalmimon commented 1 year ago

@wader:

> I'm not sure how people usually use kaitai but maybe most generate to one language so don't notice differences much?

Yes, I think so. All targets find their users, but most are only focused on one language (or possibly GraphViz + one programming language), so however the KS-generated parser in that language behaves, they think "that's how Kaitai works" I guess.

Which harms the idea of .ksy specs being language-agnostic of course, because other users may encounter issues when trying to use a .ksy spec in another language.

generalmimon commented 1 year ago

@wader:

> I also found a difference for <string>.to_i when there is trailing garbage. (...) Should I create a new issue for that?

Yes, please, much appreciated ❤️