kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Built-in ASCII/BCD integer type #666

Open Mingun opened 4 years ago

Mingun commented 4 years ago

I know it is possible to use

types:
  int:
    params:
      - id: len
        type: u1
    seq:
      - id: as_str
        size: len
        type: str
        encoding: ASCII
    instances:
      as_int:
        value: as_str.to_i

but it is ugly. Many financial protocols (and, I think, other protocols) use ASCII-encoded numbers, so it might be useful to have a built-in type to deal with them
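For reference, what this workaround boils down to is roughly the following (a hand-written Python sketch; `read_ascii_int` is a hypothetical helper name, not generated KS code):

```python
import io

def read_ascii_int(stream: io.BytesIO, length: int) -> int:
    # Mirror of the `int(len)` user type above: read `length` bytes,
    # decode them as ASCII, and convert the result to an integer.
    raw = stream.read(length)
    return int(raw.decode("ascii"))
```

For example, `read_ascii_int(io.BytesIO(b"00042"), 5)` evaluates to `42`, just like `as_str.to_i` in the KSY snippet.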

Mingun commented 4 years ago

One benefit of a built-in type is that it can be converted to an enum. For the user type shown above, this is not possible

dgelessus commented 4 years ago

For reference, a couple of formats that I've implemented in KS use ASCII numbers, such as serialized PHP values and Unix ar archives.

I think there might be too much variation in how formats use ASCII numbers to support them as a native primitive type in KS. I'm not familiar with financial data formats, but I assume most of them use oldschool fixed-width decimal number fields. That's not the case for all formats though - for example the PHP serialization format I linked above uses variable-length ASCII decimal numbers (terminated using : or ;), the ar format uses fixed-width fields but right-padded with spaces (so KS actually parses them like variable-width fields terminated by a space), and some of the ar format number fields are octal rather than decimal.
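To make the variation concrete, here is how each of those field styles decodes once the raw bytes are in hand (Python sketches with made-up helper names, assuming the field's bytes have already been sliced out of the stream):

```python
# Sketches of the ASCII-number variants mentioned above.

def fixed_width_decimal(raw: bytes) -> int:
    # Old-school fixed-width decimal field, e.g. b"00042" -> 42
    return int(raw.decode("ascii"))

def terminated_decimal(raw: bytes, terminator: bytes = b":") -> int:
    # Variable-length field ended by a terminator, as in PHP serialization:
    # b"42:..." -> 42
    return int(raw.split(terminator, 1)[0].decode("ascii"))

def space_padded_octal(raw: bytes) -> int:
    # Fixed-width, right-padded with spaces, octal digits, as in ar headers:
    # b"644    " -> 0o644
    return int(raw.rstrip(b" ").decode("ascii"), 8)
```

Each variant differs only in a pre-processing step before the numeric conversion, which is exactly why a single built-in type is hard to design.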

It would be fairly complicated to create a new native primitive type that can express all of these different kinds of ASCII number fields. It would also be a bit redundant, since there is already a good way to express these fields in KS (as you've shown, using the existing string field parsing features and to_i).

One benefit of a built-in type is that it can be converted to an enum. For the user type shown above, this is not possible

If this is the main motivation behind this issue, I think it would make more sense to support enums on value instances in general - that would be more flexible than adding special support for just ASCII numbers.

Mingun commented 4 years ago

No, the main claim is that we have an internal structure where it's not needed. ksv, instead of showing the number after the field, forces us to descend 1-2 levels to see the actual value. When there are quite a few such fields, the visualization becomes complicated -- it is much easier to look at raw data than at parsed data

Mingun commented 4 years ago

As you said yourself, many formats have this type of data, so it makes sense to build it in. At the same time, if you make it similar to the str type (which is not really a true type, because without the size and encoding keys it does not describe any particular value), then we can cover most, if not all, cases:

seq:
  - id: number
    doc: Parses strings like 00001, 00002, ... into numbers: 1, 2, ...
    type: num
    size: 5
    radix: 10 # Optional field, default 10. Forbidden if bcd=true
    # Optional field, default false. If true, converts 0x12 0x34 to:
    # - 1234 in endian=be
    # - 3412 in endian=le
    bcd: false
    # endian: be # Required, if bcd=true
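For the bcd: true case, the conversion described in the comment (0x12 0x34 becoming 1234 or 3412 depending on endian) can be sketched as follows -- a speculative reading of the proposal, with `decode_bcd` being a made-up helper name:

```python
def decode_bcd(raw: bytes, big_endian: bool = True) -> int:
    # Packed BCD as described in the comment above:
    # b"\x12\x34" -> 1234 with endian=be, 3412 with endian=le
    # (le is taken to mean byte-wise reversal, per the example).
    if not big_endian:
        raw = raw[::-1]
    value = 0
    for byte in raw:
        hi, lo = byte >> 4, byte & 0x0F
        value = value * 100 + hi * 10 + lo
    return value
```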
KOLANICH commented 4 years ago

#88

generalmimon commented 4 years ago

One benefit of a built-in type is that it can be converted to an enum. For the user type shown above, this is not possible

If this is the main motivation behind this issue, I think it would make more sense to support enums on value instances in general - that would be more flexible than adding special support for just ASCII numbers.

I'd like to point out that it's already possible to use enums on value instances, just try this:

meta:
  id: enum_ascii_num
seq:
  - id: foo
    type: int(2)
types:
  int:
    params:
      - id: len
        type: u1
    seq:
      - id: as_str
        size: len
        type: str
        encoding: ASCII
    instances:
      as_int:
        value: as_str.to_i
        enum: int_values
    enums:
      int_values:
        17: seventeen
        42: forty_two
        71: seventy_one
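The net effect of that value instance with `enum: int_values` corresponds to parsing the ASCII number and then mapping it through an enum -- for example, in Python (`IntValues` and `parse_foo` are illustrative names only):

```python
from enum import IntEnum

class IntValues(IntEnum):
    SEVENTEEN = 17
    FORTY_TWO = 42
    SEVENTY_ONE = 71

def parse_foo(raw: bytes) -> IntValues:
    # Decode the fixed-width ASCII number, then map it through the enum,
    # as the `as_int` value instance with `enum: int_values` does.
    return IntValues(int(raw.decode("ascii")))
```

Note that `IntValues(...)` raises ValueError for unlisted values, which may be stricter than how KS-generated code handles unknown enum values in some target languages.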
Mingun commented 4 years ago

I'd like to point out that it's already possible to use enums on value instances, just try this:

Yes, but that means you need to duplicate the int type for every possible enum type... which makes the type itself useless. It was introduced precisely to avoid duplicating its code every time

Mingun commented 4 years ago

@KOLANICH, while #88 is a generic solution to many problems, from a usability point of view the existence of a special, frequently used type is preferable. For that reason we have almost all the types in the language, although maybe only b1 is really needed...

GreyCat commented 4 years ago

@Mingun

No, the main claim is that we have an internal structure where it's not needed.

The main goal of KS is actually to describe the internal structure of a stream -- not generation of an API, nor establishing a mapping between the internal structure of the stream <=> some representation in memory. So some attributes might look useless for some purposes (i.e. generation of an API), but still be useful for others (like DFIR investigations, security audits, educational purposes, etc.).

That said, we try never to cross the line where efficiency/performance and applicability of the generated code in real-world apps would suffer. For example, while it might be technically possible to specify a compression scheme in KS, it clearly shouldn't be used in production code. You'd rather call an existing native implementation than use the bulky and overly verbose one that KS might generate.

ksv, instead of showing the number after the field, forces us to descend 1-2 levels to see the actual value. When there are quite a few such fields, the visualization becomes complicated -- it is much easier to look at raw data than at parsed data

It's really a problem of a visualizer, not the issue with format definition. WebIDE has hints like -webide-representation to display values on top entity level, not forcing you to dive 1-2 levels deep into structures you don't want to see.

@dgelessus

If this is the main motivation behind this issue, I think it would make more sense to support enums on value instances in general - that would be more flexible than adding special support for just ASCII numbers.

But we already support enums on value instances — https://github.com/kaitai-io/kaitai_struct_tests/blob/master/formats/enum_of_value_inst.ksy?

generalmimon commented 4 years ago

@Mingun:

Yes, but that means you need to duplicate the int type for every possible enum type...

No, you don't. What prevents you from applying the enum from outside the int type:

meta:
  id: enum_ascii_num
seq:
  - id: animal_int
    type: int(2)
  - id: unit_int
    type: int(2)
instances:
  animal:
    value: animal_int.as_int
    enum: animal_enum
  unit:
    value: unit_int.as_int
    enum: unit_enum
enums:
  animal_enum:
    1: pig
    2: cow
    3: horse
  unit_enum:
    1: centimeter
    2: meter
    3: feet
types:
  int:
    params:
      - id: len
        type: u1
    seq:
      - id: as_str
        size: len
        type: str
        encoding: ASCII
    instances:
      as_int:
        value: as_str.to_i

I personally don't see any benefit in introducing an ASCII number type. In most cases, interpreting a string as a number is not needed for the primary task that KS performs, i.e. parsing. And when you're convinced that you need to do it on the KS side, the ASCII number type can't be used if the string uses an ASCII-incompatible encoding, e.g. UTF-16.

    radix: 10

It is possible to do string.to_i(radix), see https://doc.kaitai.io/user_guide.html#_strings
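For comparison, `to_i(radix)` behaves like Python's two-argument int conversion:

```python
# KS `as_str.to_i(8)` corresponds to int(s, 8) in Python, and so on:
assert int("644", 8) == 420     # octal, as in ar headers
assert int("ff", 16) == 255     # hexadecimal
assert int("z", 36) == 35       # base 36: digits 0-9, then a-z
```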

    bcd: false

We have the BCD type in the formats gallery. I think it's much better to import this KSY type to interpret a field as a BCD-encoded number than to have some built-in way to do it. The first reason is that it's easy to modify the bcd.ksy spec when you need some unusual variation: add a parameter to choose this option and merge it into the format gallery. If we had a built-in BCD, adding an option to it would probably mean that all runtime libraries need to be updated. Nobody wants to do that, and it's hard to ensure that all runtime libraries will behave the same.

And I frankly don't understand how Binary Coded Decimal is connected with ASCII number types. It looks like you want the proposed type: num to sometimes operate on ASCII characters and sometimes on the numerical values of the nibbles or bytes in the stream, depending on the bcd key.

On top of that, you'd need to introduce a lot of options (radix, bcd, endian and others necessary to change the parameters of the BCD) that are intricately bound together (some cannot be used with others, some are required if another is present, some change the behavior of others, etc.). The KSY options should be as independent and consistent as possible.

Mingun commented 4 years ago

The main goal of KS actually to describe that internal structure of a stream

The question is, where must we stop digging? Should we show individual bits of numbers? Or individual characters of a string (especially when a variable-size encoding, such as UTF-8, is used)? I think all of you would say no -- if you really need such internal structure, define it yourself in your own type.

That is exactly what I propose -- a built-in integer type that can be used when you are not interested in the internal structure of that type. If you need such structure, just don't use it and define it yourself.


No, you don't. What prevents you from applying the enum from outside the int type:

OK, that is possible, but compare how much work needs to be done just to achieve such a trivial goal! I think we can improve the balance between language richness and usability. Lacking the most frequently used things, and language bloat, are the two extremes to avoid. From that point of view, the existence of a built-in number type that can cover many use cases is better than its absence.

I personally don't see any benefit in introducing an ASCII number type

As I see it, it is a question of balance. For me, a built-in type would be useful. And, to be exact, UTF-16 is not compatible with ASCII (but UTF-8 is).

I think it's much better to import this KSY type to interpret a field as a BCD-encoded number than to have some built-in way to do it

It depends on the level at which you work. If you describe a protocol that mostly deals with individual bits, then the internal structure of such numbers can be useful. But when you work at a higher level, it begins to get in the way. Instead of working with a number, you have to constantly look into its internal structure, although you don't need anything from it other than the numerical value. It's not just an inconvenience for visualization. Using this number (actually -- a structure) in expressions becomes a pain: instead of writing num you need to write num.value. Yes, you can create an instance that hides num.value and use just the instance. But you need to do that for EVERY possible numeric field. Instead of describing the protocol itself, you start describing various helper stuff, and eventually you can't see the forest for the trees.

If we've had the built-in BCD, adding some option to it would probably mean that all runtime libraries need to be updated

Yes, that is the price for a substantial simplification of life. But:

  1. It is very unlikely that new parameters will be required, so it is very unlikely that an update will be required. Anyway, you already need to update the runtimes when the serialization branch is merged. I think that by then the new type can be implemented in all runtimes
  2. The built-in type has no goal of covering all possible situations, just most of them. Once again, it's a balance

And I frankly don't understand how is the Binary Coded Decimal connected with ASCII number types.

The word BCD has two parts -- Binary and Decimal. Decimal is the connection to ASCII numbers. In fact, while I was writing an example of its use, it outgrew ASCII numbers and became a type that can describe any integer. So, actually, the type should be named integer. For example, you can use this type to describe a BigInteger if the size is quite big (and for Java it will be translated to the BigInteger class).

On top of that, you'd need to introduce a lot of options (radix, bcd, endian and others necessary to change the parameters of the BCD) that are intricately binded (some cannot be used with another, others are required if some another is present, some change the behavior of others etc.) The KSY options should be independent and consistent as much as possible.

It's not a new concept. KSY already contains such options:

  1. terminator is used only with type: str or the implicit byte[] type; encoding is used only with string types; endian -- only with number types
  2. consume changes the behavior of terminator; endian changes how number parsing works
  3. repeat-expr and repeat-until cannot be used without the appropriate repeat key

So I do not see any problem with that. And actually, only two new keys are proposed -- radix and bcd.

dgelessus commented 4 years ago

Regarding my earlier comment:

If this is the main motivation behind this issue, I think it would make more sense to support enums on value instances in general - that would be more flexible than adding special support for just ASCII numbers.

As multiple people correctly pointed out, this is already supported - I didn't check properly before commenting. I was replying to this comment by @Mingun, which I interpreted as saying that enum on value instances doesn't work, but I might have just read it wrong.

dgelessus commented 4 years ago

We have the BCD type in the formats gallery. I think it's much better to import this KSY type to achieve interpreting the field as BCD-encoded number than having some built-in way to do it.

This sums up my opinion on this issue quite well. I completely agree that common types should be convenient to use, but that shouldn't require making the types built-in. Instead we should work on improving the situations where user types are less convenient to use than built-in types, for example with inlining (#88) as @KOLANICH has mentioned above.

Using this number (actually -- a structure) in expressions becomes a pain: instead of writing num you need to write num.value. Yes, you can create an instance that hides num.value and use just the instance.

And I think the proper fix for that is not to make the type in question built-in, but to add a KS feature to auto-generate these instances (or something similar to that effect). This would be much more flexible - you could use this feature with whatever custom types your format/protocol uses, not just a few ones that KS considers "common" enough to be built-in.

very unlikely, that new parameters will be required

There was actually a discussion in our Gitter chat a few weeks ago where someone was working with a BCD format that our existing bcd.ksy didn't support yet.

Anyway, you already need to update runtimes, when serialization branch will be merged. I think, that to that time new type can be implemented in all runtimes

I'm not sure I understand the argument here? Implementing serialization already takes a lot of work, how is it helpful to add more work on top of that (for an unrelated feature)?

In the word BCD two parts -- Binary and Decimal. Decimal -- is that connection to ASCII numbers.

Aside from being numbers, BCD numbers have very little to do with ASCII numbers though - their encodings are completely different. Also, nothing about ASCII numbers requires them to be decimal - as mentioned before, I've encountered octal ASCII number fields before.

For example, you can use this type to describe a BigInteger if the size is quite big (and for Java it will be translated to the BigInteger class).

What structure do you expect a "big integer" to have? Almost every language has its own in-memory representation for big integers, and there are a few different on-disk formats for big integers too.

KOLANICH commented 4 years ago

BigInteger

https://github.com/KOLANICH/kaitai_struct_formats/blob/cbor/serialization/cbor.ksy#L172L258

dgelessus commented 4 years ago

@KOLANICH Yes, there are binary formats that have big/variable-sized integers, but my point is that they are not standardized. There are vlq_128_le and vlq_128_be specs in ksf, but these are not the only way to represent big integers (as demonstrated by your CBOR example).

Mingun commented 4 years ago

I'm not sure I understand the argument here? Implementing serialization already takes a lot of work, how is it helpful to add more work on top of that (for an unrelated feature)?

I just mean that relatively soon all runtimes will be upgraded anyway. And, as I see it, there is still time to implement this feature and get it into the same release.

Also, nothing about ASCII numbers requires them to be decimal - as mentioned before, I've encountered octal ASCII number fields before.

Yes, the proposed syntax allows you to define the radix (or maybe name it base?). And don't get hung up on the word ASCII. At the beginning of this discussion I considered only such numbers, but by now the proposed change goes beyond them.

What structure do you expect a "big integer" to have? Almost every language has its own in-memory representation for big integers, and there are a few different on-disk formats for big integers too.

In that case I just mean that if you describe an integer in the proposed syntax as:

id: really_big_number
type: integer
size: 100500

then ksc can generate a field of type java.math.BigInteger, because the biggest native integer type in Java is long (19 digits in decimal form). And this type can still be used in integer contexts in Kaitai expressions
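In a language with arbitrary-precision integers, the decoding itself is a one-liner. A Python sketch (`read_big_uint` is a hypothetical name) of what such an oversized field would decode to:

```python
def read_big_uint(raw: bytes, big_endian: bool = True) -> int:
    # Interpret an arbitrarily long byte field as a single unsigned
    # integer; Python ints are arbitrary-precision, so this plays the
    # role java.math.BigInteger would play in generated Java code.
    return int.from_bytes(raw, "big" if big_endian else "little")
```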


Very hot discussion... maybe the issue number matters :)

KOLANICH commented 4 years ago
id: really_big_number
type: b100500
Mingun commented 4 years ago

This does not work (at least in the WebIDE):

Parse error: undefined Call stack: undefined Error: readBitsInt: the maximum supported bit length is 32 (tried to read 100500 bits)

KOLANICH commented 4 years ago

Of course not. But this kind of syntax already exists; no need to introduce a new one for that.

Mingun commented 4 years ago

It means a slightly different thing. It will interpret the number in the same manner as i8, i16, i32 and so on. The proposed syntax does the same thing, but works at the byte or nibble level. So it is not just new syntax.

KOLANICH commented 4 years ago

I don't understand what you mean at all.

Proposed syntax do the same thing, but works on byte or nibble level.

Do you still mean the builtin BCD in https://github.com/kaitai-io/kaitai_struct/issues/666#issuecomment-573335738? If so, the proposed syntax is completely unacceptable for BCD. If not ...

uX has the semantics of an X-byte unsigned integer. sX has the semantics of an X-byte signed integer. bX has the semantics of a bit-sized unsigned integer. It is also proposed to have bsX for bit-sized signed integers. There is also a proposal (and even a Python impl) for supporting wild endiannesses (#76) and for applying endianness to bit-sized integers (#155). So, are we missing anything that is needed?

Very hot discussion... maybe issue number matter :)

You have definitely failed to get a get ;)

Mingun commented 4 years ago

So, do we miss anything needed?

Yes: the ability to express damn simple human-readable numbers, like 01234 (as bytes, it is 0x30 0x31 0x32 0x33 0x34). OK, consider the new type a generalization of the existing integer types that gives you new capabilities.

- id: some_number
  # Because the type for strings has the short name `str` instead of the long `string`,
  # use the short `int` instead of the long `integer` for consistency
  type: int
  # Unit in which size is expressed. Default: ascii
  # ascii is a bit of an outsider: it is like byte, but only some bit patterns are allowed:
  # only [0x30-0x39, 0x41-0x5A, 0x61-0x7A]
  #        '0'-'9'    'A'-'Z'    'a'-'z'
  unit: bit | nibble | byte | ascii
  # Number of units to represent integer. Required
  size: 1 -- +inf
  # Base of the counting system in which the number is represented
  # bit: forbidden
  # nibble: 2-16, default: 10
  # byte: 2-256, default: 256
  # ascii: 2-36, default: 10
  base: 2 -- 256
  # Order of units. Required, if size > 1
  # => is equal to be
  # <= is equal to le
  # Read it as "to get a number, read units by arrow"
  endian: be|=>|le|<=

Examples:

bXX:
  # examples (size = 16):
  #   [0x12 0x34] == [0b_0001 0b_0010 0b_0011 0b_0100]:
  #     - endian: => or be
  #       value:  0b_0001_0010_0011_0100 === 0x1234
  #     - endian: <= or le. Fixes #155
  #       value:  0b_0010_1100_0100_1000 === 0x2C48
  seq:
    - id: value
      doc: 'Built-in bit types: b1, b2, ..., bXX'
      type: int
      unit: bit
      size: XX
      endian: =>
bcd:
  # examples (size = 4):
  #   [0x12 0x34]:
  #     - endian: => or be
  #       value:  1234
  #     - endian: <= or le. Fixes #155, @JaapAap example
  #       value:  4321
  seq:
    - id: value
      doc: Binary Coded Decimal
      type: int
      unit: nibble
      size: XX
      endian: =>
uXX:
  # examples (size = 4):
  #   [0x12 0xAB]:
  #     - endian: => or be
  #       value:  0x12AB
  #     - endian: <= or le
  #       value:  0xBA21
  seq:
    - id: value
      doc: 'Built-in byte types: u1, u2, ..., uXX'
      type: int
      unit: nibble
      base: 16
      size: XX
      endian: =>
uXX:
  # examples (size = 2):
  #   [0x12 0x34]:
  #     - endian: => or be
  #       value:  0x1234 == 0x12*256 + 0x34 == 18*256 + 52 == 4660
  #     - endian: <= or le
  #       value:  0x3412 == 0x34*256 + 0x12 == 52*256 + 18 == 13330
  seq:
    - id: value
      doc: 'Built-in byte types: u1, u2, ..., uXX'
      type: int
      unit: byte
      base: 256
      size: XX
      endian: =>
decimal:
  # examples (size = 2):
  #   [0x31 0x32] == ['1' '2']:
  #     - endian: => or be
  #       value:  12
  #     - endian: <= or le
  #       value:  21
  seq:
    - id: value
      doc: Numbers in ASCII representation
      type: int
      unit: ascii
      size: XX
      endian: =>
oct:
  # examples (size = 2):
  #   [0x31 0x32] == ['1' '2']:
  #     - endian: => or be
  #       value:  0o12 == 1*8 + 2 == 10
  #     - endian: <= or le
  #       value:  0o21 == 2*8 + 1 == 17
  seq:
    - id: value
      doc: Numbers in octal ASCII representation
      type: int
      unit: ascii
      base: 8
      size: XX
      endian: =>
hex:
  # examples (size = 2):
  #   [0x41 0x42] == ['A' 'B']:
  #     - endian: => or be
  #       value:  0xAB == 10*16 + 11 == 171
  #     - endian: <= or le
  #       value:  0xBA == 11*16 + 10 == 186
  seq:
    - id: value
      doc: Numbers in hexadecimal ASCII representation
      type: int
      unit: ascii
      base: 16
      size: XX
      endian: =>
base36_number:
  # examples (size = 2):
  #   [0x50 0x42] == ['P' 'B'] ('P' == 25, 'B' == 11 as base-36 digits):
  #     - endian: => or be
  #       value:  25*36 + 11 == 911
  #     - endian: <= or le
  #       value:  11*36 + 25 == 421
  seq:
    - id: value
      doc: Numbers in base36 ASCII representation
      type: int
      unit: ascii
      base: 36
      size: XX
      endian: =>
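To check that the examples above are mutually consistent, the proposed unit/base/endian semantics can be prototyped in a few lines of Python. This is a speculative reading of the proposal, not an official implementation; little-endian is taken to reverse the order of the units, as in the examples:

```python
def decode_generalized_int(raw: bytes, unit: str, base: int,
                           big_endian: bool = True) -> int:
    # Prototype of the proposed `type: int` generalization.
    # unit: 'bit' | 'nibble' | 'byte' | 'ascii'
    # Little-endian reverses the order of the *units* before combining.
    if unit == "bit":
        digits = [(b >> (7 - i)) & 1 for b in raw for i in range(8)]
        base = 2  # base is forbidden for bit units in the proposal
    elif unit == "nibble":
        digits = [d for b in raw for d in (b >> 4, b & 0x0F)]
    elif unit == "byte":
        digits = list(raw)
    elif unit == "ascii":
        digits = [int(chr(b), base) for b in raw]
    else:
        raise ValueError(f"unknown unit: {unit}")
    if not big_endian:
        digits.reverse()
    value = 0
    for d in digits:
        value = value * base + d
    return value
```

Running it against the examples: nibble/base 10 gives 1234 (be) and 4321 (le) for 0x12 0x34; byte/base 256 gives 4660 and 13330; ascii/base 8 gives 10 for "12"; ascii/base 16 gives 171 for "AB".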

As for #76: do we really need such complicated standard types for situations as exotic as non-8-bit groups of bits? Do they even exist in general? Here, I think, custom types are quite enough.

KOLANICH commented 4 years ago

ASCII nums don't look like it makes sense to build them into KS itself.

As for #76: do we really need such complicated standard types for situations as exotic as non-8-bit groups of bits? Do they even exist in general? Here, I think, custom types are quite enough.

IDK. I have heard about strange "mixed" endiannesses (where 2-byte chunks are ordered using one endianness and single-byte chunks within them are ordered using another) used in some hardware. IDK how many variations of these exist (I also don't know why BE exists, because IMHO LE seems to be the only natural encoding scheme, since a chunk's offset (in terms of chunks) matches the power of its coefficient). So I have designed a way to encode them.

dgelessus commented 4 years ago
  type: int
  unit: bit | nibble | byte | ascii
  size: 1 -- +inf

What's the advantage of splitting this information over three attributes instead of the existing bX/uX system? You could just use bcdXX and aXX for the new types. Also, using size to specify the number of "digits" is incompatible with its existing meaning in all other cases - for example even if you have a type: str field with encoding: UTF16LE, size is given in bytes, even though UTF-16 strings consist of 2-byte code units.

  base: 2 -- 256

Is this needed for anything except ASCII? I'm not aware of any BCD variants with bases other than 10, and all other number formats effectively have byte digits with base 256.

  # Order of units. Required, if size > 1
  # => is equal to be
  # <= is equal to le
  # Read it as "to get a number, read units by arrow"
  endian: be|=>|le|<=

This doesn't need to be a separate attribute either - endianness is normally specified file-wide, and you can override it by using types like u4le. Also, ASCII numbers are always big-endian in practice, so it doesn't make much sense to allow (or even require) an endian attribute, or have them be affected by the file-wide default endianness.

If you want to propose alternative syntax for endian, that would be better in a separate issue. I think it's unlikely to be added though, there's not much use in adding two new names that mean exactly the same thing as the existing and well-known be and le.

I don't really see the advantage of a generalized int type that adds a lot of variants that will never be used in practice - keep in mind that all possible combinations would need to be tested. As far as I can tell, all the actually useful variants (normal packed integers, BCD, and ASCII numbers) are already handled by existing language features and standard specs, with the only downside being that with BCD and ASCII numbers you have to write field.value instead of field (for which I've suggested a possible solution above).

GreyCat commented 4 years ago

Let me draw a bottom line for this discussion:

So, in relation to the original question — please contribute ASCII number type to ksf/common, and enhance BCD type there if it does not cover your use cases. We'll eventually get to implement them natively if/when it would be clear that it brings benefits.

Mingun commented 4 years ago

What's the advantage of splitting this information over three attributes instead of the existing bX/uX system?

  1. With size you will be able to define integers with context-dependent size -- for example, consuming digits up to some terminator.
  2. It expresses intent more clearly
  3. It is not a replacement, but a generalization to non-standard situations that can't (or can hardly) be handled with the current approach
  4. It can produce better errors when parsing
  5. Parsing such numbers can consume less memory, because you do not need to store the array from which they originate; you can construct the number on the fly

Also, using size to specify the number of "digits" is incompatible with its existing meaning in all other cases - for example even if you have a type: str field with encoding: UTF16LE, size is given in bytes, even though UTF-16 strings consist of 2-byte code units.

This is because currently the minimal addressable unit is a byte. But if you think that is strange, size can always be expressed in bytes -- but then representing bit integers whose size is not a multiple of 8 will be impossible in this syntax, with all its advantages. If the loss of this possibility seems a more acceptable alternative than a breach of uniformity in size units, I am not opposed. For BCD, for example, even if the digit count is odd, the number often (if not always) occupies a whole number of bytes, and just one nibble is unused.

Is this needed for anything except ASCII? I'm not aware of any BCD variants with bases other than 10, and all other number formats effectively have byte digits with base 256.

I don't know (c). It is trivial to implement, so... why not? If you are strictly opposed to this, that ability can be forbidden; I just do not see any benefit from that.

This doesn't need to be a separate attribute either - endianness is normally specified file-wide, and you can override it by using types like u4le. Also, ASCII numbers are always big-endian in practice, so it doesn't make much sense to allow (or even require) an endian attribute, or have them be affected by the file-wide default endianness.

"Required" here has its ordinary meaning from other endian-sensitive types -- required if a global endian is not defined. For numbers with byte/ascii units, endian is not required at all (as for u1/s1). It is required for nibble and bit units, and allows defining, for example, different schemes of BCD numbers natively

If you want to propose alternative syntax for endian, that would be better in a separate issue.

Of course. Here it is absolutely optional

I don't really see the advantage of a generalized int type that adds a lot of variants that will never be used in practice - keep in mind that all possible combinations would need to be tested.

You just don't need them. That is not a strict reason for a veto :). As I said above, this type adds consistency to the system. It is simply a logical continuation of the string types and even of the ordinary float! Float numbers are complex types: they consist of a sign bit, exponent bits and mantissa bits. Why does practically no one have a desire to know this internal structure? Because no one needs it. Knowing that the mantissa is represented by one integer and the exponent by another gives nothing -- you still can't use them separately from each other. Just as a single string character makes no sense -- in most cases, it doesn't matter what bytes it occupies. You will never work with individual characters in a protocol that transmits an entire string, or separately with the mantissa and exponent in a protocol that transmits a float. Therefore, it is logical that for them there are built-in types that are mapped to languages' built-in types. The proposed number type is no different from them, so I believe that all arguments like "but how will the user see their internal structure" can be repudiated. If one really needs that, one can always reimplement integer, or string, or float in terms of bytes and bits.

with the only downside being that with BCD and ASCII numbers you have to write field.value instead of field (for which I've suggested a possible solution above).

It also leads to a less-than-ideal API in the generated classes, and while another discussion suggested that the API is not the main thing, I will allow myself to disagree. If the API were unnecessary, there would be no generators for different languages. And in general, what would you do with KSY files without the parsers generated from them? Use them in a couple of programs that know how to understand them? No, the generated API is one of the most important things. So improvements in the language are needed to be able to generate more convenient APIs.


So, in relation to the original question — please contribute ASCII number type to ksf/common, and enhance BCD type there if it does not cover your use cases. We'll eventually get to implement them natively if/when it would be clear that it brings benefits.

So, of course, I will implement the needed types and create a PR, but I would not like this to remain yet another unimplemented feature. I tried to provide arguments regarding consistency, usability and capabilities. I think I have done everything I could for the moment

KOLANICH commented 4 years ago

Now I understand your proposal. What I like here is that it separates integer-ness and size into separate YAML fields, which is beneficial for tools (one doesn't have to parse and serialize the custom format in type; instead we can rely on a YAML lib). The issue is that it would take a lot of space in ksy files and would be slower to type. And we usually need a lot of such fields, so IMHO it isn't worth splitting 1 YAML field into 3.