kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
3.98k stars 194 forks source link

It is possible to stop parsing by raising parse error? #674

Open Mingun opened 4 years ago

Mingun commented 4 years ago

In the financial industry the ISO 8583 protocol is widely used. This protocol consist of:

  1. A bitmap -- several bytes, each bit in their determines, that fields present in the message
  2. Message fields

For example, first 2 fields of many ISO 8583 implementations can be described by the following ksy schema:

meta:
  id: iso8583
  endian: be
types:
  str2:
    seq:
      - id: length
        type: u2
      - id: value
        size: length
        type: str
        encoding: ASCII
seq:
  - id: bitmap
    type: b1
    repeat: expr
    repeat-expr: 64
  - id: secondary_bitmap
    type: b1
    repeat: expr
    repeat-expr: 64
    if: bitmap[0]
  - id: pan
    doc: n..19
    type: str2
    if: bitmap[1]

Protocol itself is not describes, how to determine field length. So it is not backward-compatible, if you add new field between already defined fields, because old clients will not known, how to skip unknown field -- they do not known its length. That means, that parsers must throws an exception, when it comes with unknown field.

How to express that in ksy? Right now, if delete secondary_bitmap member from definition and try to parse message, that contains both, secondary_bitmap and pan fields, ksy will parse bytes from secondary_bitmap, as pan field

GreyCat commented 4 years ago

Can you use #435 for it, with something like any-of?

Mingun commented 4 years ago

Until I understand how it will help. What exactly it should validate?

I had the thought of suggesting a new bitmap type, maybe as replacement of seq:

meta:
  id: iso8583
bitmap:
  size: 64 # 64bit
  # bit, in which secondary bitmap is located. Some ISO 8583 variations
  # uses not first, but last (64) bit for secondary bitmap
  continue: 1
  # number of bits, [1..size]. "continue" bit can not be used
  2:
    id: pan
    type: str2
  64:
    id: mac
    size: 8
    encoding: hex # an example of #668 - read 8 bytes, decode hex and get 4 bytes in mac
types:
  str2:
    ...
GreyCat commented 4 years ago

Um, title suggested that you wanted to stop parsing by raising an error. I don't quite understand the condition on which you want to stop, but if you can explain it, then probably that can be used as validation expression. Failing validation raises parse error.

As far as I understand, you have a sequence of bits that you either know and can describe what a bit stands for (and which additional payload it should bring), or you don't know how to handle that payload and thus you demand that bit to be zero (otherwise that means you've encountered something you can't parse). Thus, it boils down to:

seq:
  # "bitmap"
  - id: has_secondary_bitmap
    type: b1
  - id: has_pan
    type: b1
  - id: reserved1
    type: b62 # consumes the rest of 64-bit sequence
    valid: 0 # we expect that no other bits are used

  # payloads that we know how to parse, depends on "bitmap" bits
  - id: secondary_bitmap
    # ...
    if: has_secondary_bitmap
  - id: pan
    # ...
    if: has_pan

Unfortunately, I don't really understand the proposal for "bitmap" type as replacement for "seq". Can you clarify the problem you're trying to solve with it?

Mingun commented 4 years ago

Yes, all exactly as you describe. But, because I suspect, there can be similar formats, that has defined rules to determines payload boundaries, it will be also useful to allow define default case, that will be used for not described fields.

Unfortunately, I don't really understand the proposal for "bitmap" type as replacement for "seq". Can you clarify the problem you're trying to solve with it?

Not a replacement in the sense of completely removing seq. Just another case of top-level structure of the type. Should automate description of the type, that you manually written above, but without necessary duplicate names (XXX and has_XXX), which can lead to mistakes. Besides, that special type definition can produce better errors. In your example error will look like: "reserved1 should be 0, but XXX is encountered" or similar. Or, if describe it as

  - id: unknown
    type: b1
    repeat: expr
    repeat-expr: 62
    valid:
      all: false # Arbitrarily choosed syntax to check individual elements of array

"unknown[XXX] should be false but it is true". If anything, I don't see how an error in such a very general mechanism can be improved for that case.

Also, it is trivially to specify default case in that syntax:

# Raises a parse error with nice message if encounter bit, not described in the message, ex.:
# "Unknown field #X in a bitmap message <type name>"
bitmap:
  size: 64
  1: secondary_bitmap
  2: pan
  ...

# All not described fields parsed with `default` type
bitmap:
  size: 64
  default: # regular type definition (bitmap also possible :))
    seq:
      - id: len
        ...
  # default: default_case # also accept that
  1: secondary_bitmap
  2: pan
  ...

Also, it shouldn't be forgotten that non-continuous regions of bits can be defined (ex: 1, 10, 20), i.e. several reserved definitions will be needed and all of them will need to be checked, if some checks will be required. You will have to manually maintain bitmap and field definitions in sync

GreyCat commented 4 years ago

I don't see much value in this top-level structure proposal: bit flags already provide pretty solid implementation, and it's much more flexible. I don't see having xxx + has_xxx as big burden, and having it spelled explicitly allows you to control order in which these additional payloads are appearing, for example:

seq:
  - id: has_foo
    type: b1
  - id: has_bar
    type: b1
  - id: reserved
    type: b6
    value: 0 # no need for complicated all elements of bit array are 0 checks
  - id: more_attributes_here
    size: 1234
    # there could be actually more attributes here
  - id: bar # bar comes 1st
    # ...
    if: has_bar
  - id: foo # foo comes 2nd, despite order of bits if "foo", "bar"
    # ...
    if: has_foo

"reserved1 should be 0, but XXX is encountered" "unknown[XXX] should be false but it is true"

I don't see how these two are substantially different: both do the job and both allow to pinpoint what exactly is false. On the contrary, addressing bits as [XXX] is generally very confusing, as there is no universal agreement on what "bit 27", for example, is.

Mingun commented 4 years ago

I don't see how these two are substantially different:

I say both messages are equally bad. And I see no way to improve them in the current implementation

GreyCat commented 4 years ago

Ok, what message would you consider a better choice?

Mingun commented 4 years ago

As I say above in code example, something like

Unknown field #\ in a bitmap message \. Unknown fields breaks message layout, so can't be skipped and prevent further parsing

or

Not known how to parse field #\ or it's size. Field can't be skipped in a bitmap message \

where <X> is bit number