kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.02k stars 197 forks source link

Validation of parsed data with assertions #435

Open GreyCat opened 6 years ago

GreyCat commented 6 years ago

Extracting easier part of #81, with a concrete proposal on implementation.

Context: we want to have some sort of way to state that certain attributes have some sort of "valid" values. Stating that has two implications:

  1. All values which are "not valid" must trigger an error of parsing
  2. In some cases, we can calculate values automatically during generation of stream

The simplest existing implementation is contents: it is essentially a byte array, which validates that its contents are exactly as specified in ksy file. On parsing, we'll throw an error if byte array contents won't match the expected value. On generation, we can just write the expected value, and not bother the end-user with setting the attribute manually.

Syntax proposal

Add valid key to the attribute spec. Inside it, there could be:

- id: foo
  type: u4le
  valid: 0x42 # expected in stream: 42 00 00 00
- id: bar
  type: strz
  valid: '"abcd"' # expected in stream: 61 62 63 64 00
- id: baz
  type: u2le
  valid: '2 * foo' # expected in stream: 84 00

All expected and expression values are KS expression strings, which can be constants or can depend on other attributes. list-of-expected must be a YAML array of KS expression strings. expected inferred type must be comparable with type of the attribute. expression inferred type must be boolean.

KOLANICH commented 6 years ago

You have suggested to check such relations using syntax in YAML. Having this syntax in YAML has both drawbacks and profits. The drawback is that we have to support it alongside with usual expressions. The profit is that it is yaml, any tools working with yaml would be able to parse it.

So shouod we drop our current expr syntax and start using entirely yaml-based?

GreyCat commented 6 years ago

So shouod we drop our current expr syntax

What exactly is "our current expr syntax", and what do you mean by "dropping" it? We had no syntax proposal for validation until now, and I suspect what you imply is covered in valid/expr, for example, checking that value is even:

- id: foo
  type: u4
  valid:
    expr: _ % 2 == 0
webbnh commented 6 years ago

A map, which can feature keys

If the field is supposed to contain an enum value, can I check that by specifying a key of enum, or can any-of take the enum type-name instead of a list?

GreyCat commented 6 years ago

@webbnh Probably for enums we can just do the following:

  1. Ensure that in all languages "invalid" enum values does not trigger a error by default
  2. Add extra key here, something like that:
- id: foo
  type: u4
  enum: bar
  valid:
    enum-key: true
KOLANICH commented 6 years ago

What exactly is "our current expr syntax", and what do you mean by "dropping" it?

What you have proposed can be expressed as a logical expression, assumming that we have some facilities. eq: expected <=>

throw:
  if: _ != expected

min: expected <=> if: _ < expected max: expected <=> if: _ > expected any-of, enum-key <=> #434

But we can go from another side. We can eliminate textual expressions and force the developer into using YAML AST.

for example if: a != 0 and b > _io.size would be something like

size:
  and:
    - '!=':
      - a
      - 0
    - '>':
      - b
      - '_io.size'

I know that it is monstrouous space-wasting and unusable, but I'm not sure ;)

GreyCat commented 6 years ago

My point here is actually very simple: if we'll just do it as one expression — which we actually still can do, i.e. stuff like that is legal:

valid:
  expr: _ == 42 # the same as `eq: 42`
valid:
  expr: _ >= 42 # the same as `min: 42`

It's easy to do, but we lose declarativity this way. If it's only an expression, one can't just look at this validation spec and render it as a range input (for example, if a min/max boundaries are known), or as a combo box with a set of choices (if a set of choices is known). Basically, if we only have a validation expression, the only thing we can do (without reversing it and applying some symbolic solver) is checking every particular value for validity, and that's it. Declarative configuration offers more, for example, you can clearly get a list of valid values, or boundaries, or stuff like that. Proposed mechanism is also extensible. For example, if we'll want to add size limits of strings in future, we can do something like

valid:
  max-size: 42

and it could be rendered as string with limited input box, not asking a user to enter arbitrary string and then rejecting it for some mysterious reason.

KOLANICH commented 6 years ago

I haven't thought about using that info for metadata for GUI purposes. It's cool idea. In this case

reversing it and applying some symbolic solver

IMHO is the best way to do it. If we have problems with performance we can try to cache solver results.

It's easy to do, but we lose declarativity this way.

I guess we don't lose declarativity, since we still declare "this value must satisfy this logical expression" and all the needed info is there, but we lose some verbosity. What we win, is that we don't make an assumption about programmer using the proposed verbose syntax instead of exprs. So even if a programmer used exprs, if we have a system recovering these info from exprs, we would expect it working correctly, even retroactively, when a new extractor is added and it becomes working even on old specs without any modification

kalidasya commented 5 years ago

Please take into account bit sized values like:

- id: fixed_10
  type: b2
  contents: [2]

- id: fixed_10
  contents: [true, false]
GreyCat commented 5 years ago

I went forward and do some proof-of-concept implementation in current compiler. Namely:

generalmimon commented 5 years ago

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Test code:

meta:
  id: valid_if_test
seq:
  - id: optional
    if: false
    type: u1
    valid: 42

produces (in JavaScript, but it happens in all languages):

ValidIfTest.prototype._read = function() {
  if (false) {
    this.optional = this._io.readU1();
  }
  if (!(this.optional == 42)) {
    throw new KaitaiStream.ValidationNotEqualError(42, this.optional, this._io, "/seq/0");
  }
}
GreyCat commented 5 years ago

@generalmimon

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Probably moving it inside if would make sense, good catch! Can I ask you to contribute relevant test to tests repo?

generalmimon commented 5 years ago

Sure.

GreyCat commented 4 years ago

Recent compiler updates fixed ValidFailInst and ValidNotParsedIf tests by @generalmimon. Thanks for enhancing our test base!

suhr commented 4 years ago

Not closely related, but you may find it interesting how CUE unifies types and constrains: https://cuelang.org/docs/concepts/logic/

GreyCat commented 4 years ago

Implemented min, max and any-of clauses. Works for C++, JS, Python, Ruby. Pending runtime fixes for Java.

Mingun commented 4 years ago

May be name key expect? I think, is sounds better:


id: answer
expect: 0x42
expect:
  min: 42
  max: 42
  any-of: [answer]
  expr: _ == 42
generalmimon commented 4 years ago

AFAIK, this feature is basically implemented (all valid sub-keys described in https://github.com/kaitai-io/kaitai_struct/issues/435#issue-323332353 work), but no proper type checks if the type of the attribute can be compared with the type of the expressions in valid sub-keys (like eq, min, any-of, ...) are done yet. For example, this spec should raise the same "can't compare ... and ..." error as the == operator currently does:

meta:
  id: attr_valid_eq_bad
seq:
  - id: foo
    type: u1
    valid:
      eq: '"bar"'

So that means implementing the checks in compiler + adding relevant ErrorMessagesSpecs testing if the error message is correct. I think it makes sense to make the message look like that one in the existing test for == type mismatch: tests/formats_err/expr_compare_enum2.ksy:1

However, I think that this little thing can wait until we release the 0.9 version, so I'm moving this issue from the 0.9 milestone to 0.10.

sykhro commented 3 years ago

Given that the keyword is mostly implemented, is it time to add it to the .ksy schema? This would be beneficial for users (like me) that use a YAML language server for validation while making structures

krisutofu commented 2 years ago

I found out that it will only pay attention to the first valid key of the field but ignore the others.

seq:
  - id: first_byte
    type: u1
    valid:
      eq: 0x00
    # ignores this:
    valid: 0x01
    # ...
    valid:
      expr: _ >= clip_bottom
      # ignores these:
      expr: _ <= clip_top
      expr: _ == 0x7 or _ == 0x17 or _ == 0x23 or _ == 0x68 or _ == 0xB5

A reasonable behaviour would be a logical AND of all valid expressions.

It strictly separates min/max, eq, any-of and expr so that combinations of these are not accepted. The error message might be confusing, stating that the key is unknown. Writing it like this is not possible:

seq:
  - id: clip_bottom
    type: u1
  - id: clip_top
    type: u1
  - id: byte
    type: u1
    valid:
      min: clip_bottom
      max: clip_top
      any-of: [0x7, 0x17, 0x23, 0x68, 0xB5]

If this would work it would be a fair solution:

    # ...
    valid:
      min: clip_bottom
      max: clip_top
    valid:
      any-of: [0x7, 0x17, 0x23, 0x68, 0xB5]

The only combination, which works at the moment, is one-time min + max. Multiple min or max are ignored. A useful behaviour would be an n-ary min and max. But if these would be ANDed instead, multiple min would validate an infimum and multiple max a supremum.

This is not a pressing issue. There is no functional limitation. One could write the set of validation formulas as a single formula:

    # ...
    valid:
      expr: |
        _ >= clip_bottom and
        _ <= clip_top and
        (_ == 0x7 or _ == 0x17 or _ == 0x23 or _ == 0x68 or _ == 0xB5)
KOLANICH commented 2 years ago

I found out that it will only pay attention to the first valid key of the field but ignore the others.

.ksy is the file extension, where the letter y stands for YAML. What is placed into ksy files is a subset of YAML. YAML is the language to serialize and deserialize possibly nested data like in JSON.

in JavaScript it'd look like


let field = {};
....
field.valid = {eq: 0x00};
field.valid = 0x01;

As you see, the latter assignments override the former. When you store data it is clear that this must never happen. Different YAML parsers react differently on that. Some raise an error. Other ones just take one of the values and silently ignore the rest.

So your mental model is wrong, repeated keys don't mean what you assume them to mean.

krisutofu commented 2 years ago

... in JavaScript it'd look like

let field = {};
....
field.valid = {eq: 0x00};
field.valid = 0x01;

As you see, the latter assignments overide the former. When you store data it is clear that this must never happen. Different YAML parsers react differently on that. Some raise an error. Other ones just take one of values and silently ignore the rest.

So your mental model is wrong, repeated keys don't mean what you assume them to mean.

Sorry, thank you for clarifying.

Omar-Abdul-Azeez commented 1 year ago

Considering that we can assign contents: ["some string", 0x30, 30] shouldn't we also be able to assign

seq:
  - id: sig
    size: : 6
    valid:
      any-of: [ [ "AA", 0, 0, 1, 0 ] , [ "BB", 0, 0, 1, 0 ] ]

currently the compiler throws an error as it expects "string"? I assume that means KSY expressions so I tried [ '["AA", 0, 0, 1, 0]' , '["BB", 0, 0, 1, 0]' ] to no avail... can't combine output types: CalcStrType vs Int1Type(true)

Heck even the painful approach doesn't work:

seq:
  - id: sig1
    type: str
    size: : 2
    valid:
      any-of: [ "AA", "BB" ]
  - id: sig2
    contents: [ 0, 0, 1, 0 ]

KSC seems to evaluate AA and BB as KSY expressions which makes it search for ids matching those values but an id can only have lower case letters hence an error is thrown...

demberto commented 1 year ago

I accidentally discovered that this was a thing. When will the relevant keywords be added to the docs and the YAML schema (the kaitai-struct-vscode plugin uses it).

krisutofu commented 1 year ago

I accidentally discovered that this was a thing. When will the relevant keywords be added to the docs and the YAML schema (the kaitai-struct-vscode plugin uses it).

Which version of the VSCode extension do you have? I have v0.8.1 and it does not have the valid-expressions included. They are underlined red in my version. It seems to be outdated sinde it doesn't know the bit-endian meta property.

A year ago, I tried to contribute documentation about the valid property but it was entirely ignored. Here it is: valid expression documentation (Jan, 2022)

demberto commented 1 year ago

Which version of the VSCode extension do you have? I have v0.8.1

Same.

generalmimon commented 1 month ago

I almost forgot to mention that there is now a new subkey of valid called valid/in-enum based on the idea in https://github.com/kaitai-io/kaitai_struct/issues/435#issuecomment-389597122. I suppose this is also related to https://github.com/kaitai-io/kaitai_struct/issues/778, because it was also discussed there. It's implemented for all 9 languages where valid is implemented in the first place (well, after checking ValidFailInEnum at https://ci.kaitai.io/, except for Rust, which would be the 10th such language). The only thing I changed was the name. IMHO valid/in-enum sounds more like a validation constraint than valid/enum-key, and it seems to be a more intuitive name when you see it in an actual .ksy spec like this:

seq:
  - id: foo
    type: u4
    enum: animal
    valid:
      in-enum: true

enums:
  animal:
    4: dog
    12: chicken

valid/in-enum: true checks that the parsed enum value is "in enum", i.e. that it is a known value defined in the enum specified by the enum key. It's functionally equivalent to using valid/any-of with an exhaustive list of all enum entries:

seq:
  - id: foo
    type: u4
    enum: animal
    valid:
      any-of:
        - animal::dog
        - animal::chicken

enums:
  animal:
    4: dog
    12: chicken

... but obviously more convenient to use and easier to maintain. Of course, this valid/any-of pattern still has a use case when you only want to allow a subset of enum values, but if the intent is that it can be any value in the enum, valid/in-enum: true should be preferred.

It's intended only for fields with the enum key. If used on a field without it, a compilation error is planned, but it's not implemented yet at the time of writing (as with most other valid compile-time checks).

In case you were wondering, true is the only allowed value for valid/in-enum. valid/in-enum: false won't work, and there's an already passing test for that:

attr_bad_valid_in_enum_false.ksy: /seq/0/valid/in-enum:
    error: only `true` is supported as value, got `false` (if you don't want any validation, omit the `valid` key)