kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.04k stars 199 forks source link

Consistently handle parsed enum values that are not listed in the enum #778

Open dgelessus opened 4 years ago

dgelessus commented 4 years ago

At the moment, Kaitai Struct doesn't consistently handle parsing of enum values that are not defined in the ksy. For example, with a spec like this:

seq:
  - id: value
    type: u1
    enum: thing
enums:
  thing:
    0: zero
    1: one
    2: two

When parsing the byte array [0x03], i. e. with the value field set to 3, which is not listed in the thing enum, it's not clear what's supposed to happen. The documentation doesn't say anything about this case, and the behavior is inconsistent across target languages. Here's a summary of how each target language currently handles this case (I don't know all of these languages very well, please correct me if any of this is wrong):

In summary, most targets allow enum fields to have values that are not listed in the ksy, but some of the targets replace unknown values with null/nil or treat them as an error. I think it would make sense to change the behavior of the remaining target languages to allow unknown values as well.

The main problem with this is that unknown enum values are difficult to represent in languages like Java, which are strongly typed but don't have native support for something like strongly typed unions ("either" types) or case classes. Allowing unknown enum values would require defining/generating a custom wrapper class that can contain either a known enum object value or an unknown raw integer value, but this would make them less convenient to use in code that doesn't care about unknown enum values.

Related to #300 and #568, and somewhat related to #518.

ams-tschoening commented 4 years ago

I've been hit by this as well today again and in my opinion there are three choices, which some KSY-author could choose amongst when defining an enum and how to react regarding unknown values. In my opinion it's important to understand that NOT the target language defines how to deal with those cases, but really the KSY-author depending on the nature of enums.

While all target languages need to implement all three choices in the end, it would be totally OK to simply throw exceptions in the compiler if some target language lacks support. This is sometimes done already and allows KSY-authors to either implement support or choose some other strategy already supported. In any case it's better than now, because authors would be made aware of some problem, while currently things might fail or behave unexpected in different target languages on runtime.

Throw exceptions on unknown values.

Sometimes unknown values are simply wrong from the point of view of an author of some format, but people could provide arbitrary wrong values. While it's sometimes possible to check those things with validations, it doesn't make much sense to force KSY-authors to do so when they have already decided about a whitelist of valid values. They wouldn't use an enum otherwise. Simply give them the power to decide to throw exceptions, which is something I would implement in custom Java-code in those cases most likely as well.

Would be important to provide a good error message, though, containing the invalid value and not only a message about "things" being unsupported or such. Some syntax to support this could simply be a specially named key, triggering KS to generate throwing code and maybe even provide a custom prefix for the error message.

enums:
  ip_protocol:
    1: icmp
    6: tcp
    17: udp
    throw|die: ""|"Unsupported protocol, details: "

Map unknown values to a concrete enum-instance.

This is especially useful in cases where one has pretty complex formats with some types able to represent a lot of different decisions, while one simply wants to focus on only some of those. It's not that all other values are invalid and exceptions need to be thrown, it's only that one doesn't care about anything else "right now". A good syntax to support this would be like with default branches for switch-on:

- id: body:
  type:
    swith-on: type
    case: 
      'type_code::type1': type1
      'type_code::type2': type2
      _: generic_type

vs.

enums:
  ip_protocol:
    1: icmp
    6: tcp
    17: udp
    _: unknown|not_interested|dont_care|...

The difficult thing is which value should be stored in those cases? For Java, code is rendered which uses long, but how can KS easily know what a "valid" unknown value looks like? Things like 0 and -1 could easily be used for the enum already. Maybe switching to storing objects is a good choice, making null available. That concept is available already anyway in most/all target languages because of instances.

Besides that, things might be easy for languages like Java using maps, providing accessors with default values:

public static ParOmsAplVibUnit byId(long id) { return byId.get(id); }

vs.

public static ParOmsAplVibUnit byId(long id) { return byId.getOrDefault(id, UNKNOWN|NOT_INTERESTED|DONT_CARE|...); }

Custom wrapper providing known and unknown values.

That's what @dgelessus suggested already: A wrapper tells users if a known or unknown value is available and in the latter case makes the unknown one available. It's somewhat how Optional in Java or other languages work. Once supported for a target language, this could even be made the default, so no special syntax in KSY would be necessary. At least for statically typed languages, this would break backwards compatibility, but might be a good chance for users of KS-generated code to have a look about how they use enums. Might be better than failing on runtime.

For dynamically typed languages, this wouldn't be that easy and might be a strong reason against making this the default. Not sure about a good syntax then:

seq:
  - id: protocol
    type: u1
    enum: ip_protocol
enums:
  ip_protocol:
    1: icmp
    6: tcp
    17: udp
    wrap[per]: true|false|???

false could simply be mapped to throw an exception with a default error message again.

dgelessus commented 4 years ago

@ams-tschoening Thank you for the detailed write-up! While reporting the issue I also had thoughts about supporting switching between the different behaviors, but I couldn't get my thoughts properly organized, so I left that part out of my initial comment :)

I agree that it would make sense to allow switching between the three different behaviors for handling unknown values (throw error, replace with single "unknown" value, or preserve exact value). But I'm not sure if this choice should be hardcoded into the spec (or at least not only in the spec), because different users of the same spec might want different behaviors for unknown enum values.

As an example, say you have a TLV-style format, where you have blocks of data prefixed with a type code that defines the meaning of the following data. The format defines a few type codes, and those are the only allowed codes - use of any other type codes is invalid according to the format (or perhaps indicates a newer format version). As a spec writer, the first behavior (throw errors on unknown enum values) would be the correct one to choose here. However, as a user of the spec, any of the three behaviors might be the most suitable, depending on the use case:

On the other hand, I agree that the spec writer should also have some control over the behavior (at least the ability to set a default), because depending on the format, different behaviors are more likely to be the best for most people. For example, if a format spec says "these are the allowed type codes, all others are invalid and can never occur", then it would make sense to use the first behavior (throw errors) unless the user of the spec overrides it. On the other hand, if the format spec says "these are the standardized type codes, but others are valid and must be preserved when the data is modified", then the third behavior (preserve unknown values) would be required by most users.

Although I suppose it would still be a valid solution to have the error behavior controlled only by the spec. This way, spec writers would choose the behavior that best matches the official definition of the format, and any users that need a different behavior for specialized use cases have to modify the spec. It might not be ideal for all users, but it would probably be the simplest solution to understand and implement.

dgelessus commented 4 years ago

Also, on the topic of preserving unknown enum values in Java, I had another idea that would be less user-unfriendly than the wrapper class solution. Java enums are allowed to implement interfaces, so it would be possible to represent each KS enum as an interface with two implementations: the enum, which lists all known values, and a separate class that wraps unknown values. For example, if you have this KS enum:

enums:
  compression_algorithm:
    0: uncompressed
    1: gzip
    2: bzip2
    3: xz

It might be compiled to this Java code:

Generated Java code (example) ```java public interface ICompressionAlgorithm { int id(); static ICompressionAlgorithm byId(long id) { CompressionAlgorithm known = CompressionAlgorithm.byId(id); if (known != null) { return known; } else { return UnknownCompressionAlgorithm.byId(id); } } } public enum CompressionAlgorithm implements ICompressionAlgorithm { UNCOMPRESSED(0), GZIP(1), BZIP2(2), XZ(3), ; private final long id; CompressionAlgorithm(long id) { this.id = id; } @Override public long id() { return id; } private static final Map byId = new HashMap(4); static { for (CompressionAlgorithm e : CompressionAlgorithm.values()) byId.put(e.id(), e); } public static CompressionAlgorithm byId(long id) { return byId.get(id); } } // might not even need to be public public class UnknownCompressionAlgorithm implements ICompressionAlgorithm { private final long id; private UnknownCompressionAlgorithm(long id) { this.id = id; } @Override public long id() { return id; } public static UnknownCompressionAlgorithm byId(long id) { // Optional: add memoization to avoid allocating multiple objects for the same ID. return new UnknownCompressionAlgorithm(id); } // Insert equals, hashCode, toString, etc. boilerplate here, // which need to be implemented manually, unlike in enums. } ```

Then a struct field algo with enum: compression_algorithm would be compiled to a field of type ICompressionAlgorithm whose value is created using ICompressionAlgorithm.byId(...). The field's value would be one of the CompressionAlgorithm enum constants if the value was known, or an instance of UnknownCompressionAlgorithm otherwise. Here are some examples of how this field would be used in Java:

// Check against known value
if (struct.algo() == CompressionAlgorithm.UNCOMPRESSED) {/* ... */}

// Extract raw value
System.out.println(struct.algo().id());

// (Note that the two examples above don't need any casts or type checks,
// because id() is part of the interface and works with known and unknown values.)

// Ensure that value is known
CompressionAlgorithm known_algo = (CompressionAlgorithm)struct.algo();

// Switch over known values
switch ((CompressionAlgorithm)struct.algo()) {
    case UNCOMPRESSED: // ...
    case GZIP: // ...
    // ...
}

// Switch over known or unknown values (this is not very nice sadly)
if (struct.algo() instanceof CompressionAlgorithm) {
    switch ((CompressionAlgorithm)struct.algo()) {/* like above */)}
} else {
    handleUnknown(struct.algo().id());
}

// Alternative version of the above
// (allows single else case for both "known but unhandled value" and "unknown value")
if (struct.algo() == CompressionAlgorithm.UNCOMPRESSED) {
    // ...
} else if (struct.algo() == CompressionAlgorithm.GZIP) {
    // ...
} else {
    handleUnknown(struct.algo().id());
}
ams-tschoening commented 4 years ago

[...]But I'm not sure if this choice should be hardcoded into the spec (or at least not only in the spec), because different users of the same spec might want different behaviors for unknown enum values.

Makes sense, but things a lot more complicated as well I guess.

Letting users decide sounds like something needed at runtime, which makes an approach like opaque types or custom processing routines come into my mind. Either of both would be needed to implement the lookup of decimals to enum-instances itself only, with some default behaviour according to the spec-author. So KS would need to provide a base implementation which could then be overridden by spec-users with a custom implementation somehow.

Or KS would provide some interfaces in the runtime used during resolving enums, which need to be implemented by users, registered in the runtime and are used during resolving enums. Things easily get difficult here.

I wouldn't give that any priority right now. Letting KSY-authors switch between the three different behaviours seems more important to me. The docs could simply mention that e.g. throwing an exception might not be the most compatible behaviour for users, so KSY-authors should really be sure. Especially the wrapper preserving original values would be really compatible for all use cases, so might additionally make sense to make that the default.

KOLANICH commented 4 years ago

Would be important to provide a good error message, though, containing the invalid value and not only a message about "things" being unsupported or such.

No. Instead the exception should be specific enough and contain the info in the machine-readable way. Human-readable error messages is NOT KS was designed for. Only embedders of specs know the best way they should process the errors, not specs authors, not KSC.

A wrapper tells users if a known or unknown value is available and in the latter case makes the unknown one available.

For dynamically typed languages, or object oriented languages where enums are inherited from integers it is not needed. For C++ such a wrapper can be created, even with conversion operators, allowing the objects be used as numbers, requiring no changes in the code embedding the spec.

But I'm not sure if this choice should be hardcoded into the spec (or at least not only in the spec), because different users of the same spec might want different behaviors for unknown enum values.

Both agree and disagree. Agree that it should be redefineable, finely-grained redefineable, and disagree that it shouldn't be hardcoded - it is format that define tue semantics. In TLV unknown items can usually be just skipped, as their size is known. In TV the size is defined by the enum value, so the failure to recognize the type will ruin parsing.

Wnd the compiler can derive the default mode from the code itself following the rule:

And in some cases the default hook can be generated automatically by KSC, such as an array of items, each one having a common signature, where it is impossible or extremily unlikely that the signature is present in a valid item body. Then we can just scan for the signature.

ams-tschoening commented 4 years ago

No. Instead the exception should be specific enough and contain the info in the machine-readable way.[...]

"Machine-readable way" means unnecessary format decisions about what is provided how for whom. Instead, one can easily focus on providing somewhat useful error messages telling what the problem is and for which data it occurred. A lot of implementers simply won't care about those exceptions anyway and simply let things bubble up the stack. More importantly, KS simply provides human-readable exception messages already in the validation infrastructure of the runtime.

public static class ValidationGreaterThanError extends ValidationFailedError {
    public ValidationGreaterThanError(byte[] expected, byte[] actual, KaitaiStream io, String srcPath) {
        super("not in range, max " + byteArrayToHex(expected) + ", but got " + byteArrayToHex(actual), io, srcPath);
    }

That error message isn't very machine-readable as well, but totally fits its purpose: It's easy to understand for humans and if someone wants to parse it, it's not too difficult as well. So there's absolutely no need to raise the bar for no real benefit here and make things unnecessary difficult.

For dynamically typed languages, or object oriented languages where enums are inherited from integers it is not needed.[...]

But simply not all target languages behave that way and not all use cases are compatible with that. I'm using Java mostly and have a lot of use cases in which I'm forwarding the NAME of an enum into some config file to get compared with other names. This improves maintenance and documentation of the config files and simply doesn't work anymore if decimals instead of names occur, because the interface is accepting strings only currently. Therefore I often prefer some fallback like UNKNOWN in favour of a concrete value in those cases. Without such fallback, a wrapper would make clear on compile time already that I need to decide how to deal with unknown values, if e.g. simply converting decimals to strings as well or throw exceptions, or map manually to UNSUPPORTED or whatever.

It totally makes sense for KS to provide an own concept across all supported target languages. Each target language could easily add custom magic as needed in future to make life for those users easier, but the point is that such magic isn't available in all target languages anyway.

[...]In TLV unknown items can usually be just skipped, as their size is known. In TV the size is defined by the enum value, so the failure to recognize the type will ruin parsing.

Keep in mind that sometimes enums are really only used to provide names to decimals and don't influence further parsing in any way. I do this a lot to make some data structure easier to understand. In those cases there won't ever be any parsing failure, it's only that users of specs sometimes run into NullPointerException or alike when trying to access names of unknown values. Those use cases need to be handled explicitly by the KSY-author anyway, so it makes sense to start with explicit things and make the compiler smarter as necessary in future.

KOLANICH commented 4 years ago

Keep in mind that sometimes enums are really only used to provide names to decimals and don't influence further parsing in any way. I do this a lot to make some data structure easier to understand.

Absolutely. And while reverse-engineering or debugging, we clearly should not fail on unknown enum values, if they don't affect parsing. And in production we usually shouldn't too, because these values may just mean unimplemented extensions which can be safely (for the purpose of parsing) skipped.

it's only that users of specs sometimes run into NullPointerException

I completely agree that in these cases when a num doesn't match the enum, it must be replaced by the num rather the breaking behavior.

it makes sense to start with explicit things and make the compiler smarter as necessary in future.

Completely true, without explicit things we cannot properly test implicit things that must result into the same code.

More importantly, KS simply provides human-readable exception messages already in the validation infrastructure of the runtime.

Instead it should provide them machine-readable. All the exceptions must be macuine-readable. Exceptions can be used for direct consumption of their string representation by a user in the case of program failure, but they must be machine-readable because their primary purpose is to be caught and processed.

But simply not all target languages behave that way

I was speaking about the ones behaving the needed way. And I forgot to add - the langs with runtime type identification. So they don't need external bits as we will have to use in C++.

BTW, for C++ IMHO we SHOULDN'T verify the enums at their parsing. It may carry a large overhead, depending on the structure of the enum, and in the most of cases is completely unnneeded. Let it be just a number (in a enum class sense) or just a struct incorporating a number field with a pair of additional methods. Verifying is already done on switch-ons using them. But we probably should generate a table of static strings for human-friendly output.

The problems are langs verifying them. Should we introduce own classes for the enums, like it is proposed to do in C++ for all of them, including python, allowing to avoid verification where it is unnneded?

dgelessus commented 4 years ago

Letting users decide sounds like something needed at runtime

To clarify: with "users" I meant "users of the spec", i. e. developers who want to compile and use an existing spec (from KSF or another repo) that they aren't developing themselves. My thought was to maybe add a compiler flag for overriding the enum error behavior, to allow selecting a different behavior if the one chosen by the spec author doesn't fit your use case. But perhaps that really isn't worth the added complexity and we should just ask people to modify the spec if they need a different error behavior.

the exception should be specific enough and contain the info in the machine-readable way

I would argue that if you need machine-readable error information, you should be using the behavior that preserves invalid values. This gives the code that uses the spec much more context than machine-readable information on an exception.

For dynamically typed languages, or object oriented languages where enums are inherited from integers it is not needed.[...]

But simply not all target languages behave that way and not all use cases are compatible with that.

I'm not sure what the point is here? Different target languages can have different implementations of the same feature. So where Java would require a wrapper class or interface, Python could instead directly store either the enum value or raw integer, and C++ would simply cast the raw value to the enum type without checking (perhaps with an extra generated function to allow checking if a value is actually valid for the enum, which can be manually called by user code if it needs to know).

ams-tschoening commented 4 years ago

To clarify: with "users" I meant "users of the spec", i. e. developers who want to compile and use an existing spec (from KSF or another repo) that they aren't developing themselves. My thought was to maybe add a compiler flag for overriding the enum error behavior[...]

That's how I understood you already, but a compiler flag 1. adds complexity as well and 2. isn't even sufficient in theory: Not all users compile from KSY-files themself, I e.g. have customers to which I deploy KSY-based C# and Ruby. While I could provide them KSY-files in theory, they don't want to take care about compiling those and to be honest, my company doesn't want to provide them the KSY-source for business reasons currently as well.

So one would need something available on compilation results as well or we simply don't care too much currently. I prefer the latter, the mentioned 3 choices most likely fit for 9X% of the use cases already.

But simply not all target languages behave that way and not all use cases are compatible with that.

I'm not sure what the point is here? Different target languages can have different implementations of the same feature.[...]

My argument is to really provide a KS-specific wrapper for all ENUMs if KSY-authors choose so, pretty much like KaitaiStream provides a common interface for all target languages. That assures that the same methods AND conceptually the same behaviour is available to all target languages, simply because that is what the KSY-author has chosen. That author WANTS that users need to explicitly decide how to handle unknown values, otherwise the wrapper wouldn't be used.

One and the same ENUM might be used in different places, where it sometimes is OK to map to some UNKNOWN, while at other places it's necessary to throw exceptions and at third places one might easily forward unknown decimals as-is. With the wrapper available, one can decide at each place individually. This would be more difficult and less reliable if too much magic is involved, especially if that magic differs between target languages. Reliable and consistent behaviour across languages is what this issue is about in the end. :-)

KOLANICH commented 4 years ago

you should be using the behavior that preserves invalid values. This gives the code that uses the spec much more context than machine-readable information on an exception.

Of course the original values should be preserved. They are always preserved in the cases they are valid, we should also preserve them for the runtimes cleaning them for the cases they are invalid. The exception should not carry a preformatted error message, but should carry the necessary info to be able to format it. Not Exception("1 is invalid for enum A in member b, must be in range [2, 8]"), but RangedEnumException("b", "A", structRef, 2, 8) if the enum structure is sequential (the compiler has to detect enum structure in order to generate optimal code, i.e. sequential enum will mean a boundary check in switch-on and a look-up table instead of else if ladders). Generatinga human-readabpe text is not KS business because the text is very app-specific and context-specific and may even involve lookup in translation databases.

I'm not sure what the point is here? Different target languages can have different implementations of the same feature. So where Java would require a wrapper class or interface, Python could instead directly store either the enum value or raw integer, and C++ would simply cast the raw value to the enum type without checking (perhaps with an extra generated function to allow checking if a value is actually valid for the enum, which can be manually called by user code if it needs to know).

It is proposed to not to use the native enums directly, but to create a wrapper class for them, if it is needed. The wrapper class does the following:

  1. when initialized, it just stores the number, without any checking
  2. has means to cast that number to a native enum, with checking, or without. It should be possible to do it implicitly, if it is possible.
  3. has means to check validity of the enum value.
  4. has means to generate a string representation. Likely just a data structure lookup, when the data structure is precomputed.
dgelessus commented 4 years ago

My argument is to really provide a KS-specific wrapper for all ENUMs if KSY-authors choose so, pretty much like KaitaiStream provides a common interface for all target languages. That assures that the same methods AND conceptually the same behaviour is available to all target languages, simply because that is what the KSY-author has chosen.

Yes, KS should have the same functionality and parsing behavior in all languages, but that doesn't mean that the interface should be identical across all languages - it should still follow the conventions of the target language.

Python is dynamically typed, so I think it's completely okay (and perhaps even expected) to have a field that can contain two possible types of values, and the developer is allowed to assume one or the other without explicitly checking/casting/unwrapping the value first. KS already behaves similarly in other cases, for example with switch-on types - in Python you can access the fields of one specific case without first having to check that that case was actually taken.

That author WANTS that users need to explicitly decide how to handle unknown values, otherwise the wrapper wouldn't be used.

I don't think that's true in general - just because a spec allows unknown values in an enum field doesn't mean that all code that uses that spec needs to care about those values and should be forced to explicitly handle them. Someone who is writing a quick-and-dirty script that only needs to handle a limited subset of the format, or even working interactively in the REPL, might not care about handling (or explicitly not handling) every edge case. If you want to be really sure that all possible cases are handled correctly, then you should be using a statically typed language that has such checks built into the language, and not Python.

TwoClocks commented 2 years ago

This has been sitting for a while. Has there been any progress on this?

My two cents:

I think it should be part of the .KSY spec if an unknown value for an enum is "ok" or an error. There are valid reasons for both, and whoever is writing the spec should know, and the difference should be propagated to the runtime code.

Examples: Not knowing the protocol # in the IP header is ok. There is a payload you don't understand, but you have enough info to move to the next packet to continue parsing.

But there are protocols where identifying the type is the only way you know how to process the following data. You can not "skip". That is a catastrophic error.

Same w/ just simple enums that don't have any effect on parsing. You have protocols like the automotive & marine "CAN bus" family, where manufacturers are always adding new enum value for their specific hardware. It's not an error to not understand them. This is really common when you only have a reversed engineered protocol.

But for financial transactions, parsing an order that has a known side (e.g. buy/sell), that's a catastrophic error. The run-time should have to do work to choose to continue, the default should be to fail.

I don't have an option about how to encode it in the .KSY file, but I think it needs to be there, and the concepts of "unknow ok" and "unknown error" enums need to be understood by the runtime.

KOLANICH commented 2 years ago

I don't have an option about how to encode it in the .KSY file

valid: enum

but it is not yet implemented and authors currently have to repeat oneself by using valid: any-of

TwoClocks commented 2 years ago

valid is not mentioned anywhere in the documentation, and not much turns up on a google search either. Where can I find an explanation of the meaning, and possibly an example?

KOLANICH commented 2 years ago

435 and #775 is the spec for it

https://github.com/kaitai-io/kaitai_struct_formats/pull/544/files#diff-c0b662ed38950bb3eec0b67908497abc5733dc69a5f261ce7687705c96c8ccd9R33-R36 using valid: value

Just usage of any-of: https://github.com/kaitai-io/kaitai_struct_formats/pull/557/files#diff-c3379de4c726c490909b58485b9610a2bc23bfb04e2aa673e3ad080e4969c545R29-R34

I remember there was an example using any-of on enum values, but I couldn't find it easily and was not desperate enough to write a script doing it via gh api.

generalmimon commented 2 years ago

I remember there was an example using any-of on enum values

@armijnhemel uses them extensively in his https://github.com/armijnhemel/binaryanalysis-ng repo, just search for any-of. For example:

src/parsers/filesystem/qcow2/qcow2.ksy:26-32

  - id: crypt_method
    type: u4
    enum: crypt_methods
    valid:
      any-of:
        - crypt_methods::no_encryption
        - crypt_methods::aes