kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Add built-in process `hex` and `base64` #668

Open Mingun opened 4 years ago

Mingun commented 4 years ago

These are two widely used encoding schemes, so it would be great if Kaitai had built-in primitives for them.

GreyCat commented 4 years ago

We're not adding any more built-in processes, given that we have pluggable modules now. Instead, we'll have a series of libraries providing these widely used procedures. See https://github.com/kaitai-io/kaitai_compress, for example, for popular compression algorithms.

Would you consider contributing something like that, but for hex and base64?

Mingun commented 4 years ago

OK, that is an appropriate solution (though once it is implemented, it would be good to have these available in the WebIDE).

Are there any recommendations on how to create a processor for all supported languages, and on how end users should get these algorithms into their applications?

GreyCat commented 4 years ago

though once it is implemented, it would be good to have these available in the WebIDE

Yep, that's the plan: all these "common" libraries will be automatically available in the WebIDE, together with all their dependencies.

Are there any recommendations on how to create a processor for all supported languages

Custom processors are documented at http://doc.kaitai.io/user_guide.html#custom-process. Per-language specifics are supposed to be covered in the per-language notes at https://doc.kaitai.io, but in reality we're lagging behind on those documentation updates. Your best bet would probably be to copy the existing layout of kaitai_compress and start something like a "kaitai_common" or "kaitai_misc" collection of algorithms.

and on how end users should get these algorithms into their applications?

Installation is obviously language-dependent and is outlined in the Usage section of kaitai_compress.
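
For illustration, here is a minimal sketch of such a custom processor in Python, following the decode-method convention described in the custom-process section of the user guide. The module and class names are hypothetical, and the exact naming conventions vary per target language:

  # hex_decoder.py -- hypothetical module; a .ksy spec would reference it
  # roughly as `process: hex_decoder` (see the custom-process docs for the
  # per-language module/class naming rules).
  import binascii

  class HexDecoder:
      """Decodes an ASCII hex dump such as b"deadbeef" into raw bytes."""

      def decode(self, data):
          # `data` is the raw byte slice the generated parser carved out
          # according to `size` / `size-eos` / `terminator`.
          return binascii.unhexlify(data)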

KOLANICH commented 4 years ago

process works on raw bytes. Hex- and base64-encoded values are strings, and those strings may be UTF-32BE, UTF-16BE, UTF-32LE, and so on. So I guess process is a bit unsuitable here.

GreyCat commented 4 years ago

Makes sense, but in reality 100% of the hex dumps I've seen so far were in ASCII. I can imagine a hex dump in UTF-16, but we might just introduce a special parameter for that in the processing routine, or maybe a dedicated routine for these purposes.

Even from the performance side, it doesn't make much sense to do a "real" conversion of that data to strings first and then do string-to-integer conversions.
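
To illustrate the point (a plain-Python sketch, not taken from any runtime): an ASCII hex dump can be decoded straight from the raw bytes, without ever materializing an intermediate string.

  import binascii

  raw = b"4465616462656566"           # 16 ASCII bytes as read from the stream

  # Bytes-level route: no string object is ever created.
  decoded = binascii.unhexlify(raw)   # -> b"Deadbeef"

  # String route: decode to str first, then parse the hex -- an extra pass
  # and an extra allocation for the same result.
  decoded_via_str = bytes.fromhex(raw.decode("ascii"))

  assert decoded == decoded_via_str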

KOLANICH commented 4 years ago

Even from the performance side, it doesn't make much sense to do a "real" conversion of that data to strings first and then do string-to-integer conversions.

From the performance side, decoding a byte sequence of known length into an ASCII/UTF-8 string should be an O(1) operation (it is just reinterpreting raw memory). If that is not the case, it is definitely a bug in the language.

but we might just introduce special parameter for that in processing routine

It is conceptually wrong. We have strings, and we have encodings for them. So we probably need not processors but support for externally defined functions (and we definitely should have interfaces for those, because we want to validate things at transpile time).

Or external opaque types could be used for that. Interfaces here are not just needed but mandatory, because properties are involved.

Mingun commented 4 years ago

Hex- and base64-encoded values are strings.

Not exactly. By definition, these conversions turn any byte sequence into a 7-bit byte sequence (i.e. an ASCII-encoded string) that can be safely transferred through some old protocols. They are represented as strings only for stupid humans (glory to robots!).

However, it would be possible to solve this problem if we represented those byte sequences as strings in the ksy with a defined hex or base64 encoding, in the same way as we represent strings with ASCII or UTF-8 encoding (by the way, which encodings are guaranteed to be supported by any kaitai-struct runtime?).

GreyCat commented 4 years ago

which encodings are guaranteed to be supported

See #116 and #393.

dgelessus commented 4 years ago

process works on raw bytes.

Any reason not to support process for strings?

The performance of the bytes-to-string conversion is unlikely to be an issue for ASCII - any decent language has optimizations for that common case (I know at least Java and Python do).

Conceptually I think hex/base64-encoded data should count as text strings. Hex is usually used to store arbitrary binary data in a format that can be read by humans (i. e. text), and nowadays base64 is almost exclusively used to convert arbitrary binary data to printable, ASCII-compatible text.

(Yes, base64 was originally developed to transfer 8-bit data over channels that might only be 7-bit and could clobber the 8th bit, but if you're parsing that kind of data you probably need to strip the 8th bit beforehand anyway.)
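
As a concrete illustration of that modern usage (plain Python, nothing Kaitai-specific):

  import base64

  payload = bytes([0x00, 0xFF, 0x10, 0x80])   # arbitrary binary data
  encoded = base64.b64encode(payload)         # b"AP8QgA==" -- printable ASCII
  assert base64.b64decode(encoded) == payload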

KOLANICH commented 4 years ago

Any reason not to support process for strings?

Because process, by definition, works before any parsing of the field is done. The generated code:

  1. carves the field
  2. processes it
  3. parses the processing result

It is a bytes-level operation.
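
Roughly, the generated Python code boils down to the following simplified sketch (the function name is illustrative, and zlib stands in for any built-in or custom processor):

  from io import BytesIO
  import zlib

  from kaitaistruct import KaitaiStream

  def read_processed_attr(io: KaitaiStream, length: int) -> KaitaiStream:
      raw = io.read_bytes(length)            # 1. carve the field
      data = zlib.decompress(raw)            # 2. process it (bytes -> bytes)
      return KaitaiStream(BytesIO(data))     # 3. the declared subtype is then
                                             #    parsed from this substream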

dgelessus commented 4 years ago

Good point; you still need to be able to use a regular byte process on string fields.

Perhaps the hex/base64 decoding should be done using string methods instead (i.e. something like string_field.decode_hex, which returns a byte array). There should be no need for a separate attribute ("process-str") here; a method call in a value instance would work just as well.

KOLANICH commented 4 years ago

Making it a method would require it to be part of every runtime. It would be better to make it a separate auxiliary package, so IMHO it is better to have it as just a function.
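
For example (hypothetical helper and field names), such a free function shipped in an auxiliary package would be called from application code after parsing:

  import binascii

  def decode_hex_field(value):
      # A free function from an auxiliary package, as opposed to a method
      # baked into every runtime; `value` is the hex string/bytes that a
      # generated parser produced for some field.
      return binascii.unhexlify(value)

  # Hypothetical usage: obj = MySpec.from_file("dump.bin")
  # mac = decode_hex_field(obj.mac)   # "cafebabe" -> b"\xca\xfe\xba\xbe"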

GreyCat commented 4 years ago

"Function" is actually the worst possible choice for such stuff — it's imperative, you basically show how to do transformation one way and it's very untrivial to do it the other way around. Things like process make it much more declarative:

Mingun commented 4 years ago

I think we could add another process phase. Right now the situation is that process really ought to be called pre-process. So we would just need to add post-process, which would transform the parsed result into its final form.

Then we could write:

  - id: mac
    doc: Message Authentication Code (hex-encoded)
    size: 8
    post-process: hex
    expect: _.size == 4 # like `valid` from #435, though IMHO this name is better

This means: read 8 bytes, then apply the hex transformation (which, by convention, actually applies the unhex transformation, i.e. hex to bytes). Finally, assert that the size of the resulting array is 4 bytes, as it should be, just as a sanity check.
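
A plain-Python sketch of the semantics of that (proposed, not currently implemented) example:

  import binascii

  raw = b"0a1b2c3d"                 # the 8 bytes read by `size: 8`
  mac = binascii.unhexlify(raw)     # the proposed `post-process: hex` step
  assert len(mac) == 4              # the `expect: _.size == 4` check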

KOLANICH commented 4 years ago

instances are already present.

Mingun commented 4 years ago

Yes. Actually, in the case of hex and base64, even post-process is not required, because:



What do you think: can these algorithms be added to https://github.com/kaitai-io/kaitai_compress (and maybe rename it to the more generic kaitai_algorithms), or would it be better to implement them in a separate repository?

KOLANICH commented 4 years ago

can these algorithms be added to kaitai_compress, or would it be better to implement them in a separate repository?

I'm personally pretty sure it will never be merged that way. IMHO we don't need hex and base64 in the decoders; we do need them, but at other layers. These other layers are custom types. So feel free to create a repo of custom types with processors that cannot be implemented in KS alone. Also take a look at my PRs to KSF; they contain code for some such types.

and maybe rename it to the more generic kaitai_algorithms

I have thought about renaming the kaitai_compress repo to kaitai_processors (and I have my own extended and refactored fork of that repo, not yet merged), but we strictly need interfaces (#314) first because of serialization.