kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Any port for plain C? #263

Open Zorgatone opened 6 years ago

Zorgatone commented 6 years ago

Hi, I would like to know if you would consider (or have any plans already) to port the project for use with "plain" C (in addition to C++ and C#). I would use it, and not all systems (embedded ones, maybe?) support C++ and/or C#. Having a C version would enable portability to virtually any system, and even to more languages via C bindings.

GreyCat commented 6 years ago

You're completely correct, a C port has been heavily discussed since almost the very beginning of the project, yet nobody ever created an issue about it (and that's bad, because it's hard to collect all these discussions in one place).

There are/were several major issues with a C target, though. This became a somewhat lengthy review of what's been discussed over the years, but I believe I've remembered most of the points and tried to order them from most serious to least serious.

Completely different workflow in mind

It turns out that most people who need C support in KS have a completely different workflow in mind than what KS provides now. Right now, KS does a very simple thing: it takes a binary format serialization spec and generates an API around it. It usually does zero transformations, except for very simple and technical ones (i.e. endianness and that kind of stuff) — whatever's in the format is reflected exactly as is in memory. C people usually strive for performance and memory efficiency and would prefer not to save stuff that can be used right away and then just thrown out.

A very simple example:

seq:
  - id: len_foo
    type: u2
  - id: foo
    size: len_foo
    type: str

This is usually ok for many modern languages, but a lot of people who wanted a C target automatically suggest that:

- len_foo must not be stored in the structures that KS generates in memory at all — it must be used once during the parsing and then just thrown away
- Given that we're talking about "string" data type, why not convert it into "pure C string", as most C stdlib functions expect it to be — i.e. no length information, just a zero byte termination
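For illustration, here is a hypothetical contrast between the two expectations (neither struct is something KSC actually generates today; the names are made up):

#include <stdint.h>

struct foo_ks_style {
  uint16_t len_foo; /* kept in memory, exactly as it appears in the format */
  char* foo;
};

struct foo_c_style {
  char* foo; /* len_foo was consumed during parsing and then thrown away */
};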

A more complex (and real-life) example is the typical parsing of any network packet, for example, a udp_datagram. The typical current vision of what KS might create is something like this:

typedef struct udp_datagram_t {
  uint16_t src_port;
  uint16_t dst_port;
  uint16_t length;
  uint16_t checksum;
  char* body;
} udp_datagram_t;

udp_datagram_t* read_udp_datagram(kaitai_stream* io) {
  udp_datagram_t* r = (udp_datagram_t*) malloc(sizeof(udp_datagram_t));

  r->src_port = read_u2be(io);
  r->dst_port = read_u2be(io);
  r->length = read_u2be(io);
  r->checksum = read_u2be(io);
  r->body = read_byte_eos(io);

  return r;
}

It turns out that many users would be more comfortable with a completely different mechanism than "the read function just fills in some structures in memory and returns a pointer to them":

void read_udp_datagram(kaitai_stream* io, udp_datagram_callbacks* callbacks) {
  uint16_t src_port = read_u2be(io);
  if (io->status != OK) {
    callbacks->on_error(io->status);
    return;
  }
  callbacks->on_read_src_port(src_port);

  // ...
}

Not an "everything is an expression" language

Simply put, almost every target we had before supported the "every KS expression translates into a target language expression" idiom. That is, if you need to do string concatenation, i.e.

seq:
  - id: a
    type: strz
  - id: b
    type: strz
instances:
  c:
    value: a + b

... you do that a + b in one single-line expression everywhere. Even C++ allowed us to get away with a + b using std::string. In C, however, it traditionally boils down to many lines and temporary variables:

// Real-life code would be even more complex, probably with more checks, etc.
size_t len_a = strlen(a);
size_t len_b = strlen(b);
char *tmp = (char *) malloc(len_a + len_b + 1);
memcpy(tmp, a, len_a);
memcpy(tmp + len_a, b, len_b);
tmp[len_a + len_b] = 0;

This issue, however, was more or less solved with the advent of #146.

Complex memory management

What's not solved, however, is that such arbitrary allocations of temporary variables sometimes result in more complex memory management and a need for additional manual cleanup. In the example above, tmp would likely be used directly as the value of c, and thus there's no need to store it additionally. However, if multiple operations occur, we'll either need to store these intermediate values, or use some clever logic to reuse these temporary buffers (and/or avoid extra copying), or clean them up right after they're no longer needed (i.e. earlier than in the object's destructor).

Actually, even "allocate everything on the heap" is not universally agreed upon in many C apps. So, typical parsing of a user-defined type like this:

udp_datagram_t* r = (udp_datagram_t*) malloc(sizeof(udp_datagram_t));

might be suggested to be replaced with passing a ready-made pointer to a structure to fill into these read_* functions, creating that udp_datagram_t on the caller's stack instead.
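As a sketch (not actual KSC output), the caller-allocated variant could look like this; it reuses the udp_datagram_t and kaitai_stream types from the example above, the two-argument, error-code read_* signatures are the same assumption as in the "Exception support" section below, and read_bytes_eos is a hypothetical variant of read_byte_eos:

int read_udp_datagram_into(kaitai_stream* io, udp_datagram_t* r) {
  int err;
  if ((err = read_u2be(io, &r->src_port)) != 0) return err;
  if ((err = read_u2be(io, &r->dst_port)) != 0) return err;
  if ((err = read_u2be(io, &r->length)) != 0) return err;
  if ((err = read_u2be(io, &r->checksum)) != 0) return err;
  return read_bytes_eos(io, &r->body);
}

void example(kaitai_stream* io) {
  udp_datagram_t dgram; /* lives on the caller's stack, no malloc involved */
  if (read_udp_datagram_into(io, &dgram) != 0) {
    /* handle error */
  }
}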

No single standard library

For KS, we need some basic stuff like:

typedef struct {
    int len;
    void* data;
} byte_array;

There are tons of "enhanced standard" libraries that do that, but there's no universal agreement on which one to use. Probably roughly 80% of C applications roll something homebrew like that inside them. Out of "standard" implementations, there is glib, klib, libmowgli, libulz, tons of lesser-known libraries, a huge assortment of string-related libs, array-related libs, etc. Of them, glib is probably the most well-known and well-maintained, but even a suggestion to use that frequently encounters huge resistance from many C developers.

Another possible way (albeit not too well-received by many developers) is to roll our own (yet another) implementation of all that stuff, and deal with ks_string*, ks_bytes*, ks_array*, etc., instead of char*, whatever_t[], etc.
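As a purely illustrative sketch, such homebrew types might look roughly like this (ks_bytes and ks_string are just the hypothetical names mentioned above, not an existing API):

#include <stddef.h>
#include <stdint.h>

typedef struct {
  size_t len;
  uint8_t* data;
} ks_bytes;

typedef struct {
  size_t len;  /* byte length; the data may contain embedded zeros */
  char* data;  /* kept NUL-terminated anyway, for convenience with libc */
} ks_string;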

There's no simple solution here, and whatever we choose probably won't be accepted by many C developers. Perhaps implementing support for the top 3 (or top 5) popular libs would cover at least some popular options.

Exception support

As we all know, C does not have any standard exception support, and typical KS-generated code relies on exceptions a lot, i.e.:

  r->src_port = read_u2be(io);
  r->dst_port = read_u2be(io);
  r->length = read_u2be(io);
  // ...

On every step, read_u2be might encounter end of stream (or an IO error), and it won't be able to successfully parse yet another 2 bytes. The typical solution for that in C is using return codes and passing the value-to-fill by reference, i.e.:

int err;

err = read_u2be(io, &(r->src_port));
if (err != 0)
  return err;

err = read_u2be(io, &(r->dst_port));
if (err != 0)
  return err;

// ...

Since the introduction of Go support (#146), that became possible, although it would probably still be a pain-in-the-ass to use in C :(

Another quick "solution" for C is to use signals/aborts to handle these erroneous situations. In fact, it would even be ok in many use cases like embedded stuff, because things are not usually supposed to blow up there, and if they do, then everything is lost already: there are no graceful exits, user interactions, "Send error report to the vendor" dialogs, etc.
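A rough sketch of that "just abort on error" strategy (the kaitai_stream type and the error-code read_u2be are carried over from the earlier examples in this comment and declared here only so the snippet stands alone; ks_fail is a made-up name):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct kaitai_stream kaitai_stream; /* opaque, as in the examples above */
int read_u2be(kaitai_stream* io, uint16_t* out); /* returns 0 on success */

static void ks_fail(const char* msg) {
  fprintf(stderr, "kaitai: %s\n", msg);
  abort(); /* no graceful exit; acceptable in some embedded scenarios */
}

static uint16_t read_u2be_or_die(kaitai_stream* io) {
  uint16_t v;
  if (read_u2be(io, &v) != 0)
    ks_fail("unexpected end of stream");
  return v;
}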

Stream abstraction

Relatively minor and solvable issue, but still an issue: what would the concept of a "KS stream" be in C? Two popular options:

- a thin wrapper around the standard FILE* API (possibly relying on its buffering)
- an in-memory buffer, i.e. a pointer + length, filled manually or mmaped beforehand

The C runtime would probably need to implement all these options and allow the end user to choose. Nothing too scary, but still an issue to be solved.
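For illustration, one possible shape for such a stream (purely a sketch, not an agreed design) would hold both backends behind a single struct, chosen at init time:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
  FILE* file;         /* non-NULL when backed by a FILE* */
  const uint8_t* mem; /* non-NULL when backed by an in-memory buffer */
  size_t size;
  size_t pos;
  int status;         /* last error code, 0 = OK */
} kaitai_stream;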

GreyCat commented 6 years ago

And, to answer these:

Having a C version would enable portability on any system

Well, I won't be that optimistic. Given all the stuff above, chances are tons of C people would still opt to roll things manually because of all these compromises and the "does not exactly fit my workflow" argument.

and even more languages with C bindings

It probably won't be that easy :( The KS C runtime is likely to be easier to rewrite in another language than to go through all that binding hassle, and then you'd have to write that "binding" glue code for every particular type ported.

Zorgatone commented 6 years ago

Hi, thanks for the lengthy and detailed answer. I'm glad to hear that some discussion about C has already taken place and been considered. For the "string" argument I would go for "standard C" zero-terminated strings. Other "strings" that contain zeros in them I would treat as binary data of a given length. For the libraries to use (many of which would encounter resistance) I'd go for a custom implementation. That could take long to make but shouldn't be too hard to do (let me know if you want some help, I would be happy to do so).

For exception support, what about CException? See link. Otherwise we could do something like C11's bounds-checked string functions and return errno_t.

For the KS stream, either of the two solutions would be ok. If I remember correctly you can set/enable the default buffering/buffer of FILE*. Otherwise allocate everything manually in memory and release it later.

About the "workflow" argument, everyone will always decide on their own what library to use or what to do with their own code (even doing all custom handling), so I wouldn't think too much about that.

For the "C bindings", it would be good for languages not yet implemented that can use the C bindings easily.

I think a good solution would be to have a kslib_init() and kslib_free() or something similar, if the library needs to initialize and allocate/release its own resources. Even if it looks ugly or you have to save and pass around an extra argument to the library's functions. Still better than nothing.
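A sketch of that init/free idea (kslib_init, kslib_free and ks_ctx are hypothetical names, nothing of the sort exists in KS today):

#include <stdlib.h>

typedef struct {
  void* scratch; /* whatever library-wide resources are needed, e.g. scratch buffers */
} ks_ctx;

ks_ctx* kslib_init(void) {
  return calloc(1, sizeof(ks_ctx)); /* allocate the library's own resources */
}

void kslib_free(ks_ctx* ctx) {
  if (!ctx) return;
  free(ctx->scratch);
  free(ctx);
}

/* Generated functions would then carry the extra argument, e.g.:
 *   udp_datagram_t* read_udp_datagram(ks_ctx* ctx, kaitai_stream* io);
 */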

I believe it would be "uglier" to have to make C functions "wrapped around" C++ API calls, or even worse, not to be able to compile on some systems, or to have to implement everything (without this library) manually every time.

I like the project (even if I haven't had the chance to play around with it yet) and, if I have some extra time, I'd really like to give a hand and help to make a C port (even if it would be a side-project with some differences)

GreyCat commented 6 years ago

@Zorgatone Ok, for a start, I would suggest really playing around with KS and seeing what it does and what it does not. Maybe you'll decide that it won't meet your expectations anyway?..

For Exception support what about CException? See link

The link just says "Non-Image content-type returned" for me :( If you mean something like this — https://github.com/ThrowTheSwitch/CException — at the very least, that's +1 extra library of dependencies, and in the C world every extra library is usually a major hassle. But maybe that could be done too.

I'd really like to give a hand and help to make a C port

You've probably seen http://doc.kaitai.io/new_language.html — right now we're somewhere in between stages (2) and (3). Of all the issues that I've outlined, this "totally different workflow expected" one is definitely the most serious. I'm not too keen on doing lots of work that almost nobody would want to use.

Zorgatone commented 6 years ago

Understandable, thanks for the reply. I was planning to do some testing with KS in the near future, maybe I will try and make my own library in C if I think I'll need it :)

PS: thanks for the link, it's a good starting point

KOLANICH commented 6 years ago

len_foo must not be stored in the structures that KS generates in memory at all — it must be used once during the parsing and then just thrown away

I don't use C, I use C++, and IMHO the preferred approach is not to store the info in a standalone structure, but to decompose the thing into a set of fixed-size (or variable-size, if the language supports it) dumb structures and lay them over raw virtual memory. #65

Given that we're talking about "string" data type, why not convert it into "pure C string", as most C stdlib functions expect it to be — i.e. no length information, just a zero byte termination.

for the strz type, just pass a pointer to that memory. There is an issue with non-zero-byte terminators though.

Complex memory management

IMHO we should just use C++ for that. C coders can write in C++ in C-style if they want.

GreyCat commented 6 years ago

I'll just leave it here, just in case: https://matt.sh/howto-c

This link was heavily suggested by several modern C proponents with whom I've discussed KS support for C. Suggestions of modern C style guides are also most welcome. The only one that I know is the Linux kernel coding style guide — this is my personal preference for C as well, but chances are that there are other popular style guides in other areas?

Zorgatone commented 6 years ago

@GreyCat nice link! Useful to know. But still, not all compilers support all the C11 features, unfortunately. At least it should be good to use C99, especially for the stdint.h int types (I really didn't know about the fast and least ints! I knew about the fixed-size ones, though).

KOLANICH commented 6 years ago

Most of the things from that are also valid for C++.

Zorgatone commented 6 years ago

I'm also linking another article with criticism of Matt's "how to c in 2016" article, to consider other opinions as well: https://github.com/Keith-S-Thompson/how-to-c-response

arekbulski commented 6 years ago

For C strings, I would recommend that one KSY field end up adding a few fields to the resulting struct, with similar names and different types. For example:

  r->text_array = read_array(io, 10);
  r->text_str = byte_array_to_str(&r->text_array);

This does not consume more memory (only a constant amount), as the char pointer points to the same data as the array. The end user might want some glib arrays, or char*, so why not give them both?

GreyCat commented 6 years ago

@arekbulski Giving them both is probably a bad idea: it will require a dependency on glib, and would add extra unneeded bloat for both parties. Besides, char* strings are just not enough anyway: you need to be able to do .length on that, and you just can't do that with a char* string.

arekbulski commented 6 years ago

Another possible way (albeit not too well-received by many developers) is to roll our own (yet another) implementation of all that stuff, and deal with ks_string*, ks_bytes*, ks_array*, etc., instead of char*, whatever_t[], etc.

You suggested using our own types, and they could provide convenience functions for transforming ks_arrays to glib byte arrays and other types. Hm? glib would be supported, not required.

arekbulski commented 6 years ago

@GreyCat I would be willing to start implementing the C runtime. If you approve, I would outline the runtime file first (the types and methods for byte arrays etc.), and if that meets your standards, we (you) would update the compiler/translator to support the runtime, and I would implement the meat in the runtime. What do you think?

smaximov commented 6 years ago

Besides, char* strings are just not enough anyway: you need to be able to do .length on that, and you just can't do that with char* string

@GreyCat, you may consider rolling your own string implementation which uses the same technique as sds. This will make Kaitai strings compatible with most functions accepting char* (unless a Kaitai string contains an extra zero byte in addition to the terminating one).
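For reference, the core of that sds-style technique is roughly the following (a minimal sketch; ks_str_new, ks_str_len and ks_str_free are hypothetical names, not the actual sds or KS API): the length lives in a header placed immediately before the character data, and the user-facing pointer points at the data itself, so it stays usable as a plain char*.

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  size_t len;
  char data[]; /* flexible array member, C99 */
} ks_str_hdr;

char* ks_str_new(const char* src, size_t len) {
  ks_str_hdr* h = malloc(sizeof(ks_str_hdr) + len + 1);
  if (!h) return NULL;
  h->len = len;
  memcpy(h->data, src, len);
  h->data[len] = '\0'; /* still a valid C string */
  return h->data;      /* caller sees a plain char* */
}

size_t ks_str_len(const char* s) {
  const ks_str_hdr* h = (const ks_str_hdr*)(s - offsetof(ks_str_hdr, data));
  return h->len;
}

void ks_str_free(char* s) {
  free(s - offsetof(ks_str_hdr, data)); /* free the header, not the data pointer */
}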

GreyCat commented 6 years ago

@arekbulski Sure, go ahead :) I'm not sure you've seen it, we also have this article with an overall new language support plan and an implementation graph like this one.

GreyCat commented 6 years ago

@smaximov Yeah, that's probably how it should be done for "roll your own" implementation.

arekbulski commented 6 years ago

I have sweet-sour feelings about SDS. I really like the idea, I really do, but the implementation is horrible. The repo you linked has bug reports and bugfixes going back 4 years and still hanging. They also implemented a variable-length prefix (the count field), which makes it bananas. We can implement our own SDS; I do not recommend using theirs.

Big thanks for sharing this with us, @smaximov !

jonahharris commented 6 years ago

Is anyone working on this, even as a prototype?

GreyCat commented 6 years ago

Not really. Personally, I would probably return to this one after #146, as the experience with Go is very much the same as with C (except for the fact that Go has relatively ok strings and slices).

arekbulski commented 6 years ago

I promised to implement the C runtime, but that was a few months ago. Since then I had much work on Construct, and now I am working on a few things in KS. I am still willing to implement this, but I can't work on everything at once. If you wish, I will get on top of C, but other work items would need to be shelved instead.

DarkShadow44 commented 5 years ago

Any updates on this? I'd like to help, but I'm not familiar with Scala...

GreyCat commented 5 years ago

No updates. Unfortunately, most of https://github.com/kaitai-io/kaitai_struct/issues/263#issuecomment-331869391 still stands. It's probably still a good idea to complete the Go port first, as it shares many common concepts (except for the hassle with memory management).

DarkShadow44 commented 3 years ago

FWIW, I have a (for me) working C version at https://github.com/DarkShadow44/UIRibbon-Reversing/blob/master/tests/UIRibbon/parser_generic.c and https://github.com/DarkShadow44/UIRibbon-Reversing/blob/master/tests/UIRibbon/parser_uiribbon.c It's pretty simple, copying data from the file into an in-memory struct. It also supports writing data. What do you think about that approach? It might not fulfill all use cases, but to me it does the job.

KOLANICH commented 3 years ago

BTW, I have a half-finished (but not yet published; development stalled because I got other tasks) proposal of how it should look for C and C++ for one damn simple spec.

In general:

DarkShadow44 commented 3 years ago

Would you have an example of what that C code would look like? I don't quite understand the "private structures" bit. In my example, all structs are public. I don't really do streams either; it's an in-memory stream abstraction. How do you do memory mapping in standard C?

KOLANICH commented 3 years ago

Would you have an example of what that C code would look like?

I have said that it is unfinished. But I'll create a small example just now illustrating what I mean, though without any guarantees of correctness.

I don't quite understand the "private structures" bit. In my example, all structs are public.

Very easy

#include <stdint.h>
#include <stdlib.h>

struct a {
  uint64_t *c;
};
struct a_priv {
  uint64_t c;
};
struct a_full {
  struct a pub;
  struct a_priv priv;
};

struct a * construct_a() {
  struct a_full *a = (struct a_full *) malloc(sizeof(struct a_full));
  a->pub.c = &a->priv.c;
  return (struct a *) a;
}

void process(struct a * obj) {
  *(obj->c) = 42; /* writes through the pointer, wherever the storage actually lives */
}

This way we access the data only via pointers, so we access it uniformly no matter where it is. It comes at a cost: there is overhead, one pointer per variable. It is possible to make it more efficient by keeping only pointers to structs, not to every field, but in C it would make the API terrible and substantially different from the one in other languages. In C++ it can be fixed by operator overloading and constexpr.

I don't really do streams either, it's an in-memory stream abstraction.

I guess some libcs can implement the fread/fseek/fwrite API over mmaps.

How do you do memory mapping in standard C?

Standard C doesn't even have any sane functions to work with strings. It is an extremely bad, too "-fpermissive", stagnating language (once I was debugging a memory-safety issue for quite a long time ... because the C compiler almost silently (with a warning, but who looks at warnings in a project that is already filled with warnings?) allowed me to pass an incompatible type as an arg (or maybe I missed an arg, I don't remember exactly)). IMHO there is no sense in using C where C++ can be used. Usually when I see C fans, I see unacceptable shitcode. The only real way to fix that shitcode ... is to implement a kind of OOP myself above plain C. I prefer to just use C++, but there are some projects created by C fanatics (in the sense I described above: the projects are full of shitcode) that I had to contribute to.

DarkShadow44 commented 3 years ago

This way we access the data only via pointers, so we access it uniformly no matter where are they.

I don't really see the point behind that, tbh. What's the disadvantage of my approach? I don't need everything as pointers.

but in C it will cause the API being terrible and sufficiently different from it in other langs.

Sure, the API will be different, but that's because C is not OOP. That doesn't necessarily make it terrible. As you can see, my implementation uses an OOP abstraction as well; where's the problem with that?

Standard C doesn't even have any sane functions to work with strings

Yea, that's why I just keep strings as-is.

KOLANICH commented 3 years ago

What's the disadvantage of my approach?

It is just a different approach designed with different things in mind. When I was designing my approach I was thinking about making serialization cheaper and easier and about reducing memory footprint by not copying data at all, and about volatile structures in memory to be used for IPC and to control devices mapped to memory.

DarkShadow44 commented 3 years ago

For memory footprint it would be enough to only keep big data blobs memmapped; everything else is smaller than the size of a pointer. Anyway, for memory mapping we need platform-specific code anyway, right? I propose some kind of stream abstraction (similar to what I made), which can be in-memory (like mine) or memory-mapped.

I was thinking about making serialization cheaper

How does that make serialization cheaper? Sure, you can edit files as-is, but when writing new files it makes things harder.

about volatile structures in memory to be used for IPC and to control devices mapped to memory.

That's an interesting point, I thought we only cared about file formats. Can the other languages (especially C++) handle something like this?

KOLANICH commented 3 years ago

For memory footprint it would be enough to only keep big data blobs memmapped, everything else is smaller than the size of a pointer.

Yes, and these data blobs can be contiguous structures of fixed size that are larger than a pointer. So setting a field is just setting it via a pointer, *(a->b) = c;, in that case. The problem is with variable-size structures: the offsets are not known at compile time. We would have to split them into chunks of constant and non-constant size, so *(a->constSize0.b) = newB; set_A_variableSizeFieldC(&a, newC);. It is extremely inconvenient to have this mess. So we would have to wrap everything into accessor functions; fortunately it can be made efficient, since the functions can be inlined in a compiler-specific way (and there exists an awesome abstraction layer called Hedley), but using accessors is still not very convenient. In C++ everything can be transformed into assignments and templates that would be optimized out.

How does that make serialization cheaper? Sure, you can edit files as-is, but when writing new files it makes things harder.

Not everything (changing a variable-length field in a way that shifts offsets by something other than a multiple of the page size may require relocating everything after it), but minor changes to constant-size structures should be cheap. Currently

Can the other languages handle (especially C++) handle something like this?

C++ can handle everything C can, can't it? I also guess Rust can do the same, since it is positioned to replace C in systems programming and has packed structs. In all other languages having memory maps it is also possible, e.g. here is an example for Python:

https://github.com/KOLANICH-tools/FrozenTable.py/blob/0e983ab2b4afc1c80e0afef51d9a48b46cbbf1c0/FrozenTable/BinPatchTools/ExecutableFormat.py#L39L40

https://github.com/KOLANICH-tools/FrozenTable.py/blob/0e983ab2b4afc1c80e0afef51d9a48b46cbbf1c0/FrozenTable/__init__.py#L219L219

DarkShadow44 commented 3 years ago

The problem is with variable-size structures, the offsets are not known in compile time.

I don't see a problem here, mind giving an example? Maybe I misunderstand what you mean by "variable-sized structures". In my use cases, I didn't find a need for accessors.

Not everything (changing a variable-length field in a way that shifts offsets by something other than a multiple of the page size may require relocating everything after it), but minor changes to constant-size structures should be cheap. Currently

The question is what the main use case is. How important is editing a file; is overwriting it with a new file not good enough?

C++ can handle everything C can, cannot it?

No, I meant whether this functionality is already part of the C++ implementation of KS.

KOLANICH commented 3 years ago

How important is editing a file, is overwriting it with a new file not good enough?

What if a file is large? What if a file is in flash with a limited number of writes? What if the file is not very large, but the actual storage is on the other side of the world and the file is accessed via a networked FUSE filesystem?

No, I meant whether this functionality is already part of the C++ implementation of KS.

No, it isn't. C++ runtime is stream-based, not memory-based.

DarkShadow44 commented 3 years ago

What if a file is large? What if a file is in flash with limited count of writes?

Honestly, most official tools dealing with these files can't edit them in place either. It's mostly a re-write as well. I'd consider those niche cases, although that might be up for discussion. Editing files directly quickly becomes complicated, unless you only allow editing values without adding/deleting. I guess that could be useful as well, but as I said, it sounds like niche cases to me.

DarkShadow44 commented 3 years ago

Would pull requests be considered or is still further planning/discussion needed here?

KOLANICH commented 3 years ago

Discussion is needed IMHO.

Unfortunately, I have accidentally deleted the prototype whose design I spent several hours on, but I remember the main ideas.

IMHO it should not be SAX-like parsing, but OOP over C with kinda virtuality via pointers. For each KS structure, KSC must generate:

  1. the packed struct of actual data. It can be laid out over actual memory. In the case of variable-length fields, including fields that can be missing, a struct is cut by them into parts of constant length. All the complexity related to accessing fields lives in accessors.
  2. the header struct of headers for each non-primitive field and for each non-constant-offset packed struct. Each header contains at its beginning a pointer to the beginning of the memory chunk it manages, and maybe the length of the whole chunk, if it is variable-length.
  3. multiple ctors
  4. a dtor
  5. accessor functions
  6. functions returning sizes and offsets of each field and the whole structure
  7. parser function
  8. the materialization function!
  9. the dematerialization function!

All of these must follow a certain naming convention allowing prediction of function names.

When parsing, the packed structs are laid over raw memory. Then the header struct is populated with pointers. Then these pointers can be used to access (read, write) raw values in memory.
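A tiny illustration of that parsing step (hypothetical names, GCC-style packed attribute): the packed struct mirrors the on-disk layout, and the header only holds pointers into the raw buffer.

#include <stdint.h>

struct __attribute__((packed)) point_packed {
  uint16_t x;
  uint16_t y;
};

struct point_hdr {
  uint8_t* start; /* beginning of the managed memory chunk */
  uint16_t* x;
  uint16_t* y;
};

static void point_parse(uint8_t* buf, struct point_hdr* h) {
  struct point_packed* p = (struct point_packed*) buf; /* lay the struct over raw memory */
  h->start = buf;
  h->x = &p->x;
  h->y = &p->y; /* later reads and writes go through these pointers */
}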

But now serialization comes into play.

  1. A serialization ctor allocates a buffer of size sizeof(ptr_header) + sizeof(packed) and populates the header.

  2. Then we can put there data by the pointers. The data is put into the buffer after the header.

  3. At some point we may want to put the data into the file, i.e. a memory-mapped buffer. Then the memory chunk is memcpyed to the mapped memory, the pointers in the header are reassigned to it, and the object is shrunk with realloc. Of course this can be used to put data not only into an mmaped file, but into an arbitrary buffer, e.g. an element of an array. We call this procedure materialization.

  4. The inverse of materialization is dematerialization. There is a slight difference from construction: the data can be placed not right after the header, but at an arbitrary offset. And the first pointer in the header points exactly to that offset.

Here is some code and the corresponding spec designed with these ideas in mind, though not exactly. I had other code and another spec which were purposely designed to show how KSC-generated code should look with all the verbosity, but they were lost. The code at the link was designed primarily to solve the problem, not to invent the design of the C runtime.

DarkShadow44 commented 3 years ago
  1. the packed struct of actual data. Can be layed out over actual memory
  2. the header struct of pointers to each field of the struct

I'm still not really convinced of the idea with the pointers. How common is this use case really? It does make things a lot more complicated.

I'd still prefer the simple structure, where KS reads the data into structs, and then you have one function to free that struct again.

  1. multiple ctors
  2. accessor functions
  3. functions returning sizes and offsets of each field and the whole structure

What exactly do we need those for? In my approach, I just malloc the structure, fill it, and later free it. Accessors should be covered by accessing the struct directly, no? Sizes and offsets, maybe. What for exactly? The real size can change when it contains conditional fields, no?

In short, my suggestion is:

  1. One struct for each kaitai type
  2. Single structs inside other structs don't need to be allocated
  3. Variable size arrays of structs inside structs need to be allocated
  4. Instances need to be allocated as well
  5. Parser reads the data straight into the struct, and returns the main struct
  6. Main struct can be destroyed by calling its "destructor"
  7. "Stream" objects to be operated on, providing functions like "read uint32" or "read string", also doing size checks
  8. Returning an error code from each function, using a macro for quick checking and returning the error of called functions
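A rough sketch of points 7 and 8 (ks_stream, ks_read_u4le and KS_CHECK are made-up names, not an agreed-upon API): every read function returns an error code, and a macro propagates errors from nested calls.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  const uint8_t* data;
  size_t size;
  size_t pos;
} ks_stream;

#define KS_CHECK(expr)          \
  do {                          \
    int _err = (expr);          \
    if (_err != 0) return _err; \
  } while (0)

static int ks_read_u4le(ks_stream* io, uint32_t* out) {
  if (io->pos + 4 > io->size)
    return 1; /* end of stream */
  memcpy(out, io->data + io->pos, 4); /* assumes a little-endian host, for brevity */
  io->pos += 4;
  return 0;
}

typedef struct {
  uint32_t value1;
  uint32_t value2;
} my_data;

static int my_data_read(ks_stream* io, my_data* r) {
  KS_CHECK(ks_read_u4le(io, &r->value1));
  KS_CHECK(ks_read_u4le(io, &r->value2));
  return 0;
}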

I just don't see the need for another indirection like accessors or making everything in the structs a pointer. It would make the entire API more complex and cumbersome to use. Are big files / IPC / device IO important enough to justify that? Or maybe we can add that later as an option?

Btw, anyone else who wants to chime in? There seems to be quite a bit of interest, so why not join the discussion? :)

KOLANICH commented 3 years ago

It does make things are lot more complicated.

It is just to enable smarter serialization (editing) without rewriting the same stuff when it doesn't need to be moved (I wonder if OSes have an API to rearrange pages on disk, to avoid rewriting data at all, only metadata, in cases where page-aligned, page-sized structures within files need to be rearranged).

Accessors should be covered by accessing the struct directly, no?

Yes and no. Yes, because in some cases yes. No, because let's assume that there is a variable-size array between 2 fields, and you need to add an element there. Initially, when mmaped, it is in a materialized state. It is inefficient to move the rest of the struct for each appended element, so the writing accessor for an array dematerializes that struct if it was materialized, so the array is moved into a separately-allocated buffer that can be realloced if further additions of elements happen. But you need a uniform interface for all of these to be useful in your own software. Accessors are meant to provide such an interface, hiding the compiler-generated complex logic behind them, so your software doesn't have to carry this logic in itself.

Or maybe we can add that later as an option?

Of course such a target can be added later as a yet another target.

It would make the entire API more complex and cumbersome to use. Are big files / IPC / device IO important enough to justify that?

As I have already said multiple times, the current spec of Qt Installer Framework compiled into a C++ app consumes (at least consumed, because the Qt Company has retired offline installers and I used that spec to unpack the Qt offline installer for Windows on Linux, since it didn't work in Wine) 12 GiB for a 2 GiB file being parsed. It is extremely strange.

DarkShadow44 commented 3 years ago

It is just to enable smarter serialization (editing) without writing the same stuff when it is not needed to be moved

And I don't think that's necessary. I don't know of any application that does that; most use cases are perfectly fine with rewriting files completely. Even big files like archives are usually re-created every time.

No, because let's assume that there is a variable-size array between 2 fields, and you need to add an element there. Initially, when mmaped, it is in a materialized state. It is inefficient to move the rest of the struct for each appended element, so the writing accessor for an array dematerializes that struct if it was materialized, so the array is moved into a separately-allocated buffer that can be realloced if further additions of elements happen

I honestly didn't quite understand that de/materialize thing yet. Anyway, for that use case, you have to break with the whole "editing the file without rewriting", at least partially. You either need to rewrite the whole file, or rewrite the main seq and then move one (or more) instances afterwards, updating the pointers accordingly. It would be a lot easier to just rewrite the whole file from scratch at this point, because I wouldn't want to deal with those complications.

As I have already said multiple times, the current spec of Qt Installer Framework compiled into a C++ app consumes (at least consumed, because the Qt Company has retired offline installers and I used that spec to unpack the Qt offline installer for Windows on Linux, since it didn't work in Wine) 12 GiB for a 2 GiB file being parsed.

That's only for big files, right? For those I propose a different approach: for variable byte arrays (which should be pretty rare), we could add an opaque byte_blob type into the structs. This would need to be handled with accessor functions. Behind that opaque handle it could be backed by memory or a file descriptor. Meaning, when you parse a file and pass the according flag, it wouldn't be completely read into memory; instead we'd store the position and size of the blob in a struct and later read it from there. Then big files, usually archives, would keep the big part on disk, while still allowing you to use them almost like normal. It would also avoid memmap, since we could use normal FILE descriptors. Would that address the "big files" problem?
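A rough sketch of that opaque blob idea (byte_blob and byte_blob_load are hypothetical names, not an existing KS API): the struct only remembers where the payload lives, and the bytes are fetched on demand from the backing FILE*.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  FILE* backing;  /* NULL if the blob is held in memory */
  long offset;    /* position of the payload in the file */
  size_t size;
  uint8_t* mem;   /* used instead of (backing, offset) for in-memory blobs */
} byte_blob;

/* Read the blob contents into a freshly allocated buffer (caller frees). */
static uint8_t* byte_blob_load(const byte_blob* b) {
  uint8_t* buf = malloc(b->size);
  if (!buf) return NULL;
  if (b->backing) {
    if (fseek(b->backing, b->offset, SEEK_SET) != 0 ||
        fread(buf, 1, b->size, b->backing) != b->size) {
      free(buf);
      return NULL;
    }
  } else {
    memcpy(buf, b->mem, b->size);
  }
  return buf;
}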

KOLANICH commented 3 years ago

I honestly didn't quite understand that de/materialize thing yet.

A very easy concept. In DBs, "materialized" means "stored on disk". A usual view is just a way to reuse a SELECT query. A materialized view actually runs the query and puts its results into a table. So it is kind of a cache for complex queries.

Here is how I used the word "materialized": let's assume we have an mmaped file. It occupies some address space. Then we use the code transpiled from a spec (I guess we need to invent a term for it) to parse it. It creates some "headers" whose pointers point into the address space of the original buffer, the mmaped file. If we write through these pointers, it would cause a write to disk, and other software mapping this file would see the changes immediately if it read it. When we "dematerialize" an object, we just allocate a buffer in our process heap, copy the corresponding struct bytes there, and change the headers to point there. Now we can change that object without needing to move everything after it.
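A rough sketch of that "dematerialize" step (all names hypothetical): copy the object's bytes out of the mapped file into the heap and repoint the header, so further edits no longer touch the file.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct obj_hdr {
  uint8_t* data; /* points either into the mmaped file or into the heap */
  size_t size;
  bool materialized;
};

static int dematerialize(struct obj_hdr* h) {
  if (!h->materialized)
    return 0; /* already a private heap copy */
  uint8_t* copy = malloc(h->size);
  if (!copy)
    return 1;
  memcpy(copy, h->data, h->size); /* pull the bytes out of the mapping */
  h->data = copy;
  h->materialized = false;
  return 0;
}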

I don't know of any application that does that, most usecases are perfectly fine with rewriting files completely. Even big files like archives are usually re-created every time.

Do they really? Deleting files from archives is usually fast, so is renaming them.

Anyways, for that usecase, you have to break with the whole "editing the file without rewriting", at least partially.

We certainly "break from "editing without rewriting", at least partially" when we rewrite a single integer - that integer is the part being rewritten!

You either need to rewrite the whole file, or rewrite the main seq and then move one (ore more) instances afterwards, updating the pointers accordingly.

How much of the file must really be rewritten depends on the way the file is modified, so the generated code should decide that automatically.

Would be a lot easier to just rewrite the whole file from scratch at this point, because I wouldn't want to deal with those complications.

Yeah, it can be easier to implement only parsing first. I don't even ask you to implement my ideas, you are the one writing the code - it's up to you to decide what and how to implement. I just think it may make sense to keep serialization in mind when implementing parsing, so serialization code and the code relying on mmaped access can be added into that backend later. But it is important to design things in a way that fitting these additions (while reusing the code as much as possible) wouldn't cause breakage of the API. Supporting hardware without an MMU is important, and probably the backend development should start from the non-MMU case, but please keep the mmaped case in mind and design things so they are reusable for all the cases.

Meaning, when you parse a file and pass the according flag, it wouldn't be completely read into memory, but instead store the position and size of the blob in a struct and later read it from there. Then big files, usually archives, would keep the big part on disk, while still allowing you to use it almost like normally. Would also avoid memmap, since we could use normal FILE descriptors. Would that address the "big files" problem?

Depends on the structures used within files. If the data in them is not opaque blobs, but also should be parsed and accessed, if I understand your description right, it won't help.

DarkShadow44 commented 3 years ago

Very easy conception. In DBs "materialized" means "stored on disk". A usual view is just a way to reuse a select query. A materialized view just queries a query and puts its results into a table. So it is a kinda cache for complex queries.

Thank you for the explanation, that makes sense!

Do they really? Deleting files from archives is usually fast, so is renaming them.

Well, I did a quick test with a 3GB zip in Explorer, and a 3GB zip/7z in 7-Zip: delete and rename to a shorter name. Both programs take a bit and definitely make a 3GB copy; Resource Monitor confirms that. I think it's just that the SSD is so quick it looks fast. I also worked with game archive un/packers in the past, and they lacked in-place editing as well. Although I'd love to see some applications that do it without copying, so please tell me. The point stands, IMHO: if most programs get away with that for big files, so can we.

I don't even ask you to implement my ideas, you are the one writing the code - it's up to you to decide what and how to implement. I just think it may make sense to keep serialization in mind when implementing parsing, so serialization code and the code relying on mmaped access can be added into that backend later.

Well, it's not necessarily that I don't want to; it's that it seems to make the API too complicated to me. When we do the automatic dematerialization we need to support more use cases, like 1) adding an element to a variable-length array, 2) setting a flag that causes another optional value to be used. That means we would access everything with accessor methods, no? Like:

ks_set_uint32(data->value, 5);
uint32_t value = ks_get_uint32(data->value);

I'd prefer if it was like

struct accessor_uint32_t {
    uint32_t (*get)(void);
    void (*set)(uint32_t value);
};
struct my_data {
    struct accessor_uint32_t value;
};

But AFAIK that's not possible in C.

The simplest would be to disallow operations that change the (file) size on materialized objects. That would allow us to get rid of accessors and work directly with the pointers in the struct. For big files we'd still have the blob idea to reduce memory usage. Then we could get away with API usage like:

*data->value = 5;
uint32_t value = *data->value;

which looks a lot easier to me. That's what you proposed in the beginning, right? Of course, we would trade complexity against features and have undefined behavior if users don't follow the "do nothing that changes file size without dematerializing first" rule.

Depends on the structures used within files. If the data in them is not opaque blobs, but also should be parsed and accessed, if I understand your description right, it won't help.

Correct. But when we need to parse them, we'd need pointers for all of them anyway, so your approach doesn't save space either. Correct me if I'm wrong.

In short, I just want an API that's simple and clean to use. Accessor functions for everything would make it harder to consume, and pointers also introduce new room for errors (missing *). My simple approach without inline rewrite just has plain structs, so no missing pointer operators. Thoughts?

KOLANICH commented 3 years ago

I just want an API that's simple and clean to use.

Those who want complex things to look simple should just use C++. Old C is about being as explicit as possible (though it fails at that; I remember a case where I wasted 2 hours because the C compiler was OK with an incorrect call of a function). Well, if one wants to be explicit, no one forbids writing C++ in C style. Yeah, the compilation is slower, but IMHO there is no reason not to use C++ instead of C where it is needed (and C++'s increased strictness is almost always needed), except being required to use C by project policy.

DarkShadow44 commented 3 years ago

Well, there are enough projects that are plain C to prevent us from using C++ for the generated C code. Anyway, any comments on the other topics?

EDIT: Since inline rewriting should be a rarer case, I propose a different approach:

struct ks_handle
{
    ks_stream* stream;
    void* struct_data; // Pointer to the struct itself
    int (*func_write_struct)(void* struct_data); // What function to call to write that struct back
    int last_size; // To make sure it's not bigger than before
    int real_pos; // Where to write
};
struct my_data
{
    struct ks_handle ks_handle;
    int value1;
    // ...
};

void ks_update_internal(struct ks_handle* handle) {
    // Writes that struct and all descendants back into the stream (which might be mmaped),
    // also does a size sanity check and potentially moves stuff around to make space.
}

// Usage:
my_data* data = ks_parse_file(/* ... */);
ks_update_internal(data->ks_handle);

How does that sound?

KOLANICH commented 3 years ago
ks_update_internal(data->ks_handle);

it'd pass the struct by value. It may be OK, or not - I don't currently have any strong opinion about that.

so, why not just

ks_update_internal(&data);

?

int value1; // ...

I guess it should be a separate struct, and then aggregated using a struct of 2 fields.

void* struct_data; // Pointer to struct itself

Is it a pointer to struct ks_handle or struct my_data or the "tail"? Why is it void*, and not struct ks_handle * in the first 2 cases and why not

struct header_and_generic_tail{
    struct ks_handle header;
    uint8_t tail[];
};

and so tail just points to the tail, and when we know the type it can be cast to the needed type of the tail structure

struct header_and_generic_tail *struct_data;
....
((concrete_struct*)struct_data->tail)->...

, in the third case?

func_write_struct; // What function to call to write that struct back

If we really want to imitate virtual functions, it probably should be a virtual table and not a single function.

DarkShadow44 commented 3 years ago
ks_update_internal(&data);

Well, considering the first element of the struct is always at the same address as the struct itself, this would work. But then ks_update_internal would need to take void*, and I didn't do that for a bit of type safety.

int value1; // ...

I guess it should be a separate struct, and then aggregated using a struct of 2 fields.

Not sure I understand, please elaborate?

void* struct_data; // Pointer to struct itself

Is it a pointer to struct ks_handle or struct my_data or the "tail"? Why is it void*, and not struct ks_handle * in the first 2 cases

It's a pointer to the my_data instance, so we only need to pass the handle when telling the runtime to rewrite that struct. It can't be my_data* directly, since it could be any struct. It just gets passed to the func_write_struct, which does the cast to the structure it deals with. Not sure what exactly you mean with tail here.

If we really want to imitate virtual functions, it probably should be a virtual table and not a single function.

It's not really about virtual functions, it's just to tell the struct what function can write it. But sure, we could also put it into a struct and add read func (and maybe others).

KOLANICH commented 3 years ago

Not sure I understand, please elaborate?


struct __attribute__((packed)) my_data_raw {
    int value1; // ...
};

struct header_and_generic_tail {
  struct ks_handle header;
  uint8_t actual_data[];
};

struct my_data_with_header {
  struct ks_handle header;
  struct my_data_raw actual_data;
};

bool is_newly_created(struct ks_handle* h) {
    return (uint8_t*)h->struct_data == (uint8_t*)((struct header_and_generic_tail*)h)->actual_data;
}

/// Or maybe just store a boolean flag if a value is materialized, but I guess the pointer to the main struct (or maybe parent struct) may be useful.

bool is_materialized(struct ks_handle* h) {
    return h->main && (uint8_t*)h->struct_data >= (uint8_t*)h->main->range.start && (uint8_t*)h->struct_data < (uint8_t*)h->main->range.stop;
}

Then

struct ks_handle * my_data_ctor() {
    struct my_data_with_header *b;
    b = malloc(sizeof(struct my_data_with_header));
    memset(b, 0, sizeof(struct my_data_with_header));
    b->header.struct_data = &b->actual_data;
    return &b->header;
}

struct ks_handle *h = my_data_ctor();
struct header_and_generic_tail *a = (struct header_and_generic_tail *) h;
((struct my_data_raw*)a->actual_data)->value1;  // works only for newly-allocated structures
((struct my_data_raw*)h->struct_data)->value1;  // must work always

or alternatively:


struct header_and_generic_tail a;
struct my_data_with_header *b = (struct my_data_with_header *) &a;

b->actual_data.value1;

depending on which is more convenient in the situation.

But then ks_update_internal would need to take void*, and I didn't do that for a bit of type safety.

Type casting can be done internally.

It's a pointer to the my_data instance, so we only need to pass the handle when telling the runtime to rewrite that struct. It can't be my_data* directly, since it could be any struct.

Why not just pass the pointer to the ks_handle instead, and store the pointer to my_data_raw instead?

Not sure what exactly you mean with tail here.

my_data_raw in the example from this message, if I understand correctly that value1 is supposed to be laid out over raw memory / store the same data as in raw memory.

DarkShadow44 commented 3 years ago

What exactly is the advantage of having that "header_and_generic_tail"? Keep in mind that I don't really go for that "materialized view", but I have functions for rewriting parts, if needed.

Type casting can be done internally.

No, I mean type safety that is checked by the compiler.

if I understand right that value1 is supposed to be layed out over raw memory / store the same data as in raw memory.

Yes, that's what it is supposed to be.

DarkShadow44 commented 3 years ago

Just FYI, I'm currently developing my version in my forks of the compiler/tests and the runtime. It goes a lot slower than I hoped, since I need a lot of time to understand the existing architecture/framework. Anyway, I'm slowly working to turn a copy of the C# compiler into a C compiler. Not sure when I'll finish, but when I do, I'll post it for review. If you're interested in helping, feel free to head over to my profile.

generalmimon commented 3 years ago

@DarkShadow44

I'm currently developing my version in my forks or compiler/tests and the runtime.

In case you missed it, we recommend following https://doc.kaitai.io/new_language.html closely, i.e. first get to know Kaitai Struct well enough (obviously https://doc.kaitai.io/user_guide.html; I also recommend going through test formats to make sure you understand all concepts).

The 2nd step is to plan how to map seq, instances, enums and types (even nested types), primitive types, byte arrays, strings, streams, etc. Also try to gather coding standards and naming practices for the language and follow them.

Then put together some basic runtime library for C and "compile" hello_world.ksy manually, i.e. by hand.

Anyways, I'm slowly working to turn a copy of the C# compiler into a C compiler.

Note that it's not recommended or necessary to touch the KS compiler at all before you finish all the previous steps. Implementing the automatic compilation support in KSC for the C language should be just the icing on the cake, because by then it's already been outlined and generally agreed how KS concepts will map into the C language, the C runtime library is mostly done, and the manually assembled hello_world.ksy C module looks good. We're not there yet, if I'm not mistaken.


Also a note on the previous discussion between @DarkShadow44 and @KOLANICH: in my opinion, it makes no sense at this point to think about how serialization could work in C when basically nothing from the task list has been done and we don't even have any deserialization for C. I think this "head in the clouds" attitude and the constant effort to address potential future problems (that may or may not become relevant in like 5 years from now, but only if someone ignores the irrelevant discussion and starts addressing the basic and perhaps not so attractive stuff first), instead of solving current basic problems, is what holds back the implementation of many features in Kaitai Struct.

DarkShadow44 commented 3 years ago

In case you missed it, we recommend following https://doc.kaitai.io/new_language.html closely, i.e. first get to know Kaitai Struct well enough (obviously https://doc.kaitai.io/user_guide.html; I also recommend going through test formats to make sure you understand all concepts).

Yeah, at this point I think I know enough to comfortably get to an implementation. Though I'm not entirely sure how exhaustive the test cases are; for example, I didn't find anything with an instance in an instance in an instance.

The 2nd step is to plan how to map seq, instances, enums and types (even nested types), primitive types, byte arrays, strings, streams, etc. Also try to gather coding standards and naming practices for the language and follow them.

I got most of that already planned out; I mean, I already have a half-working Python-based compiler.

Note that it's not recommended or necessary to touch the KS compiler at all before you finish all previous steps. Implementing the automatic compilation support in KSC for the C language should be just icing on the cake, because it's been already outlined and generally agreed on how KS concepts will map into the C language, the C runtime library has been mostly done and the manually assembled hello_world.ksy C module looks good. We're not yet there, if I'm not mistaken.

Except for arrays and strings I should have it mostly planned out; I planned to work on those on the fly... For the runtime I have most of the concepts already; just the port from my Python impl to Kaitai is outstanding, with a few changes here and there. Regarding hello_world.ksy, that should compile already, so I should be able to provide it, if that would help?

I'll admit that I usually just get to work and figure things out while working on it, although I wouldn't mind discussion when there's someone to talk to. But except for KOLANICH there wasn't much input so I just decided to get started.

in my opinion, it makes no sense at this point to think about how serialization could work in C when basically nothing from the task list has been done and we don't even have any deserialization for C.

Well, from my perspective it does make sense, since I'd need both before I can properly use the compiler for my project. Although I also don't know if an approach for serialization in C might just be rejected because it doesn't fit the vision/structure of Kaitai Struct. I got most of it worked out in theory, although I don't know if that's okay for you as well.