kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.01k stars 196 forks source link

C++: support modern standards like C++11 / C++14 / C++17 #211

Open GreyCat opened 7 years ago

GreyCat commented 7 years ago

Following a discussion in #209, there seems to be some interest in developing a C++ target that will work with all the might options that modern C++ offers us, like std::unique_ptr, std::shared_ptr, std::move, removing direct invocations of new and delete, allocating all the stuff mostly on stack instead of on heap, etc, etc.

I invite everyone interested in participating in discussion and getting a solid plan ready, what exactly should change, how declarations, invocations, object creation & destruction, usage, etc, should look like.

Cc @scottmsilver @JaapAap

GreyCat commented 7 years ago

Let me start it with defining some areas that need attention. From what I see, things that are bound to change are:

scottmsilver commented 7 years ago

I am, by no means, an expert.

My 2 cents

On Wed, Jul 19, 2017 at 5:43 AM Mikhail Yakshin notifications@github.com wrote:

Let me start it with defining some areas that need attention. From what I see, things that are bound to change are:

  • Data types used, declaration
    • Usage of dumb pointers vs references vs smart pointers
    • Wrapping of various stuff in std::vector
  • Initialization / constructor, basic initialization
  • _read() method, instantiation of types
  • Destruction and cleanup (ideally, there should be none — I wonder if that's possible)
  • Usage (i.e. instead of using foo()->bar()->baz() there would be foo().bar().baz() or some combinations?)
  • Exception handling (?)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaitai-io/kaitai_struct/issues/211#issuecomment-316373637, or mute the thread https://github.com/notifications/unsubscribe-auth/AK76SaF7Wlyz8ZFma1fcHT-lncpIyQogks5sPfnfgaJpZM4Ocqo5 .

GreyCat commented 7 years ago

Exceptions are already widely used in C++ runtime, and it's one of a major design choices, i.e. it's unlikely to change.

I'm working on the leaks, see #213 and hopefully soon we'll see progress in #209.

Newlifer commented 7 years ago

I second this!

GreyCat commented 7 years ago

To make this discussion more to the point, I challenge everyone who wants to take part, to port some of our existing tests to C++11/14/17/whatever. "hello_world.ksy" is probably trivial and not that interesting, but I'd love to see stuff like:

berkus commented 7 years ago

"Port" in what sense - the ksy generator or how the generated code should look like as a cpp/h?

GreyCat commented 7 years ago

How the generated code should look ;) I'll do the fixes to ksc myself :)

berkus commented 7 years ago

Ok, I will give it a shot tomorrow/day after.

berkus commented 7 years ago

https://gist.github.com/berkus/c12f110d12bd71b7a31394e182e67e3f this is a first approach (perhaps not entirely working) to buffered_struct.

JaapAap commented 7 years ago

Just wondering: you opted for shared_ptr. Is this really needed (or useful, and worth the associated performance impact)?

The existing interface has all ownership with the entities derived from kstruct - is it really needed to extend the lifespan of the parsed data beyond the lifetime of the kstruct decedents?

berkus commented 7 years ago

It's not required, but it greatly simplifies destructor code and makes it near impossible to make ownership errors.

Of course if entire ownership is contained in the structure it should work with manual pointers just fine.

The nav_parent example is more complicated; and additional problem there is it's not possible to use shared_ptrs there :) so plain pointers might be better.

berkus commented 7 years ago

One extra thing that might be worth doing is move all the boilerplate initialization code (constructors and getters) to headers, while leaving only _read() methods in cpp file. If you do keep plain pointers then dtors will also stay in cpp.

berkus commented 7 years ago

https://gist.github.com/berkus/9e656fcc2631a13614f19bc85746365d it's a sketch for nav_parent example. Notice that m__root could not be made a weak/shared_ptr for the outer class, as it would require shared_from_this() init call in ctor.

This is definitely not final and could be improved further.

GreyCat commented 7 years ago

@JaapAap @berkus Note that there might be two use cases for C++ classes:

  1. Read-only model: kstruct-based objects are created from parsing, thus _read() is essentially a part of constructor. Eagerly read members (i.e. seq attributes) are immutable and have only getter. It is generally safe to assume that whole hierarchy of kstruct-based objects that might be created due to parsing process is owned by root kstruct-based object and it's generally ok to delete everything as that object disappears.

  2. Read-write model: it is possible to build objects manually and then call something like _write() to serialize them to the stream. Probably typical use case would be something like that. Given this KSY:

meta:
  id: foo
seq:
  - id: coord_x
    type: s1
  - id: coord_y
    type: s1
  - id: comment
    type: comment_block
types:
  comment_block:
    seq:
      - id: contents
        type: strz
        encoding: UTF-8

One might want to use it like that:

kaitai::kstream* my_stream; // somehow initialized the stream and readied it for writing

comment_block_t my_comment();
my_comment->set_contents("Some string");

foo_t my_obj();
my_obj->set_coord_x(100);
my_obj->set_coord_y(100);
my_obj->set_comment(&my_comment);

my_obj->_write(my_stream);

Note that there are two objects and the inner one is created by calling code, and I guess it's generally good idea to assume that it's owned by that code as well (i.e. it would be cleaned up by caller, for example, with function's stack cleanup).

JaapAap commented 7 years ago

I suppose that the caller could also hand over ownership to the comment_block_t, so a unique_ptr could be used? It might simplify things if comment_block_t always has/takes ownership?

JaapAap commented 7 years ago

My remark on use of shared_ptrwas actually meant to discuss whether using unique_ptrinstead would be a good option instead - I was not thinking of plain raw pointers here. I still have the feeling that essentially replacing all owning pointers by a unique_ptr, and keep all non-owning pointers to be raw pointers (or, in C++17 terms, alternatively use observer_ptr to document more explicitly that these pointers are not owning) should be able to get optimal performance and the same functionality that is present now.

This would also be in line with the C++ Core Guidelines, to be found for example here: http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-resource

I don't consider myself an expert and would be happy to be convinced that this is not a good (or suboptimal) solution.

berkus commented 7 years ago

unique_ptrs are better than shared_ptrs if possible to use, but i'm not sure how smart is ksy compiler to figure out ownership - need more info.

I'd keep non-owning ptrs as weak_ptrs though, so that they are automatically cleaned up. This means using shared_ptrs.

JaapAap commented 7 years ago

My take on is it that he compiler has this knowledge available – it uses it now to do memory allocations and de-allocations. Indeed, the knowledge will now be needed in multiple locations during the code-generation process though, but similarly, this knowledge would be required to determine where to use shared_ptr and where to use weak_ptr instead?

What do you mean by ‘automatically cleaned up’ for weak_ptrs? Are they that different from the raw pointers obtained by calling ‘get’ on a unique_ptr?

berkus commented 7 years ago

@JaapAap weak_ptr will be nullptr when it's "parent" shared_ptr is destroyed. Raw pointers don't become nullptr automatically.

JaapAap commented 7 years ago

What I meant is that calling .get() on a unique_ptr returns either the underlying raw pointer to some pointee, or a nullptr when the unique_ptr is not (or no longer) associated with a pointee. So the behavior is very similar, right?

berkus commented 7 years ago

Yes, but you cannot have a unique_ptr to the parent.

JaapAap commented 7 years ago

I see your point. I suppose the current code assumes/relies on the fact that the parent is never accessed after it has been destroyed, but possibly incorrectly? I am still trying to get my head around the logic behind the allocation/destruction order.

GreyCat commented 7 years ago

In current code, "parent" is assumed to always exist when child objects are alive. On destruction, every object destroys everything that it allocated, thus it usually includes all the objects that were allocated with new some_type_t(..., this, ...).

JaapAap commented 7 years ago

Thanks - would that imply that all objects are essentially destructed (in some sequence) in one go (that is, there are no calls accessing other objects during this sequence)? Would you know of any potential situations in which the assumption would get violated?

GreyCat commented 7 years ago

All object destruction code (i.e. delete operators) in generated C++ files is generated only in destructors. There are no stray delete statements elsewhere. One instantiates whole KSY (or part of that KSY, i.e. a subtype) via new / stack allocation in one go, and it generally gets destructed in one go (as stack goes out of scope, or the one who done new calls delete on main object).

JaapAap commented 7 years ago

Just as a try, I have done what I previously proposed: replace all 'owning' raw pointers by unique_ptrand leave all 'non-owning' pointers untouched. The results (which do seem to successfully pass the existing unit tests) are here:

nav_parent buffered_struct

This is fairly simple to do manually. I have also looked somewhat into the compiler side of things and made some simple adaptions that partly generates these examples. However, as current member variables are always plain pointers, it takes better knowledge than I currently have to determine where to split the generating code into code generating 'owning' and 'non-owning' versions. Hence, I hand-edited the generated code.

Comments are quite welcome.

GreyCat commented 7 years ago

Cool ;) If that works with existing tests — that's super cool, we can integrate it pretty fast ;) Thanks!

JaapAap commented 7 years ago

Please give them a spin and have a look. There are no changes in the structure of the generated code, which should make the compiler side of things relatively simple, but non-trivial nevertheless.

JaapAap commented 7 years ago

Also, I doubt whether these two examples cover all places that need changing in the compiler. If you want me to add some other cases, let me know.

JaapAap commented 7 years ago

I have taken this forward a bit and produced a C++11 compiler that is able to produce most of the C++ code and -when compiled - passes the corresponding unit tests. I am still struggling with the switch_cast and type_ternary cases that both generate code with a call to handleAssignmentSimple that currently lacks the information to decide whether owning or non-owning pointer assignment is needed. I wonder whether someone could get me some help with this? Code lives in a fork [here].(https://github.com/JaapAap/kaitai_struct_compiler/tree/c++11). (Manual fixes are available and require very minor changes only.)

JaapAap commented 7 years ago

Thought: maybe it would be possible to have an 'identifier to type' relationship stored somewhere? If so, where would be the best location in the code to fill this table?

scottmsilver commented 7 years ago

I was able to plumb through DataType to places that didn't have it (and id to places that didn't have it) for another thing I was fooling around with.

Scott

On Wed, Sep 20, 2017 at 1:07 AM, Pieter Ober notifications@github.com wrote:

Thought: maybe it would be possible to have an 'identifier to type' relationship stored somewhere? If so, where would be the best location in the code to fill this table?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaitai-io/kaitai_struct/issues/211#issuecomment-330777436, or mute the thread https://github.com/notifications/unsubscribe-auth/AK76SRqgzIvmfbhqUg7snOMycFNn1PXeks5skMfEgaJpZM4Ocqo5 .

JaapAap commented 7 years ago

I have been playing around with this today and have a working solution. However, it relies on some undesirable coincidences (as far as I can see/understand) in the order that the C++ code is generated. I essentially fill the "id->type" table when an attribute or instance is declared (attributeDeclaration and instanceDeclaration, with calls to kaitaiType2NativeType), but I noticed that some code is generated (e.g. assignments in constructors) that uses the corresponding id while the type is still not stored (but for this code the type is actually not needed). This just feels shaky, so the type should ideally be established at some earlier point. Not sure where though...

JaapAap commented 7 years ago

I have improved the solution somewhat. Still, the lack of type info in some places is resolved in a rather ad-hoc manner (e.g. using string matching), but I believe that it by now is a good starting point for further evaluation. I have slightly changed the class compiler to make type info available at the earliest possible opportunity (within the constraints of my still limited understanding). Shall I create a pull request so someone with more insight in the internal operations of Kaitai can have a look at it? At least, the solution seems to work well with the existing unit tests and I'd be happy to explain what I have done in more detail would someone care to ask.

GreyCat commented 7 years ago

@JaapAap I still haven't taken a good look at your work yet, and I'd love to finally get to it (although probably after releasing v0.8). From a very brief glance, I see that your current implementation replaces existing one. This is probably not what we want — we want it to be switchable between "legacy" vs "modern" modes — so it's either passing some sort of argument (in RuntimeConfig, probably) to a CppCompiler, or two distinct CppCompiler classes. Given that your patch introduces lots of new concepts and they're not very trivial, lots of per-member documentation is really welcome too ;)

JaapAap commented 7 years ago

I am still working with the solution (which should indeed be switchable, but I was unable to find the correct hook for doing so) and still run into some problems. Essentially, for my C++11 solution, more context is needed on the types of instances and attributes. I tried to solve that problem by building a 'database' (effectively just a map) that tracks these types when they are determined in the ClassCompiler. I now ran into the problem that the id of those instances and attributes is not necessarily unique when there are classed embedded in other classes (a case at hand are the root and parent class members, but they could be user-defined too). One potential solution is to pass around the ClassSpec of the class that is being compiled and store all types for class/attribute or class/instance combinations, but this affects a substantial amount of interfaces as the ClassSpec would need to be passed around into many of the CppCompiler calls to retrieve the type information.

As I am struggling with the overall architecture and lack full understanding of the inner workings, I might be missing simpler opportunities, or other, simpler way to get access to sufficient info in the CppCompiler. I would really be greatful to receive any thoughts on this.

JaapAap commented 6 years ago

Update: I have managed to limit the amount of type info passed around allowing to minimize changes to existing code outside of the CppComiler. All unit tests that pass for the current C++ pass, with the exception of the cast_to_top, which seems the only case in which a pointer to a user-type (cast_to_top_t) is actually pointed to, but not owned by a member variable (m_header_casted). My current implementation assumes all member pointers own their pointees with the exception of the pointers to the root and parent structures, and I am not sure how to deal with this special case. I'll pose a question in a separate topic to gain better understanding.

berkus commented 6 years ago

Wow, nice. I admit I completely lost touch with this project for a while - too much work stuff.

JaapAap commented 6 years ago

I still need to fix some things but it looks like I can get this working soonish (depending on the 'work stuff' ahead). It would be great if someone could create the infrastructure to introduce a separate C++11 option (I am currently working in the existing c++ files), as I really lack the knowledge to find the right hooks within the architecture. Possibly, this would include a separate set of unit tests as well to allow updating the unit tests too, allowing them to use smart pointers etc. Not sure how much work this would be of course.

GreyCat commented 5 years ago

Ironically, more than a year have passed, but there's some progress on this made recently. I've spent some time trying to reintegrate JaapApp's fork, but eventually just opted to roll out a slightly modular implementation that will keep both C++98, introducing C++11 elements as a switch. Currently, it's live in master, and it currently passes everything except for:

One can observe results in new CI — look for the columns which names start with "cpp_stl_11", or just execute the following in JS console:

app.__vue__.gridColumns = ["name", "cpp_stl_11/gcc4.8_linux", "cpp_stl_11/clang7.3_osx", "cpp_stl_11/clang3.5_linux", "ruby/2.3"]

Ideally, I'd like to add Windows compilation result there too (both in C++98 and C++11) modes. There's a new CLI option, namely:

  --cpp-standard <standard>
                           C++ standard to target (C++ only, supported: 98, 11, default: 98)

and, internally, this options controls whole new set of C++-specific configurations, i.e. CppRuntimeConfig, currently governing usage of:

If everyone's still interested, please take a look and tell me what you think of it. I'll try to iron out remaining problematic tests soon.

JaapAap commented 5 years ago

Will have a look at this soon!

kismetwireless commented 5 years ago

Awesome, I'll try to take a look soon as well!

CPP11 and above smart pointers have advantages, but can be worked around - so long as the CPP98 support doesn't do alloc and exceptions in the constructor leading to leaks.

GreyCat commented 5 years ago

@kismetwireless Please note that currently we offer --no-auto-read compilation option, which moves non-smart pointers handling from constructor into distinct _read() method, which can be wrapped in try-catch block safely — I believe that's exactly what you were talking about.

GreyCat commented 5 years ago

All C++11 test pass now (not worse than C++98), as evidenced by new CI. The only problem that's left now that I see is checking for memory leaks and resolving those.

GreyCat commented 5 years ago

Leak check fixed. Valgrind now shows:

mosolovsa commented 5 years ago

Is it possible to make generated C++ code a little bit more cache-friendly?

Add support for fix sized arrays (I assume std::array would be preferable to plain C arrays in most cases) when the size of an array known at compile time.

As of now simple.ksy:

meta:
  id: simple
  file-extension: simple
  endian: le

seq:
  - id: please_give_me_array_of_int32t
    type: s4
    repeat: expr
    repeat-expr: 7

will produce:

class simple_t : public kaitai::kstruct {
...
private:
    std::vector<int32_t>* m_please_give_me_array_of_int32t;
...
};

It would be slightly better to me at least don't allocate vector on a heap (that would do the construction/destruction a little bit simplier also):

class simple_t : public kaitai::kstruct {
...
private:
    std::vector<int32_t> m_please_give_me_array_of_int32t;
...
};

But, frankly, I would prefer:

class simple_t : public kaitai::kstruct {
...
private:
    std::array<int32_t, 7> m_please_give_me_array_of_int32t;
...
};

However, this would break backward compatibility. As a compromise, it would be nice to add some support for fixed size repeat, that would produce the code above:

seq:
  - id: please_give_me_array_of_int32t
    type: s4
    repeat: fix
    repeat-fix: 7

I assume that can be preferable also in Rust and probably Go.

All apologies if requested features are already present, reading docs and playing with ksy's from library in web-ide haven't let me to discover those.

Is there any reason to store pointer to std::vector instead of just embedding std::vector to generated class? Is there a possibility to resolve the dependency graph and embed a class instance instead of pointer to an instance?

GreyCat commented 5 years ago

Is it possible to make generated C++ code a little bit more cache-friendly?

Maybe. In general, all your suggestions are valid, but typically come with lots of strings attached and do not work universally in all cases. We might implement them in some time in future as certain narrow case optimization.

Add support for fix sized arrays (I assume std::array would be preferable to plain C arrays in most cases) when the size of an array known at compile time.

It's a complex decision that we may revisit some time in future. I'd like to note:

  1. It's not always easy to tell if size if known in compile time or not. For example, repeat-expr: 42 + a - a is effectively always constant and equal to 42, but it takes certain symbolic resolution engine to derive that.
  2. For serialization use case — given that you need to create / modify existing array, your array might need to grow or shrink.
  3. If you want to avoid heap allocations, std::array needs to go together with embedding user types directly into it, i.e. std::array<foo_t>, not std::array<std::unique_ptr<foo_t>>, otherwise it would be still all on the heap.
  4. Quite often you don't want to have all on the stack — for example, if your data size is relatively large and you'd exhaust stack quickly. One needs to have specific mechanism to force heap allocation (i.e. some command line flag or KSY annotation), and we don't have any at the moment.

It would be slightly better to me at least don't allocate vector on a heap (that would do the construction/destruction a little bit simplier also): ... Is there any reason to store pointer to std::vector instead of just embedding std::vector to generated class?

That's also somewhat complex decision, i.e. having std::unique_ptr<std::vector<int32_t>> vs std::vector<int32_t>, which boils down to:

However, this would break backward compatibility

That's correct. We might implement something like that in future with different option for C++ pointers.

seq:
  - id: please_give_me_array_of_int32t
    type: s4
    repeat: fix
    repeat-fix: 7

I firmly believe that we need to separate pure declaration of structure vs implementation details only related to parsers generated in certain languages. From data structure perspective, it's all the same: it's just a repetition with certain number of repetitions. We might have annotations and/or config options which fine-tune codegen, but we should not embed these into main language (e.g. diagram output will never make any use of such repeat: fix).

Is there a possibility to resolve the dependency graph and embed a class instance instead of pointer to an instance?

Again, it might be possible, but we need to think of good solutions for all points outlined above. C++ codegen is already pretty convoluted, and having fine-tuning mechanism that allows to locate certain things on stack will make it even more complex.

It will be also really cool to see if this brings any real performance benefits. My limited experience with C++ so far shows that more often than not it's not worth the trouble.

KOLANICH commented 5 years ago

IMHO: no need for repeat: fix, repeat:expr should do the job, we on.y need to check if that a compile-time expression.

generalmimon commented 4 years ago

I'm taking this issue out of the 0.9 milestone, because @GreyCat said that he would treat (this issue, i.e. https://github.com/kaitai-io/kaitai_struct/issues/211) _for purposes of 0.9 as "ensuring that we have no leaks in CI in both cpp_stl_98 and cpp_stl11". As far as I know, the only leak that's left to fix for C++98 is covered in issue https://github.com/kaitai-io/kaitai_struct/issues/214, which I'm currently trying to eliminate.