Open GreyCat opened 7 years ago
Let me start it with defining some areas that need attention. From what I see, things that are bound to change are:
std::vector
_read()
method, instantiation of typesfoo()->bar()->baz()
there would be foo().bar().baz()
or some combinations?)I am, by no means, an expert.
My 2 cents
On Wed, Jul 19, 2017 at 5:43 AM Mikhail Yakshin notifications@github.com wrote:
Let me start it with defining some areas that need attention. From what I see, things that are bound to change are:
- Data types used, declaration
- Usage of dumb pointers vs references vs smart pointers
- Wrapping of various stuff in std::vector
- Initialization / constructor, basic initialization
- _read() method, instantiation of types
- Destruction and cleanup (ideally, there should be none — I wonder if that's possible)
- Usage (i.e. instead of using foo()->bar()->baz() there would be foo().bar().baz() or some combinations?)
- Exception handling (?)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaitai-io/kaitai_struct/issues/211#issuecomment-316373637, or mute the thread https://github.com/notifications/unsubscribe-auth/AK76SaF7Wlyz8ZFma1fcHT-lncpIyQogks5sPfnfgaJpZM4Ocqo5 .
Exceptions are already widely used in C++ runtime, and it's one of a major design choices, i.e. it's unlikely to change.
I'm working on the leaks, see #213 and hopefully soon we'll see progress in #209.
I second this!
To make this discussion more to the point, I challenge everyone who wants to take part, to port some of our existing tests to C++11/14/17/whatever. "hello_world.ksy" is probably trivial and not that interesting, but I'd love to see stuff like:
"Port" in what sense - the ksy generator or how the generated code should look like as a cpp/h?
How the generated code should look ;) I'll do the fixes to ksc myself :)
Ok, I will give it a shot tomorrow/day after.
https://gist.github.com/berkus/c12f110d12bd71b7a31394e182e67e3f this is a first approach (perhaps not entirely working) to buffered_struct.
Just wondering: you opted for shared_ptr. Is this really needed (or useful, and worth the associated performance impact)?
The existing interface has all ownership with the entities derived from kstruct - is it really needed to extend the lifespan of the parsed data beyond the lifetime of the kstruct decedents?
It's not required, but it greatly simplifies destructor code and makes it near impossible to make ownership errors.
Of course if entire ownership is contained in the structure it should work with manual pointers just fine.
The nav_parent example is more complicated; and additional problem there is it's not possible to use shared_ptrs there :) so plain pointers might be better.
One extra thing that might be worth doing is move all the boilerplate initialization code (constructors and getters) to headers, while leaving only _read()
methods in cpp file. If you do keep plain pointers then dtors will also stay in cpp.
https://gist.github.com/berkus/9e656fcc2631a13614f19bc85746365d it's a sketch for nav_parent example. Notice that m__root could not be made a weak/shared_ptr for the outer class, as it would require shared_from_this() init call in ctor.
This is definitely not final and could be improved further.
@JaapAap @berkus Note that there might be two use cases for C++ classes:
Read-only model: kstruct-based objects are created from parsing, thus _read()
is essentially a part of constructor. Eagerly read members (i.e. seq
attributes) are immutable and have only getter. It is generally safe to assume that whole hierarchy of kstruct-based objects that might be created due to parsing process is owned by root kstruct-based object and it's generally ok to delete everything as that object disappears.
Read-write model: it is possible to build objects manually and then call something like _write()
to serialize them to the stream. Probably typical use case would be something like that. Given this KSY:
meta:
id: foo
seq:
- id: coord_x
type: s1
- id: coord_y
type: s1
- id: comment
type: comment_block
types:
comment_block:
seq:
- id: contents
type: strz
encoding: UTF-8
One might want to use it like that:
kaitai::kstream* my_stream; // somehow initialized the stream and readied it for writing
comment_block_t my_comment();
my_comment->set_contents("Some string");
foo_t my_obj();
my_obj->set_coord_x(100);
my_obj->set_coord_y(100);
my_obj->set_comment(&my_comment);
my_obj->_write(my_stream);
Note that there are two objects and the inner one is created by calling code, and I guess it's generally good idea to assume that it's owned by that code as well (i.e. it would be cleaned up by caller, for example, with function's stack cleanup).
I suppose that the caller could also hand over ownership to the comment_block_t, so a unique_ptr could be used? It might simplify things if comment_block_t always has/takes ownership?
My remark on use of shared_ptr
was actually meant to discuss whether using unique_ptr
instead would be a good option instead - I was not thinking of plain raw pointers here. I still have the feeling that essentially replacing all owning pointers by a unique_ptr
, and keep all non-owning pointers to be raw pointers (or, in C++17 terms, alternatively use observer_ptr
to document more explicitly that these pointers are not owning) should be able to get optimal performance and the same functionality that is present now.
This would also be in line with the C++ Core Guidelines, to be found for example here: http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-resource
I don't consider myself an expert and would be happy to be convinced that this is not a good (or suboptimal) solution.
unique_ptrs are better than shared_ptrs if possible to use, but i'm not sure how smart is ksy compiler to figure out ownership - need more info.
I'd keep non-owning ptrs as weak_ptrs though, so that they are automatically cleaned up. This means using shared_ptrs.
My take on is it that he compiler has this knowledge available – it uses it now to do memory allocations and de-allocations. Indeed, the knowledge will now be needed in multiple locations during the code-generation process though, but similarly, this knowledge would be required to determine where to use shared_ptr and where to use weak_ptr instead?
What do you mean by ‘automatically cleaned up’ for weak_ptrs? Are they that different from the raw pointers obtained by calling ‘get’ on a unique_ptr?
@JaapAap weak_ptr will be nullptr when it's "parent" shared_ptr is destroyed. Raw pointers don't become nullptr automatically.
What I meant is that calling .get()
on a unique_ptr
returns either the underlying raw pointer to some pointee, or a nullptr
when the unique_ptr
is not (or no longer) associated with a pointee. So the behavior is very similar, right?
Yes, but you cannot have a unique_ptr to the parent.
I see your point. I suppose the current code assumes/relies on the fact that the parent is never accessed after it has been destroyed, but possibly incorrectly? I am still trying to get my head around the logic behind the allocation/destruction order.
In current code, "parent" is assumed to always exist when child objects are alive. On destruction, every object destroys everything that it allocated, thus it usually includes all the objects that were allocated with new some_type_t(..., this, ...)
.
Thanks - would that imply that all objects are essentially destructed (in some sequence) in one go (that is, there are no calls accessing other objects during this sequence)? Would you know of any potential situations in which the assumption would get violated?
All object destruction code (i.e. delete
operators) in generated C++ files is generated only in destructors. There are no stray delete
statements elsewhere. One instantiates whole KSY (or part of that KSY, i.e. a subtype) via new
/ stack allocation in one go, and it generally gets destructed in one go (as stack goes out of scope, or the one who done new
calls delete
on main object).
Just as a try, I have done what I previously proposed: replace all 'owning' raw pointers by unique_ptr
and leave all 'non-owning' pointers untouched. The results (which do seem to successfully pass the existing unit tests) are here:
This is fairly simple to do manually. I have also looked somewhat into the compiler side of things and made some simple adaptions that partly generates these examples. However, as current member variables are always plain pointers, it takes better knowledge than I currently have to determine where to split the generating code into code generating 'owning' and 'non-owning' versions. Hence, I hand-edited the generated code.
Comments are quite welcome.
Cool ;) If that works with existing tests — that's super cool, we can integrate it pretty fast ;) Thanks!
Please give them a spin and have a look. There are no changes in the structure of the generated code, which should make the compiler side of things relatively simple, but non-trivial nevertheless.
Also, I doubt whether these two examples cover all places that need changing in the compiler. If you want me to add some other cases, let me know.
I have taken this forward a bit and produced a C++11 compiler that is able to produce most of the C++ code and -when compiled - passes the corresponding unit tests. I am still struggling with the switch_cast
and type_ternary
cases that both generate code with a call to handleAssignmentSimple
that currently lacks the information to decide whether owning or non-owning pointer assignment is needed. I wonder whether someone could get me some help with this? Code lives in a fork [here].(https://github.com/JaapAap/kaitai_struct_compiler/tree/c++11). (Manual fixes are available and require very minor changes only.)
Thought: maybe it would be possible to have an 'identifier to type' relationship stored somewhere? If so, where would be the best location in the code to fill this table?
I was able to plumb through DataType to places that didn't have it (and id to places that didn't have it) for another thing I was fooling around with.
Scott
On Wed, Sep 20, 2017 at 1:07 AM, Pieter Ober notifications@github.com wrote:
Thought: maybe it would be possible to have an 'identifier to type' relationship stored somewhere? If so, where would be the best location in the code to fill this table?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaitai-io/kaitai_struct/issues/211#issuecomment-330777436, or mute the thread https://github.com/notifications/unsubscribe-auth/AK76SRqgzIvmfbhqUg7snOMycFNn1PXeks5skMfEgaJpZM4Ocqo5 .
I have been playing around with this today and have a working solution. However, it relies on some undesirable coincidences (as far as I can see/understand) in the order that the C++ code is generated. I essentially fill the "id->type" table when an attribute or instance is declared (attributeDeclaration
and instanceDeclaration
, with calls to kaitaiType2NativeType
), but I noticed that some code is generated (e.g. assignments in constructors) that uses the corresponding id while the type is still not stored (but for this code the type is actually not needed). This just feels shaky, so the type should ideally be established at some earlier point. Not sure where though...
I have improved the solution somewhat. Still, the lack of type info in some places is resolved in a rather ad-hoc manner (e.g. using string matching), but I believe that it by now is a good starting point for further evaluation. I have slightly changed the class compiler to make type info available at the earliest possible opportunity (within the constraints of my still limited understanding). Shall I create a pull request so someone with more insight in the internal operations of Kaitai can have a look at it? At least, the solution seems to work well with the existing unit tests and I'd be happy to explain what I have done in more detail would someone care to ask.
@JaapAap I still haven't taken a good look at your work yet, and I'd love to finally get to it (although probably after releasing v0.8). From a very brief glance, I see that your current implementation replaces existing one. This is probably not what we want — we want it to be switchable between "legacy" vs "modern" modes — so it's either passing some sort of argument (in RuntimeConfig, probably) to a CppCompiler, or two distinct CppCompiler classes. Given that your patch introduces lots of new concepts and they're not very trivial, lots of per-member documentation is really welcome too ;)
I am still working with the solution (which should indeed be switchable, but I was unable to find the correct hook for doing so) and still run into some problems. Essentially, for my C++11 solution, more context is needed on the types of instances and attributes. I tried to solve that problem by building a 'database' (effectively just a map) that tracks these types when they are determined in the ClassCompiler. I now ran into the problem that the id of those instances and attributes is not necessarily unique when there are classed embedded in other classes (a case at hand are the root and parent class members, but they could be user-defined too). One potential solution is to pass around the ClassSpec of the class that is being compiled and store all types for class/attribute or class/instance combinations, but this affects a substantial amount of interfaces as the ClassSpec would need to be passed around into many of the CppCompiler calls to retrieve the type information.
As I am struggling with the overall architecture and lack full understanding of the inner workings, I might be missing simpler opportunities, or other, simpler way to get access to sufficient info in the CppCompiler. I would really be greatful to receive any thoughts on this.
Update: I have managed to limit the amount of type info passed around allowing to minimize changes to existing code outside of the CppComiler. All unit tests that pass for the current C++ pass, with the exception of the cast_to_top, which seems the only case in which a pointer to a user-type (cast_to_top_t
) is actually pointed to, but not owned by a member variable (m_header_casted
). My current implementation assumes all member pointers own their pointees with the exception of the pointers to the root and parent structures, and I am not sure how to deal with this special case. I'll pose a question in a separate topic to gain better understanding.
Wow, nice. I admit I completely lost touch with this project for a while - too much work stuff.
I still need to fix some things but it looks like I can get this working soonish (depending on the 'work stuff' ahead). It would be great if someone could create the infrastructure to introduce a separate C++11 option (I am currently working in the existing c++ files), as I really lack the knowledge to find the right hooks within the architecture. Possibly, this would include a separate set of unit tests as well to allow updating the unit tests too, allowing them to use smart pointers etc. Not sure how much work this would be of course.
Ironically, more than a year have passed, but there's some progress on this made recently. I've spent some time trying to reintegrate JaapApp's fork, but eventually just opted to roll out a slightly modular implementation that will keep both C++98, introducing C++11 elements as a switch. Currently, it's live in master, and it currently passes everything except for:
One can observe results in new CI — look for the columns which names start with "cpp_stl_11", or just execute the following in JS console:
app.__vue__.gridColumns = ["name", "cpp_stl_11/gcc4.8_linux", "cpp_stl_11/clang7.3_osx", "cpp_stl_11/clang3.5_linux", "ruby/2.3"]
Ideally, I'd like to add Windows compilation result there too (both in C++98 and C++11) modes. There's a new CLI option, namely:
--cpp-standard <standard>
C++ standard to target (C++ only, supported: 98, 11, default: 98)
and, internally, this options controls whole new set of C++-specific configurations, i.e. CppRuntimeConfig, currently governing usage of:
usePragmaOnce
— to use #pragma once
instead of #ifdef ... #include ... #endif
guardspointers
— with a choice of RawPointers
(as in C++98), UniqueAndRawPointers
(as in C++11 mode), or SharedPointers
(not implemented properly right now)If everyone's still interested, please take a look and tell me what you think of it. I'll try to iron out remaining problematic tests soon.
Will have a look at this soon!
Awesome, I'll try to take a look soon as well!
CPP11 and above smart pointers have advantages, but can be worked around - so long as the CPP98 support doesn't do alloc and exceptions in the constructor leading to leaks.
@kismetwireless Please note that currently we offer --no-auto-read
compilation option, which moves non-smart pointers handling from constructor into distinct _read()
method, which can be wrapped in try-catch block safely — I believe that's exactly what you were talking about.
All C++11 test pass now (not worse than C++98), as evidenced by new CI. The only problem that's left now that I see is checking for memory leaks and resolving those.
Leak check fixed. Valgrind now shows:
Is it possible to make generated C++ code a little bit more cache-friendly?
Add support for fix sized arrays (I assume std::array would be preferable to plain C arrays in most cases) when the size of an array known at compile time.
As of now simple.ksy:
meta:
id: simple
file-extension: simple
endian: le
seq:
- id: please_give_me_array_of_int32t
type: s4
repeat: expr
repeat-expr: 7
will produce:
class simple_t : public kaitai::kstruct {
...
private:
std::vector<int32_t>* m_please_give_me_array_of_int32t;
...
};
It would be slightly better to me at least don't allocate vector on a heap (that would do the construction/destruction a little bit simplier also):
class simple_t : public kaitai::kstruct {
...
private:
std::vector<int32_t> m_please_give_me_array_of_int32t;
...
};
But, frankly, I would prefer:
class simple_t : public kaitai::kstruct {
...
private:
std::array<int32_t, 7> m_please_give_me_array_of_int32t;
...
};
However, this would break backward compatibility. As a compromise, it would be nice to add some support for fixed size repeat, that would produce the code above:
seq:
- id: please_give_me_array_of_int32t
type: s4
repeat: fix
repeat-fix: 7
I assume that can be preferable also in Rust and probably Go.
All apologies if requested features are already present, reading docs and playing with ksy's from library in web-ide haven't let me to discover those.
Is there any reason to store pointer to std::vector instead of just embedding std::vector to generated class? Is there a possibility to resolve the dependency graph and embed a class instance instead of pointer to an instance?
Is it possible to make generated C++ code a little bit more cache-friendly?
Maybe. In general, all your suggestions are valid, but typically come with lots of strings attached and do not work universally in all cases. We might implement them in some time in future as certain narrow case optimization.
Add support for fix sized arrays (I assume std::array would be preferable to plain C arrays in most cases) when the size of an array known at compile time.
It's a complex decision that we may revisit some time in future. I'd like to note:
repeat-expr: 42 + a - a
is effectively always constant and equal to 42, but it takes certain symbolic resolution engine to derive that.std::array
needs to go together with embedding user types directly into it, i.e. std::array<foo_t>
, not std::array<std::unique_ptr<foo_t>>
, otherwise it would be still all on the heap.It would be slightly better to me at least don't allocate vector on a heap (that would do the construction/destruction a little bit simplier also): ... Is there any reason to store pointer to std::vector instead of just embedding std::vector to generated class?
That's also somewhat complex decision, i.e. having std::unique_ptr<std::vector<int32_t>>
vs std::vector<int32_t>
, which boils down to:
->
everywhere and don't distinguish between having to do .size()
on vectors but ->foo()
on all others; this can be alleviated by a major push to use .
everywhere instead of ->
, i.e. by the heavier use of references (&
) instead of pointers (*
).However, this would break backward compatibility
That's correct. We might implement something like that in future with different option for C++ pointers.
seq: - id: please_give_me_array_of_int32t type: s4 repeat: fix repeat-fix: 7
I firmly believe that we need to separate pure declaration of structure vs implementation details only related to parsers generated in certain languages. From data structure perspective, it's all the same: it's just a repetition with certain number of repetitions. We might have annotations and/or config options which fine-tune codegen, but we should not embed these into main language (e.g. diagram output will never make any use of such repeat: fix
).
Is there a possibility to resolve the dependency graph and embed a class instance instead of pointer to an instance?
Again, it might be possible, but we need to think of good solutions for all points outlined above. C++ codegen is already pretty convoluted, and having fine-tuning mechanism that allows to locate certain things on stack will make it even more complex.
It will be also really cool to see if this brings any real performance benefits. My limited experience with C++ so far shows that more often than not it's not worth the trouble.
IMHO: no need for repeat: fix
, repeat:expr
should do the job, we on.y need to check if that a compile-time expression.
I'm taking this issue out of the 0.9 milestone, because @GreyCat said that he would treat (this issue, i.e. https://github.com/kaitai-io/kaitai_struct/issues/211) _for purposes of 0.9 as "ensuring that we have no leaks in CI in both cpp_stl_98 and cpp_stl11". As far as I know, the only leak that's left to fix for C++98 is covered in issue https://github.com/kaitai-io/kaitai_struct/issues/214, which I'm currently trying to eliminate.
Following a discussion in #209, there seems to be some interest in developing a C++ target that will work with all the might options that modern C++ offers us, like
std::unique_ptr
,std::shared_ptr
,std::move
, removing direct invocations ofnew
anddelete
, allocating all the stuff mostly on stack instead of on heap, etc, etc.I invite everyone interested in participating in discussion and getting a solid plan ready, what exactly should change, how declarations, invocations, object creation & destruction, usage, etc, should look like.
Cc @scottmsilver @JaapAap