Add C++17 Port - Githubissues

smarr / are-we-fast-yet

Are We Fast Yet? Comparing Language Implementations with Objects, Closures, and Arrays

Other

334 stars 36 forks source link

Add C++17 Port #84

Closed smarr closed 7 months ago

smarr commented 8 months ago

This PR adds a C++ port, using C++17 language and library features.

Below, I am also adding the corresponding rules and guidance, which is still open for discussion.

The first version of the code in this PR was written by @Hjeanningros and can be found here.

@rochus-keller also provided valuable inspiration with his earlier port here.

The GitHub CI setup currently doesn't run style checks, because of some issue with the config format, probably using the wrong clang version. The GHActions Ubuntu image is also updated to 22.04, except for SOMns, which requires Python 2 because of an old mx version.

C++ Guidelines

With C++, we add support for the first language that does not have garbage collection, which comes with a new dimension of issues to be considered.

Explicit Memory Management Rules:

benchmarks have to run without memory leaks
stack allocation can be used where natural and where it does not change the nature of the benchmark, for instance when an object/array is used only in the dynamic scope of a function
existing data structures should be used where possible to manage dynamically-created objects, for instance, iterate over an already existing list of objects at the end of the benchmark or a method to free them
changes to code structure and APIs should be as minimal as possible, for instance, if a method returns an allocated object, leave it as such, and let the caller manage the memory
if the useful lifetime of object fields is restricted to a method, the allocations referenced by these fields should be freed before the end of a method, but the field should remain a field.
for arbitrary object graphs, as in DeltaBlue, memory/object_tracker.h can be used to free the objects when not needed. The use of shared_ptr may also be appropriate, but did not work for DeltaBlue.

Memory Management Strategies Per Benchmarks:

CD use value objects for most data. Since it's a tree, the red/black tree is trivially managed by deleting the nodes from the root. Vectors are managed explicitly for the voxel map. Don't miss the empty vectors that are not passed on as result though.
DeltaBlue uses object_tracker, since there are cyclic dependencies, but we can free the full setup once it's not needed. A mix of shared_ptr and weak_ptr would probably also work.
Havlak manages memory explicitly by assigning ownership to specific classes. Specifically, the ControlFlowGraph owns the basic blocks and block edges, the LoopStructureGraph owns the loops, the HavlakLoopFinder owns its data including UnionFindNodes. Thus, the destructors can free the corresponding memory.
Json relies on JSON documents being trees, and uses the tree to free objects. The major tradeoff here is that we need to allocate true, false, and null literal objects to have a uniform memory representation. Though, otherwise, we do not require any management overhead.
Richards uses object_tracker for simplicity. It could use shared_ptr and accounting for cyclic references that would work, too. A naive using of the task list did not seem to work, but I might have missed something.
Bounce allocates everything statically, i.e., on the stack.
List trivially uses the list structure for freeing the list.
Mandelbrot does not allocate any data structures.
NBody allocates everything statically, i.e., on the stack.
Permute allocates an array dynamically, and frees it directly. Since the benchmark holds the reference in a field, and allocates on each iteration, the new/delete dance is needed to comply
Queens allocates its arrays dynamically, and frees them directly, same as Permute.
Sieve allocates everything statically, i.e., on the stack.
Storage allocates its tree dynamically, and frees it from the root.
Towers allocates the disks dynamically, which form a linked list, that is used to free them once not needed anymore.

General C++-isms:

the benchmarks, where possible, can be in headers only to match the code structure of other languages
we use clang-tidy and clang-format
use std::array for fixed-sized arrays
use const where it is appropriate, but it won't really work with containers and can be problematic for value classes
use auto and auto* to make code more concise as recommended by linter, for instance for allocations
use annotations like [[nodiscard]] where indicated by the linter
use modern C++-isms, for instance range loops and .at instead of [] on std::array
use initializer syntax for default values and member initializers lists when depending on constructor parameter
prefer int32_t, size_t and similar to be more explicit about semantics and size of value, plain int/long shouldn't be used
avoid changing signatures for the sake of the compiler. It should do an appropriate return-value optimization itself.

eregon commented 8 months ago

Nice! Would you have any benchmark numbers and how it compares to other languages?

smarr commented 8 months ago

The benchmarking infrastructure is currently blocked by a student who wants to submit their PhD as soon as possible. So, coming soon...

I am also still working on some odd performance numbers compared to the other two C++ versions. So, yeah, soon...

rochus-keller commented 8 months ago

Very interesting, thank you.

I just run your and my version of the benchmark on my Windows 11 machine (Intel Core i3-1215U 1200MHz, 8GB RAM) with MinGW x64 12.2. Here are the results:

grafik

Both perform nearly the same with geomean. Json and Havlak likely seem to do additional copies in your version, but I didn't check the source code yet.

smarr commented 8 months ago

Json and Havlak likely seem to do additional copies in your version, but I didn't check the source code yet.

I did look a little at Havlak, but couldn't really see much in a basic time profiler. Though, I noticed that the Vector implementation in your version is very different from the normal one. The growing strategy and generally functionality deviate, which may result in different performance. BasicBlock is also supposed to have a virtual customHash() method, which seems to be solved with template programming in your version.

rochus-keller commented 8 months ago

My major changes to the Vector implementation compared to Java was that the default constructor no longer does an initial allocation (so creating Vectors becomes cheaper, especially when embedded in classes by value), and the operator= does no extra allocations. Also note that I used my original version of the code (i.e. the main branch), not the version where I replaced the dynamic by static arrays (see https://github.com/smarr/are-we-fast-yet/issues/80#issuecomment-1674872410).

EDIT: the hash function is likely insignificant; my version uses a functor passed by template argument, so my implementation makes a function call and yours (assumingly) a method call; I thus spare the dispatch, which is very fast in C++.

smarr commented 8 months ago

My major changes to the Vector implementation compared to Java was that the default constructor no longer does an initial allocation (so creating Vectors becomes cheaper, especially when embedded in classes by value), and the operator= does no extra allocations. Also note that I used my original version of the code (i.e. the main branch), not the version where I replaced the dynamic by static arrays (see #80 (comment)).

Hm, I am specifically looking at this: https://github.com/rochus-keller/Are-we-fast-yet/blob/main/Cpp/som/Vector.h#L34-L46

So, here, I am concerned about the following

a different code structure (enlarge is in no version of AWFY)
different field semantics (length vs capacity, and the correspondingly different checks of length vs last - first)
different growing factor 2 vs 2+50
indeed initial size 50 vs 0 seems problematic
at() throws instead of returning a nullptr

If we aim to expose the compilers/optimizers to the same challenges, these changes seem problematic to me, because I would naively assume that they change what the programs are doing.

rochus-keller commented 8 months ago

I would assume that the only performance relevant point is 4) the initial size of 0 vs. 50, which calls 50 times the default constructor with no need, if elements are by value. This adds up when Vector is used by value in other classes. The other points make likely less than 10% difference. But I will eventually go over my code again, as soon as your version is official, so that we can compare "modern" with "modern modern" C++. What is actually your official target C++ version (didn't check yet)?

smarr commented 8 months ago

I would assume that the only performance relevant point is 4) the initial size of 0 vs. 50, which calls 50 times the default constructor with no need, if elements are by value.

Indeed, changing this and removing the extra +50 on the growth brings the two versions roughly on par.

Though, the "no need" discussion is in my mind another discussion we may want to have than rule compliance. Same with the value "50". I have no recollection where it is coming from, and it seems high. So, one could think about changing all implementations to not allocate storage initially, or use a conservative size of 3 or 10. But I think that's something that would need to be applied to all languages.

The other points make likely less than 10% difference.

Yeah, I am not claiming it makes a difference in practice, I am merely claiming that these are differences, and at least in my mind, not fully rule compliant. For language implementations that use interpreters, any change of code structure can have a major impact, because for instance function calls have a noticeable cost.

But I will eventually go over my code again, as soon as your version is official, so that we can compare "modern" with "modern modern" C++. What is actually your official target C++ version (didn't check yet)?

Well, "official"... I mean, I am happy to have these types of discussions to make sure things "make sense".

I aimed for C++17 as language target, also when choosing things like std::any and std::optional. Originally, I was also hoping to use modules, but C++20 doesn't seem to be fully supported yet. Though, it might be good enough for what's needed here.

rochus-keller commented 8 months ago

Though, the "no need" discussion is in my mind another discussion we may want to have than rule compliance. Same with the value "50". I have no recollection where it is coming from, and it seems high.

But keep in mind that C++ is the first language in your benchmark where objects can be either reference or value types. The issue with the 50 preallocated slots is strongly connected with using objects by value and the fact that C++ generates a couple of methods for every class (e.g. a constructor) even if there isn't one in the code. I recommend that this is considered in your rules, otherwise you risk to end up with code no experienced C++ programmer would write. The "+50" have likely very little influence on performance; it's just to accommodate for newLength==0; we could also write if( newLength == 0 ) newLength = 50; else newLength *= 2; instead if you prefer.

C++20 doesn't seem to be fully supported yet

Even your C++17 code didn't compile with my pretty recent Visual Studio compiler, why I switched to MinGW.

smarr commented 8 months ago

Json and Havlak likely seem to do additional copies in your version

I now had a look at Json. The key difference here seems to be that the current character is represented by a char. This optimization explains the difference in performance (I tested this on my version). The other differences seem minor, though, worth noting: I didn't optimize true, false, null literals to be able to treat the json structure as a pure tree and easily delete/free it.

The whole char discussion also already came up in the Java version, and back then, we decided that this optimization is not in the spirit of the rules. So, Java also uses the plain String instead of it's char.

smarr commented 8 months ago

otherwise you risk to end up with code no experienced C++ programmer would write

But it's really a major change of behavior. Yes, there's a goal to make sure the code is "idiomatic", but that can't come at the cost of behavioral changes. I am very happy to consider changing the Vector behavior in general, but I strongly believe that this would need to be applied consistently. The fortunate thing is that this should be a local change to Vector, so, it shouldn't have too many unintended consequences in terms of code. Though, it will have consequences in terms of what the benchmarks measures.

Even your C++17 code didn't compile with my pretty recent Visual Studio compiler, why I switched to MinGW.

Hm, what does it complain about? https://en.cppreference.com/w/cpp/compiler_support/17 suggests that MSVC version 19.10 introduced at least std::any and std::optional.

smarr commented 8 months ago

Hmm, well, the other question one might want to ask is whether Vector should really be treated as a value. I do stack allocations for it in a number of places, but I think I rarely use it as a true value. The few places where I do, it's fields in objects that always allocate it. In CD and Havlak where it is used in a Vector itself, I did only use pointers.

rochus-keller commented 8 months ago

The whole char discussion also already came up in the Java version, and back then, we decided that this optimization is not in the spirit of the rules. So, Java also uses the plain String instead of it's char.

C++ doesn't even have a switch statement which can compare strings. It will be pretty difficult to convince the C++ community to use a string of length one instead of a char. I did comparisons (see https://github.com/rochus-keller/Are-we-fast-yet/blob/2f44088dfc0500a597cee3f1f9a165502c26b28b/Cpp/Json.cpp#L33) and decided for char mostly because I was interested in comparing my different language implementations with a decently credible C++ implementation; so I left the design similar to Java als long as not too costly, but in this case using a string instead of char would have made my performance claims implausible (I have even received very nasty comments for the parts I left like Java, even though the impact on performance was small).

rochus-keller commented 8 months ago

Yes, there's a goal to make sure the code is "idiomatic", but that can't come at the cost of behavioral changes. I am very happy to consider changing the Vector behavior in general, but I strongly believe that this would need to be applied consistently.

The original design assumes a language where objects are "by reference" and optimizes for such a language in that there is an inital length of 50. Allocating the 50 for those languages causes no additional cost. In C++, we could (and maybe should if you insist on maximum similarity with the original design) also use all objects by reference. But if you allow for "objects by value", then the rules should also consider constructors and assignment operators.

EDIT: I would rather consider to change the original design, so that every Vector implementation starts with length 0.

Hm, what does it complain about?

Didn't even look at the errors, just switched. I think my compiler is also version 17 i.e. MSVC version 19. But using MinGW instead is no problem from my perspective. EDIT: I will try again and check what's the problem.

rochus-keller commented 8 months ago

the other question one might want to ask is whether Vector should really be treated as a value

That decision should then affect all classes. You could decide to fully do without profiting from the capability of C++ to use classes by value (either on stack or embedded als fields in other objects) in favour of comparability. But that would be very difficult to sell to the C++ community.

I do stack allocations for it in a number of places, but I think I rarely use it as a true value. The few places where I do, it's fields in objects that always allocate it. In CD and Havlak where it is used in a Vector itself, I did only use pointers.

I used classes by value whereever it makes sense, even when a Vector is a value of another collection. I did the same for my Oberon+ implementation, because from my point of view that's an essential advantage compared to languages like Java or Smalltalk. So if the benchmark should demonstrate the full potential of languages like C++ or Oberon compared to more dynamic languages, using classes by value should be allowed.

smarr commented 7 months ago

https://github.com/smarr/are-we-fast-yet/pull/90 proposes a change to Vector to initialize the array lazily.

smarr commented 7 months ago

The whole char discussion also already came up in the Java version, and back then, we decided that this optimization is not in the spirit of the rules. So, Java also uses the plain String instead of it's char.

C++ doesn't even have a switch statement which can compare strings. It will be pretty difficult to convince the C++ community to use a string of length one instead of a char.

I appreciate that single-char strings might not be idiomatic to C++ programmers. Though, the concern of comparability is more important to me in this cases. We want to expose the different compilers to the same challenges, or at least as close as we can get to the same challenges given inherent language differences.

I will adapt the list of abstractions of our imaginary Core Language accordingly, and explicitly exclude char as an abstraction in that language: https://github.com/smarr/are-we-fast-yet/pull/91

rochus-keller commented 7 months ago

single-char strings might not be idiomatic to C++ programmers...Though, the concern of comparability is more important to me in this cases.

Isn't this a similar issue like the "by value" vs "by reference" topic? I understand the rule when it comes to core language vs. standard library features. But shouldn't the features of the core language be honored when available, especially when performance relevant? A language designer might decide for valid reasons to only offer dynamic strings and no char data type, or a char data type which is a dynamic object as well; but should we then "punish" (in that the possibilities of the language are artificially limited) all languages which have a native char datatype and can thus avoid allocations? Isn't that exactly the result we expect to see when comparing static vs dynamic languages?

smarr commented 7 months ago

single-char strings might not be idiomatic to C++ programmers...Though, the concern of comparability is more important to me in this cases.

Isn't this a similar issue like the "by value" vs "by reference" topic?

Yeah, so, that's a bit of a debate to be had. When I thought about it, I ended up on "the other side" I suppose.

My reasoning was that by-value vs by-reference is something a compiler/runtime system may optimize at least to a degree. Of course, many caveat apply. Though, I am thinking here of escape analysis, scalar replacement, automatic object inlining, and storage strategies. Compared to the char vs string example, these are transparent optimizations. Some runtime system will indeed also be able to chose char as a runtime representation for single character strings. This is certainly something I could implement in TruffleSOM quite easily. However, it's an optimization the runtime needs to make, because it changes what exactly can be represented and which operations can be performed.

But shouldn't the features of the core language be honored when available, especially when performance relevant?

Well, only to a degree. The most important principle for me is comparability, and "idiomatic language use" comes for me only after that.

but should we then punish all languages which have a native char datatype and can thus avoid allocations?

Why is this a punishment? There may be perfectly valid reasons to end up with idiomatic code that has such challenges, i.e., using std::string instead of char that would benefit from a compiler optimizing things to be closer to char-based performance.

Isn't that exactly the result we expect to see when comparing static vs dynamic languages?

No? And to me that's also not the question. Instead, I want to see where compilers can learn from each other.

smarr commented 7 months ago

Perhaps to put it another way: One could ask the question of what can we squeeze out of a language when using all its specific tricks. I believe that's roughly the question behind the Computer Language Benchmarks Game.

Though, I'd rather try to turn the question on its head and ask: what do we need to teach our compilers to get the best possible run time performance for code written in a "reasonable subset" that happens to be comparable across different languages.

Unfortunately, an entirely different question is then "How relevant is this subset to application performance?". I am not sure how to really approach that question though.

rochus-keller commented 7 months ago

The most important principle for me is comparability, and "idiomatic language use" comes for me only after that.

This inevitably leads to the question of what we expect from a cross-language benchmark. Since programming languages can differ not only in their syntax, but also in their fundamental design concepts, for me personally it is very interesting what the cost is of these specific design concepts. E.g. treating everything as objects exchanging messages looks like a very attractive approach; but I would like to understand the trade-off; Awfy gives me a tool to quantify this tradeoff; of course this only works if each implementation of the benchmark actually makes use of the design concepts of the specific language in use. I could e.g. see how the same application performs in Ada, which is interesting because Ada offers means to avoid dynamic allocation alltogether, combined with strict typing. If the Ada implementation would be forced to look like Java because of comparability rules, the comparison would not be helpful, because this wouldn't make use of the specific design concepts Ada offers. On the other hand, neither a fragile construct making use of all compiler specific backdoors no Ada developer in his right mind would ever write, just to be as fast as possible, would neither lead to a helpful comparison. That's why I like the approach of Awfy, to just use language features and implement collections as part of the benchmark, not using libraries the design and optimization of which is not controllable nor comparable. The goal is to understand the cost and benefit of language features, not to be as fast as possible (as in the CLBG, which lacks specific benchmark rules). Therefore, from my point of view, the benchmark is most useful, if a language can be used as conceived by its designer. Since it is not known what languages will exist in future and used to implement the benchmark, the rules must be adapted to accomodate the specific design concepts of each new language. Even different versions of a language can require different benchmark rules, as e.g. with C++; and there might even be different "schools" per language or version that have completely different views on how the language should be used (as again with C++). The benchmark should therefore also allow for specific implementations for different schools, insofar as they are clearly identifiable and specifiable. Only in this way I could e.g. compare my "modern" C++98 approach with a "modern modern" policy based header-heavy C++-17 design, or understand why by the end of the day a C++ implementation is actually faster than an equivalent Smalltalk implementation. If the benchmark forces C++ to look like Java or JavaScript instead, the result is not representative.

Sorry for the long response.

PS: concerning the JSON benchmark and "current" as char vs std::string, the conclusion that the former is twice as fast as the latter is a valuable result from my point of view and should be possible to achieve within the benchmark rules.

PS2: most modern languages support some kind of conditional compilation which could be used to maintain a "core language version" in parallel with as many versions using specific features of the language at hand as required.

smarr commented 7 months ago

This inevitably leads to the question of what we expect from a cross-language benchmark. Since programming languages can differ not only in their syntax, but also in their fundamental design concepts, for me personally it is very interesting what the cost is of these specific design concepts. E.g. treating everything as objects exchanging messages looks like a very attractive approach; but I would like to understand the trade-off; Awfy gives me a tool to quantify this tradeoff; of course this only works if each implementation of the benchmark actually makes use of the design concepts of the specific language in use. I could e.g. see how the same application performs in Ada, which is interesting because Ada offers means to avoid dynamic allocation alltogether, combined with strict typing. If the Ada implementation would be forced to look like Java because of comparability rules, the comparison would not be helpful, because this wouldn't make use of the specific design concepts Ada offers. On the other hand, neither a fragile construct making use of all compiler specific backdoors no Ada developer in his right mind would ever write, just to be as fast as possible, would neither lead to a helpful comparison. That's why I like the approach of Awfy, to just use language features and implement collections as part of the benchmark, not using libraries the design and optimization of which is not controllable nor comparable. The goal is to understand the cost and benefit of language features, not to be as fast as possible (as in the CLBG, which lacks specific benchmark rules). Therefore, from my point of view, the benchmark is most useful, if a language can be used as conceived by its designer. Since it is not known what languages will exist in future and used to implement the benchmark, the rules must be adapted to accomodate the specific design concepts of each new language. Even different versions of a language can require different benchmark rules, as e.g. with C++; and there might even be different "schools" per language or version that have completely different views on how the language should be used (as again with C++). The benchmark should therefore also allow for specific implementations for different schools, insofar as they are clearly identifiable and specifiable. Only in this way I could e.g. compare my "modern" C++98 approach with a "modern modern" policy based header-heavy C++-17 design, or understand why by the end of the day a C++ implementation is actually faster than an equivalent Smalltalk implementation. If the benchmark forces C++ to look like Java or JavaScript instead, the result is not representative.

The key point I take from all these considerations is that there can't be a single true set of rules that covers all interesting questions. And in many ways, I think that is natural, because all the different aspects can't be reduced to a single number in a useful way. (Though, it also then raises the question of whether the benchmarks currently included cover enough ground. But let's put that aside for the moment.)

Since you mention collections, there's a variant of AWFY for the dynamic languages. It's existence in a way acknowledges the problem. This version uses the builtin collections as one would in an idiomatic implementation, instead of relying on a common core library: https://github.com/smarr/are-we-fast-yet/tree/awfy-dynamic-languages

I think this is a very useful variant, with more "idiomatic" code. However, this comes at a cost: we cannot guarantee anymore that execution is deterministic, because relying on builtin hash maps for instance, my use non-deterministic hash criteria, like object memory addresses.

In a similar vain, the question of "fully heap allocated" vs. "value/stack allocation" only in a context of C++, Ada, or other languages makes perfect sense. And indeed, because the languages may make such concepts explicit, it may not make sense to teach the compiler all possible tricks, and instead rely on the developer to make educated decisions. Of course, one could have both versions/style and compare how far compilers can take it.

With this all said, I think for the "standard" AWFY benchmarks, that is, the main branch, I would like to make a reasonable-enough decision, which walks an interesting enough path between the extrema. Simply because I think we can't really cover all possible variants.

concerning the JSON benchmark and "current" as char vs std::string, the conclusion that the former is twice as fast as the latter is a valuable result from my point of view and should be possible to achieve within the benchmark rules.

My perspective is still that it is more useful for my purposes to have a single well-enough defined and small core language.

most modern languages support some kind of conditional compilation which could be used to maintain a "core language version" in parallel with as many versions using specific features of the language at hand as required.

I think there's indeed room for various versions, though, I personally do not want to volunteer to maintain those additional versions. There are too many perfectly reasonable goals one may have, but I think, for my own sanity, I need to stick to a very small set, that I realistically can handle...

smarr commented 7 months ago

The PR is now merged. Things can always be changed at a later point.