smarr / are-we-fast-yet

Are We Fast Yet? Comparing Language Implementations with Objects, Closures, and Arrays

C++ version of the benchmark #80

Closed rochus-keller closed 4 months ago

rochus-keller commented 1 year ago

In case anyone is interested, here is a C++98 implementation of the benchmark: https://github.com/rochus-keller/Are-we-fast-yet/tree/main/Cpp.

I have tried to apply the guidelines mutatis mutandis, and made a compromise between how a C++98 developer would have implemented it and staying faithful to the existing Java code, as long as that doesn't waste too much performance. I also tried to avoid unnecessary rewriting. Basically, all decisions are open to discussion; we just have to commit to something.

There is also an existing Oberon+ version, and I'm currently planning a FreePascal version.

smarr commented 1 year ago

@rochus-keller would you want this port to be integrated?

A student is currently also working on a C++ port of the benchmarks. He has already made good progress, but isn't quite done yet.

One of the questions we wanted to look at is what reasonable memory management strategies are. At some point, we might want to compare the different approaches to your port as well. I quickly looked at Storage and Towers. It looks to me like you're doing explicit management, using destructors to deal with subtrees. Did you encounter any benchmarks that needed other strategies?

Thanks!

rochus-keller commented 1 year ago

would you want this port to be integrated?

It's up to you; I use it as-is. For integration with your harness, additional work would be necessary.

A student is currently also working on a C++ port

So I guess it will be in some recent C++ standard version, which is useful in any case; my port is in C++98, since everything from C++11 onwards (with some exceptions in C++17 and C++20) is essentially syntactic sugar and implementable in C++98, which can therefore be seen as the core language. I look forward to seeing the measured performance differences between the two C++ benchmark implementations.

what reasonable memory management strategies are

I used all sorts of manual memory management strategies: plain new/delete/delete[], reference counting, and helper arrays where I just put the instances. I also changed the API in some places to avoid Vector copies, and I kept dynamic array allocations (instead of stack allocation) in some places (when not too expensive) to remain similar to the Java implementation from which I derived mine. I also allowed a Vector with no initial storage allocation, to avoid expensive copy construction when using Vector by value.

I didn't use standard library features for memory management because of your guidelines and because they give no advantage (just memory overhead).

you're doing explicit management, using destructors to deal with subtrees

Right; that's idiomatic in C++.
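
A minimal sketch of the pattern, with hypothetical names (not the actual benchmark code): each node owns its children, so deleting the root releases the whole subtree.

```cpp
struct TreeNode {
    TreeNode* left;
    TreeNode* right;
    TreeNode() : left(0), right(0) {}
    ~TreeNode() { delete left; delete right; } // delete on a null pointer is a no-op
};

int main() {
    TreeNode* root = new TreeNode();
    root->left = new TreeNode();
    root->right = new TreeNode();
    delete root; // recursively frees the entire subtree
    return 0;
}
```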

Did you encounter any benchmarks that needed other strategies?

As described above.

EDIT: Dictionary, JsonValue, List/Element and Richards/RBObject use reference counting, mostly because it was straightforward to implement; I assume one could also just call delete in the destructor if need be, but I did some measurements and found that the overhead of reference counting compared to no delete at all is only about 13%. DeltaBlue and Havlak use Vectors to remember the objects to delete after the run and then just iterate over them. I verified with Valgrind that the implementation doesn't leak memory. I don't use any data structures or algorithms from the standard template library; I did an experiment with std::vector in Storage and found it to be significantly slower than my present implementation (I didn't investigate the reason); the only standard library features I use in the benchmark are std::cout, std::cerr and std::string.
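
To make the reference-counting part concrete, here is a minimal sketch of the intrusive scheme I mean, with hypothetical names (the classes in the port differ in detail); the count lives in the object itself, so no extra allocation is needed:

```cpp
class RefCounted {
    int refs;
public:
    RefCounted() : refs(0) {}
    virtual ~RefCounted() {}
    void addRef() { ++refs; }
    void release() { if (--refs <= 0) delete this; } // self-deletes at zero
};

struct Element : RefCounted { int value; };

void usage() {
    Element* e = new Element();
    e->addRef();   // owner takes a reference
    // ... use e ...
    e->release();  // deletes e when the count drops to zero
}
```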

smarr commented 11 months ago

[plot overview-1: run-time comparison of the benchmark implementations across languages]

I finally managed to run the numbers. I have to say I am a bit disappointed in C++. I would have expected it to be faster, but perhaps the benchmarks are a little too far out of its comfort zone.

The RK ones are yours. HJ is my student's code. Everything is run with -O2 flags and the same settings across the languages. I also changed your harness to match the standard harness.

rochus-keller commented 11 months ago

That's interesting, thanks.

I'm pretty astonished how fast Temurin is. And Node.js 20.3 seems to be a bit slower than 12.16.

Which GCC and CLANG versions did you use?

Is the C++ code of HJ somewhere accessible?

I'm actually considering a C version of the benchmark; or do you already have a student working on it?

Did you have a look at my code otherwise? Do you think it meets the benchmark guidelines? Should I rather replace the dynamically allocated arrays with fixed-size stack arrays, e.g. in Queens::queens (which would increase the difference to the Java version)?
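
For concreteness, the two alternatives in question, sketched with illustrative names (not the exact benchmark code):

```cpp
void queensWithHeapArray() {
    bool* freeRows = new bool[8]; // closer to Java's `new boolean[8]`
    // ... solve ...
    delete[] freeRows;
}

void queensWithStackArray() {
    bool freeRows[8]; // fixed-size stack array, the usual C++ choice
    // ... solve ...
}                     // nothing to clean up
```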

rochus-keller commented 11 months ago

Meanwhile I found HJ's code: https://github.com/smarr/are-we-fast-yet/blob/wip/cpp/benchmarks/C%2B%2BHJ I assume a lot of the performance difference is due to most objects being heap-allocated and the prevalent use of shared_ptr. I will try to run it on my machine and check it with Valgrind.

smarr commented 11 months ago

Which GCC and CLANG versions did you use?

Ehm, it's a bit of a mess. I should have mentioned that... it's a bit quick and dirty... I haven't done any benchmarking with these before. So, it's whatever is on the three machines.

Each benchmark is run on exactly one machine, but yeah, the compiler versions vary between benchmarks:

- g++ 9.4.0, 11.4.0
- clang++ 18.0.0, 14.0.0, 10.0.0

The version of the code you found is my import of his code, with small modifications. And indeed, he wanted to go with a naive version first that consistently uses shared_ptr.

Did you have a look at my code otherwise? Do you think it meets the benchmark guidelines? Should I rather replace the dynamically allocated arrays with fixed-size stack arrays, e.g. in Queens::queens (which would increase the difference to the Java version)?

No, I haven't really yet.

Generally, the things we discussed were roughly along the following lines:

- if it's an allocation in Java in a method, it can be stack or heap allocated, whatever makes most sense
- if it's something stored in a field, you may choose not to use a pointer, but have it part of the outer object (I am not quite fluent in C++ terminology)
- generally it's fair to do the "C++ thing" as long as it preserves the optimization challenges to some degree; so, turning a field/member into a local variable would not be good

rochus-keller commented 11 months ago

Ok, I see. I use GCC 4.8 on my test machine for everything (and everything I reported in https://github.com/rochus-keller/Oberon/tree/master/testcases/Are-we-fast-yet was also run on that same machine).

I now ran HJ's version on my machine as a C++11 application. To make this work I replaced all std::any by int (little influence on performance). I also had to fix Storage, which was actually not called in the Harness and produced a wrong result (I just copied my code). Here is what I get: [results plot]

This seems to correspond to your measurements.

It seems HJ's code spends most of its time on allocation/deallocation. The C++ standard allocator is known to be rather slow. In case of shared_ptr it's even worse since shared_ptr does an extra heap allocation for the ref counts. I think no class is used by value; everything seems to be heap-allocated. I also think that there are unnecessary vector copies where the compiler doesn't automatically infer move operations, but I have to check this in the debugger.
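
For reference, a minimal C++11 sketch of that extra allocation (Task is a placeholder type, not from HJ's code): constructing a shared_ptr from a raw pointer allocates the control block for the reference counts separately, whereas std::make_shared fuses object and control block into a single allocation.

```cpp
#include <memory>

struct Task { int id; };

int main() {
    // Two heap allocations: one for the Task, one for the
    // control block that holds the reference counts.
    std::shared_ptr<Task> a(new Task());

    // One combined allocation for object and control block.
    std::shared_ptr<Task> b = std::make_shared<Task>();
    return 0;
}
```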

In my version, e.g. Havlak and DeltaBlue spend most of their time in allocation/deallocation as well, since these algorithms depend on polymorphic objects; I could replace the allocation scheme with a special allocator or memory pooling, but that would no longer be idiomatic and would contradict e.g. the Java implementation; so I think this is the cost of C++. I also found that Richards spends nearly half of its time in dynamic_cast, which is rather surprising, but yet again the cost of C++; I could replace dynamic_cast with static_cast to gain nearly a factor two, but that would be cheating.

smarr commented 11 months ago

One thing I missed before:

I'm actually considering a C version of the benchmark; or do you already have a student working on it?

No. If another student wanted to go with a low-level language, I would probably try to get them to use OCaml or Rust or the like.

This seems to correspond to your measurements.

Yes, in the full report, results for the individual benchmarks look similar for me, too.

In case of shared_ptr it's even worse since shared_ptr does an extra heap allocation for the ref counts.

Yeah... it was a strategy to get started, and then things took longer to get correct than hoped... we will have to see whether we get to the point of optimizing further. Same with the vector copies. He has a long list of things I noticed from mere code inspection. Though, I'd rather he writes up and has text to submit at the end of the project, which is very soon...

I could replace dynamic_cast with static_cast to gain nearly a factor two, but that would be cheating.

Hm. What's your reasoning here? If you know by construction that it's ok and correct?

smarr commented 11 months ago

I also had to fix Storage

Ah, and yes, the Storage benchmark is known to be broken. He was still working on that one.

rochus-keller commented 11 months ago

Hm. What's your reasoning here? If you know by construction that it's ok and correct?

First of all, dynamic_cast corresponds to the casts done in Java; if the cast doesn't work in Java, an exception is thrown; here we would have to check whether dynamic_cast returns nullptr and throw an exception ourselves if so.

Why not static_cast: the methods get an RBObject pointer but call methods of a subclass, and we cannot be sure that an instance of this subclass is actually passed to the method. A better approach would be to declare the method with the subclass instead of RBObject (if possible), or to add the required interface for each use case to RBObject, so we could call polymorphically without an explicit cast.
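
Concretely, the Java-equivalent checked cast would look roughly like this (a sketch with stubbed-out classes; the real RBObject hierarchy is in Richards):

```cpp
#include <stdexcept>

struct RBObject { virtual ~RBObject() {} };
struct Packet : RBObject { /* ... */ };

// Mirrors an implicit Java downcast: throw instead of silently
// returning a null pointer.
static Packet* asPacket(RBObject* o) {
    Packet* p = dynamic_cast<Packet*>(o);
    if (p == 0)
        throw std::runtime_error("ClassCastException");
    return p;
}
// A static_cast here would skip the runtime check entirely and
// yield undefined behavior if o were not actually a Packet.
```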

Ah, and yes, the Storage benchmark is known to be broken.

Did vector<void> actually compile with the student's compiler version?

Yes, in the full report, results for the individual benchmarks look similar for me, too.

Can I access the full report somewhere? I would like to know the exact factor between my implementation (GCC version) and Temurin. The JVM is known to have a very sophisticated allocator/collector, which might be the main reason why the C++ version is only 40% (?) faster, but that comes at the cost of much higher memory consumption.

rochus-keller commented 10 months ago

Update concerning benchmark guideline conformance.

You said:

Generally, the things we discussed were roughly along the following lines:

- if it's an allocation in Java in a method, it can be stack or heap allocated, whatever makes most sense
- if it's something stored in a field, you may choose not to use a pointer, but have it part of the outer object (I am not quite fluent in C++ terminology)
- generally it's fair to do the "C++ thing" as long as it preserves the optimization challenges to some degree; so, turning a field/member into a local variable would not be good
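
For illustration, a minimal sketch of the second point, with hypothetical class names: the field becomes part of the outer object instead of living behind a pointer.

```cpp
struct Point { int x, y; };

// Java-like: the field is a separate heap object behind a pointer.
struct BallWithPointer {
    Point* pos;
    BallWithPointer() : pos(new Point()) {}
    ~BallWithPointer() { delete pos; }
};

// The allowed C++ alternative: the field is embedded in the outer
// object, so there is no separate allocation and no cleanup.
struct BallByValue {
    Point pos;
};
```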

I made a branch of my C++98 version with some modifications you now explicitly allowed. Here is the commit: https://github.com/rochus-keller/Are-we-fast-yet/commit/63ab1e46e0bbc474feb6068c80947d0a73e5b63a.

The effect on overall performance is negligible, not actually worth the effort. Here are the results: [results plot]

Permute got ~20% faster and Queens ~50% faster, but Bounce and Json became slower, the former by ~20%.

smarr commented 4 months ago

Closing with #84 being merged. Thanks for the discussions.