Closed MaxSagebaum closed 1 month ago
I meant still at compile time, though. What I was wondering was whether the same things that are being folded by template instantiations could also be folded by generating code that delegates common implementations to shared compile-time (constexpr) functions. I don't have a concrete example though, just brainstorming.
I am think about a solution along the same lines. But until now nothing has come up. We currently have two kinds of tokens in the expressions: stateless and statefull. Stateless tokens are like a character match, they are just simple if conditions that advance the iterator to the next position. Statefull tokens need to retain some information when they evaluate something. The alternative |
needs to know which one was the last. If the current branch comes back unmatched it needs to switch to the next.
Currently, the compiler manages the state for us in the stack frames and tries to optimize it. The other option would be to implement the state management manually. But this would implement something various libraries have already done and spent a lot of time optimizing.
I'm about to push a commit that completes the two optimizations I had in mind for get_declaration_of
: one to remove repeated linear rescans of the symbol table (which I think made it roughly quadratic), and one to cache the results of calls to get_declaration_of
with the same arguments. Together, they seem to make a pretty dramatic improvement.
It took a few days because it turns out that caching the answers exposed some corner cases where calling g_d_o
with the same arguments at different times gave different answers(!). Fortunately the "wrong" answers were in corner cases that seem to have not affected anything in the regression tests, until they were cached which used earlier "wrong" answers in exciting new places that did care about their wrongness. Anyway, there was some bug-fixing to do as well, but I think I got them all and now the results pass regression tests again with caching enabled.
There shouldn't be any cases where these optimizations change behavior, but if you think that might be happening please try running with the -_debug
flag (note the underscore, this is an internal flag that's different from the documented -debug
flag), which will execute both the "before" and "after" g_d_o
logic and abort with a crash dump if they give different answers... shouldn't happen, but if it ever does please let me know.
Commit pushed... I forgot to make the commit reference this PR so here is a link: 6dbd1baf511cc4ea73f527ee7e8da23f2dd3a80f
Ok, I will check later. Was this an all-nighter?
Looks really good :-)
gen | compile | |
---|---|---|
Current (8bdd23d) | 130.09 | 254.51 |
with performance update (6a2a809) | 90.09 | 253.7 |
with second performance update (6dbd1ba) | 10.65 | 236.9 |
Whew😌, great!🥳
I know the major perf issue you're investigating is in the Cpp1 compiler. But with the g_d_o
hot spot out of the way, please do let me know if you notice any next hot spot in cppfront... if there's some other function that's taking a lot of time I'd be happy to take a look!
PS: When you run with cppfront -verbose
, what does it report as the total time vs time in g_d_o
? Is g_d_o
still a significant fraction of the time?
It still seems to be the main driver for the cppfront time:
time cppfront_opt -no-c -no-n -no-s -c -verbose build/pure2-regex-perl.cpp2 -o build/pure2-regex-perl.cpp
build/pure2-regex-perl.cpp2... ok (all Cpp2, passes safety checks)
Cpp1 0 lines
Cpp2 2,434 lines (100%)
Time 10,371 ms
8,613 ms in get_declaration_of_new
10.34user 0.05system 0:10.45elapsed 99%CPU (0avgtext+0avgdata 100288maxresident)
Thanks! I just pushed another commit that adds some more timers inside g_d_o
... could you run with the new -_debug
and let me know the output for your workload please? That will help me focus on improving specific areas.
The measurements seem to be to detailed. They derail the performance of cppfront:
cppfront_opt -no-c -no-n -no-s -c -_debug build/pure2-regex-perl.cpp2 -o build/pure2-regex-perl.cpp
build/pure2-regex-perl.cpp2... ok (all Cpp2, passes safety checks)
Cpp1 0 lines
Cpp2 2,434 lines (100%)
Time 655,386 ms
355,076 ms in get_declaration_of_new - including external caching
354,874 ms in get_declaration_of_new
354,563 ms in get_declaration_of_new - phase 2 backward scan loop
297,028 ms in get_declaration_of - old
253,626 ms in get_declaration_of_new - phase 2a 'advance' part of loop
83,964 ms in get_declaration_of_new - phase 2aa 'std::advance' specifically
617 ms in get_declaration_of_new - phase 2b 'move this' part of loop
248 ms in get_declaration_of_new - phase 1 initial loop
649.38user 0.38system 10:55.48elapsed 99%CPU (0avgtext+0avgdata 106840maxresident)
Even without the -_debug
argument, it seems to be slower:
cppfront_opt -no-c -no-n -no-s -c build/pure2-regex-perl.cpp2 -o build/pure2-regex-perl.cpp
build/pure2-regex-perl.cpp2... ok (all Cpp2, passes safety checks)
17.99user 0.06system 0:18.15elapsed 99%CPU (0avgtext+0avgdata 107096maxresident)
@hsutter Just to add to Max's comment above, I think the get_declaration_of_new
function is (sometimes?) slower than the original.
I was testing with the -verbose -_debug
flags with various small Cpp2 tests, and came across many cases where the gdo_new function is reportedly slower than gdo_old.
My understanding is the -_debug
flag enables execution of both gdo_old and gdo_new. What I observed from testing various small Cpp2 snippets is that gdo_old was always 0 ms
but gdo_new ranged from 1 to 5 ms
.
Here's an example:
<source>... ok (all Cpp2, passes safety checks)
Cpp1 0 lines
Cpp2 67 lines (100%)
Time 10 ms
6 ms in get_declaration_of_new - including external caching
5 ms in get_declaration_of_new - phase 2 backward scan loop
5 ms in get_declaration_of_new
3 ms in get_declaration_of_new - phase 2a 'advance' part of loop
1 ms in get_declaration_of_new - phase 2aa 'std::advance' specifically
0 ms in get_declaration_of_new - phase 2b 'move this' part of loop
0 ms in get_declaration_of_new - phase 1 initial loop
0 ms in get_declaration_of - old
Compiler returned: 0
Repro on Godbolt
[Note that Compiler Explorer caches its results so you may need to insert new code (just comments are sufficient) to force Compiler Explorer to run it fresh.]
Quick ack: I suspect it's the timers. I noticed a significant slowdown for each timer I added (which seems really weird, it's just using std::chrono
in simple ways), and I put several extra timers only in the new g_d_o
so I could get some finer-granularity data from Max. I did notice that slowed it down, temporary to get instrumentation data.
OK, pushed: b77ac91a5d4a321602e2c5288b633e9e6ee8931a
We're now back to full speed.
To get full timing info now requires building with CPP2_DEBUG_BUILD
to keep the timers out of the way.
But even then they're not as bad as above anymore... it turns out that the major culprit was make_shared
(I had changed scope_timer
to return a type-erased shared_ptr<void>
to dynamically turn timing on and off, but it turns out that's very slow).
So an update from my side.
Short summary:
Detailed version: First of all, I changed the test framework. I am just compiling now the regular expressions. No additional overhead from the instantiation of the test evaluation.
I tested how the template engine can substitute common template expressions. I took regular expressions of the kind aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
and changed the first or last value. For the first I have then
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
...
for the last I have
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
...
For case I have 36 expressions a-z0-9
.
The results are (in seconds): | template_gen | code_gen |
---|---|---|
same | 3.87 | 16.54 |
left | 3.14 | 16.93 |
right | 21.28 | 16.78 |
The template engine can only substitute common expressions when the remainder of the expression is the same. In the tests, this can happen quite often, in reality I would say not.
From here, I took the code from code_gen right and modified the generated cpp version. | time in seconds | change |
---|---|---|
code_gen right | 16.78 | base |
01 remove UFCS | 13 | Removed the UFCS template calls. Replaced them with the direct call. |
02 macro for match func | 12.98 | Match function body is now a macro for simpler modification. |
03 remove ctx.fail() call | 12.68 | Stateless matchers are now in a do … while(false) loop and a non-match uses break. To jump to the end. |
04 direct r.match update | 12.94 | Only a small increase but very consistent. The compiler does not like it. |
05 base 03 own match_char function | 12.65 | Do not use cpp2::regex::char_token_matcher but a local one. (Starting to remove cppfront dependencies, so that a C version can be programmed.) |
06 char as argument | 12.86 | Same, only a small increase but the compiler does not seem to like it. Character is no longer a template argument but a function argument. |
07 base 05 explicit template arguments | 12.55 | Remove auto deduction of template arguments. |
It seems that for large code bases, the UFCS macro has a large influence on the compilation time. The other improvements are very minor.
Concerning the UFCS macro: maybe if cppfront can already decide which version to take, that it does not use the macro.
Here are the detailed values of the UFCS compile time changes: | code_gen right | 01 remove UFCS | |
---|---|---|---|
phase parsing | 5.2 | 2.89 | 55.58% |
phase opt and generate | 9.65 | 8.99 | 93.16% |
callgraph functions expansion | 6.57 | 6.69 | 101.83% |
callgraph ipa passes | 2.62 | 1.89 | 72.14% |
parser function body | 1.91 | 0.61 | 31.94% |
template instantiation | 1.96 | 1.31 | 66.84% |
TOTAL | 16.79 | 13 | 77.43% |
I will think about this during the weekend.
An update from my side. As always performance testing is an art. ;)
I wrote a an evaluation test, so that I can also test the runtime. This is sperated from the compilation time, so it does not distort the compilation results. For this I wrote a basic test which does nothing. This gives me the baseline for the runtime. But it occoured to me, that this can give me also the baseline for the compilation. I added two additional steps. One were the cppfront util file is included and one where the same structure is generated but a false match is always returned.
Name | compilation time | evaluation time | compilation time - relative | evaluation time - relative |
---|---|---|---|---|
base_01_only_func | 0.52 | 0.06 | 1.00 | 1.00 |
base_02_cpp2_include | 8.91 | 0.06 | 17.13 | 0.98 |
base_03_cpp2_struct | 9.43 | 0.06 | 18.13 | 0.97 |
It seems that the include of the the header takes a lot of the compilation time. I adjusted the current results with the base_03 results:
Name |
compilation time | evaluation time | compilation time - corrected | evaluation time - corrected | compilation time - relative | evaluation time - relative |
---|---|---|---|---|---|---|
base_03_cpp2_struct | 9.43 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 |
cleanup_01_remove_UFCS | 13.55 | 0.17 | 4.12 | 0.11 | 1.00 | 1.00 |
cleanup_02_macro_for_func_body | 13.59 | 0.17 | 4.16 | 0.11 | 1.01 | 0.98 |
cleanup_03_remove_ctx_fail | 13.32 | 0.17 | 3.89 | 0.11 | 0.94 | 0.94 |
cleanup_04_direct_match_set | 13.49 | 0.15 | 4.06 | 0.09 | 0.99 | 0.80 |
cleanup_05_b03_local_char_match | 13.21 | 0.17 | 3.78 | 0.11 | 0.92 | 1.00 |
cleanup_06_char_as_argument | 13.23 | 0.17 | 3.80 | 0.11 | 0.92 | 0.99 |
cleanup_07_b05_explicit_templates | 13.25 | 0.17 | 3.82 | 0.11 | 0.93 | 0.99 |
cleanup_08_vec_compare | 9.75 | 0.20 | 0.32 | 0.14 | 0.08 | 1.23 |
This are some preliminary results. I need to add, that the things are run multiple times, so that I can generate an average. But, cleanup_04
seemed to be quite god for runtime but did not change anything for the compilation. cleanup_08
was the last test where I replaced all the if
statements with an std::equal
. This helps compilation, but hits the runtime. Maybe I can improve this a little bit, this was just the first test.
Have a nice week!
Thanks! If you get a chance, I'm curious how much last night's commit might further help your cppfront compilation times, but no rush -- I know the main thing is the Cpp1 compilation times.
Speaking of which, this is very interesting data:
the cppfront util file is included ... [and] takes a lot of the compilation time
Aha, do you mean cpp2util.h
? If yes, I suspect maybe you're getting hit with the cost of convenience-#include
of all the standard header files, which happens if you either specify -include-std
directly, or are falling back to those semantics because of specifying -import-std
(or -pure-cpp2
which implies -import-std
) on a compiler that's not recognized as supporting modules yet and so effectively does -include-std
.
What happens when you compile without any of those three flags... i.e., compile as mixed Cpp1/Cpp2 and explicitly #include
whatever the standard headers you need that cpp2util.h
doesn't pull in unconditionally?
Looks good: | gen | compile |
---|---|---|
Current (8bdd23d) | 130.09 | 254.51 |
with performance update (6a2a809) | 90.09 | 253.7 |
with second performance update (6dbd1ba) | 10.65 | 236.9 |
version d136153 | 1.73 | 240.68 |
But I get an error when I run it with -_debug:
time cppfront -no-c -no-n -no-s -c -_debug build/pure2-regex-perl.cpp2 -o build/pure2-regex-perl.cpp
build/pure2-regex-perl.cpp2...
Internal compiler error - see cppfront-ice-data.out for debug information
Command exited with non-zero status 1
That's great to know it helped the cppfront perf issue in your code. I'll consider g_d_o performance fixed for now.
Thanks for reporting this with debug output:
But I get an error when I run it with -_debug:
time cppfront -no-c -no-n -no-s -c -_debug build/pure2-regex-perl.cpp2 -o build/pure2-regex-perl.cpp build/pure2-regex-perl.cpp2... Internal compiler error - see cppfront-ice-data.out for debug information
I don't see build/pure2-regex-perl.cpp2
in the PR, could you send that file so I can see what the problem code is?
Thanks!
I don't see build/pure2-regex-perl.cpp2 in the PR, could you send that file so I can see what the problem code is?
I pushed the branch, which I use to to the performance tests with the main branch. https://github.com/MaxSagebaum/cppfront/tree/temp_branch/regex_main_perf_test
Under regresssion_tests
you have pure2-regex.cpp2
. I also attached it.
pure2-regex.cpp2.txt
Aha, do you mean cpp2util.h? If yes, I suspect maybe you're getting hit with the cost of convenience-#include of all the standard header files, which happens if you either specify -include-std directly, or are falling back to those semantics because of specifying -import-std (or -pure-cpp2 which implies -import-std) on a compiler that's not recognized as supporting modules yet and so effectively does -include-std.
That is not the issue. It only includes the necessary headers for cppfront. The full include is not happening. I tried to used the standard module but on my system this seems not to work/ be installed.
I modified the tests so that compilation and runtime are measured 10 times. The standard deviation for the compilation is about 0.2
seconds and for the runtime it is 0.01
seconds. So the results are quite solid.
I reorganized the tests a little bit so the list is a little bit different.
6
seconds to the compilation time. I think that I can reduce the number of places this would be inserted. But I would strongly vote for an options to deactivate it for some code parts. I will open an Issue about this.I will try now to aggregate char matchers in the regex metafunctions. Lets see how much this gives us for the test suite.
name | compile | compile adjusted | compile relative | run | run adjusted | run relative |
---|---|---|---|---|---|---|
base_03_cpp2_struct-compile | 14.14 | 0 | 0 | 0.23 | 0 | 0 |
cleanup_00_base-compile | 25.48 | 11.35 | 1 | 0.32 | 0.09 | 1 |
cleanup_01_remove_UFCS-compile | 19.49 | 5.36 | 0.47 | 0.33 | 0.1 | 1.09 |
cleanup_02_macro_for_func_body-compile | 19.35 | 5.22 | 0.46 | 0.32 | 0.09 | 1.01 |
cleanup_03_remove_ctx_fail-compile | 19.14 | 5 | 0.44 | 0.33 | 0.1 | 1.07 |
cleanup_04_direct_match_set-compile | 19.33 | 5.19 | 0.46 | 0.32 | 0.09 | 0.99 |
cleanup_05_b03_local_char_match-compile | 18.89 | 4.75 | 0.42 | 0.32 | 0.09 | 0.97 |
cleanup_06_char_as_argument-compile | 19.19 | 5.05 | 0.45 | 0.31 | 0.08 | 0.95 |
cleanup_07_b05_explicit_templates-compile | 19.12 | 4.98 | 0.44 | 0.32 | 0.09 | 1.03 |
cleanup_08_one_check_for_size-compile | 17.89 | 3.76 | 0.33 | 0.33 | 0.1 | 1.08 |
cleanup_09_update_of_pos_only_once-compile | 17.09 | 2.95 | 0.26 | 0.34 | 0.11 | 1.2 |
cleanup_10_vec_compare-compile | 14.71 | 0.57 | 0.05 | 0.33 | 0.1 | 1.07 |
cleanup_11_local_array-compile | 14.56 | 0.42 | 0.04 | 0.4 | 0.17 | 1.94 |
cleanup_12_while_compare-compile | 14.49 | 0.35 | 0.03 | 0.43 | 0.2 | 2.19 |
cleanup_13_b11_memcmp-compile | 14.73 | 0.59 | 0.05 | 0.41 | 0.18 | 1.99 |
cleanup_14_b12_vectorized_for-compile | 14.68 | 0.54 | 0.05 | 0.32 | 0.09 | 1.02 |
Resolved:
But I get an error when I run it with -_debug:
time cppfront -no-c -no-n -no-s -c -_debug build/pure2-regex-perl.cpp2 -o build/pure2-regex-perl.cpp build/pure2-regex-perl.cpp2... Internal compiler error - see cppfront-ice-data.out for debug information
Yup, I've confirmed this is a case where the old lookup was wrong and the new lookup is correct, so -_debug
is flagging the change in result but the change is desirable. Here's a distilled version of the example:
f: () = {
if true {
group := line_3;
}
else if true {
group := line_6;
}
else {
r.group(line_9);
}
}
What the diagnostic is saying is "hey, get_definition_of
for token group
on line 9 now returns 'not found' and before it used to return the declaration on line 3"... but that previous answer was wrong, and now it's correct.
I tried to used the standard module but on my system this seems not to work/ be installed.
If you're using MSVC, they're not prebuilt for you, but it's easy to build them... easy once you find the right documentation that is, but the information is buried in this tutorial (sorry about this, we're working on improving the documentation).
Just open a VS native tools command prompt, and run the following two commands:
cl /std:c++latest /EHsc /nologo /W4 /c "%VCToolsInstallDir%\modules\std.ixx"
cl /std:c++latest /EHsc /nologo /W4 /c "%VCToolsInstallDir%\modules\std.compat.ixx"
That's it! Four files will magically appear: std.ifc
, std.obj
, std.compat.ifc
, std.compat.obj
.
If you when compile you get errors that say the flags are mismatched, it just means you need to run the above commands with the same flags you're building your project with, that's all. The most likely flag-mismatch message you'll see is that _DLL
is missing, in which case just add /MD
to the above command lines and all should be well.
It's worth doing this because using the standard library this way is really blazing fast.
up, I've confirmed this is a case where the old lookup was wrong ...
Nice. If it is not too problematic, we can keep the old logic behind a precompiler macro and add a removal tag for 10.06.2025.
If you're using MSVC, they're ...
Thanks for the tutorial, but unfortunately not. I am on Fedora Linux with mostly gcc. I tried to search for a tutorial to build it manually but did not found anything. The closest discussion is https://github.com/llvm/llvm-project/issues/73089.
I just skimmed the discussion but it seems because of a lack of standard location in Linux they did not force it.
Anybody has some information about using the std
library module under linux?
Anybody has some information about using the
std
library module under linux?
I published a small example for Linux and Clang++-18 (g++ 14 does not support import std;
yet) using CMake 3.30.
Example of using C++ modules with clang and cmake on Linux
Thanks @MaxSagebaum !
- The UFCS macro and template adds
6
seconds to the compilation time. I think that I can reduce the number of places this would be inserted. But I would strongly vote for an options to deactivate it for some code parts. I will open an Issue about this.
This is now done in 9648191ecf0000696024bb54ecb8202373a922e9 ... if you use the ..
syntax, such as str..append("xyzyy");
, it will only consider member functions. This should enable you to avoid UFCS by emitting ..
when you know you want a member, and based on your numbers it seems like it should help compilation performance a lot.
If I understand right, this should also benefit the as-of-April first implementation, which would be interesting!
Thanks @hsutter for implementing this. I went ahead and checked it. But the results where quite disappointing. It seems that UFCS adds 5 seconds but it does not depend on the amount used. This comes as a surprise to me. I thought that it would scale with the number of usages:
With UFCS | No UFCS | |
---|---|---|
template based on 6156b51 | 125.79 | 120.52 |
code gen based on 6156b51 | 217.9 | 223.46 |
code gen based on 6156b51 with compilation improvements | 234.1 | 230.14 |
It also seems that the compilation improvements do not apply to the regular regex matchers.
I did another test, where I prevented the template engine from substituting the same regular expressions: kind | time in seconds |
---|---|
template with substitution | 123.44 |
template no substitution | 259.57 |
So it seems, the advantage of the template version comes from the substitution of common templates.
Actually, I do not know how to proceed from here. I doubt that I will be able to reduce the compile time of the generated code version by a lot. It seems the compilation time comes from the problem itself and not from the template engine/code parsing. The only option I currently see, is to generate optimized code. This would boil down to genere assembler code. This could probably be done but it is outside of the scope of my experience/ time frame.
The advantage of the template version is the common template substitution. The code generation version is a stress test for cppfront and the implementation of new features is more local. It also received more refactoring which makes it a little bit simpler to read.
What are your thoughts on this?
A side node: I checked if there are some outlines in the 600 regular expressions from the perl test suite. But there where none. 10 regexes always took around 9 seconds to compile. (I did not subtract the general overhead here.)
Thanks!
It seems that UFCS adds 5 seconds but it does not depend on the amount used.
Interesting... in that case, could you help me understand better what is the exact difference you measured earlier when you mentioned:
The UFCS macro and template adds 6 seconds to the compilation time. ... cleanup_01_remove_UFCS-compile
When you did cleanup_01, what was the exactly diff... can I see the diff somewhere so I can try it out too and run some experiments? (Edited to add:) If I can repro the overhead then maybe I can find another way to eliminate it.
Regardless of what further progress might be possible, thank you again for all the results you've already achieved here and the expertise you've put into this. Much appreciated! 🎉 What's already here is a great updated result to present at CppCon, and we should look at merging this to main
as soon as you tell me it's ready. 🙏
Regardless of what further progress ...
Thanks, really appreciate the words.
Interesting... in that case, could you help me understand better what is the exact difference you measured earlier when you mentioned:
I took the generated code and modified it by replacing the UFCS call with the member call. My test setup looks like the following:
# Compiling cppfront without any checks. (We still include cpp2util.h, which includes cpp2regex.h. This will remove the additional checks from cpp2regex.h.)
build/cppfront_opt -no-c -no-n -no-s reflect.h2 -o reflect.h
build/cppfront_opt -no-c -no-n -no-s regex.h2 -o ../include/cpp2regex.h
g++ -rdynamic -std=c++20 -O3 -g cppfront.cpp -o build/cppfront_opt
# Running the tests
time g++ -std=c++23 -I../../cppfront/include -ftime-report -O3 -g source/base_03_cpp2_struct.cpp -c -o build/base_03_cpp2_struct.o
time g++ -std=c++23 -I../../cppfront/include -ftime-report -O3 -g source/cleanup_00_base_standalone.cpp -c -o build/cleanup_00_base_standalone.o
time g++ -std=c++23 -I../../cppfront/include -ftime-report -O3 -g source/cleanup_01_remove_UFCS_standalone.cpp -c -o build/cleanup_01_remove_UFCS_standalone.o
# Running an automated test
hyperfine -r 5 --export-csv build/bench_cleanup_compile.out -n base_03_cpp2_struct-compile 'g++ -std=c++23 -I/home/msagebaum/Kaiserslautern/Programms/cppfront/regex-tests/compile_tests/../../cppfront/include -ftime-report -O3 -g source/base_03_cpp2_struct.cpp -c -o build/base_03_cpp2_struct.o' -n cleanup_00_base_standalone-compile 'g++ -std=c++23 -I/home/msagebaum/Kaiserslautern/Programms/cppfront/regex-tests/compile_tests/../../cppfront/include -ftime-report -O3 -g source/cleanup_00_base_standalone.cpp -c -o build/cleanup_00_base_standalone.o' -n cleanup_01_remove_UFCS_standalone-compile 'g++ -std=c++23 -I/home/msagebaum/Kaiserslautern/Programms/cppfront/regex-tests/compile_tests/../../cppfront/include -ftime-report -O3 -g source/cleanup_01_remove_UFCS_standalone.cpp -c -o build/cleanup_01_remove_UFCS_standalone.o'
The results from this example: name | mean | mean-adj | main-rel |
---|---|---|---|
base_03_cpp2_struct-compile | 12.74 | 0 | 0 |
cleanup_00_base_standalone-compile | 24.1 | 11.36 | 1 |
cleanup_01_remove_UFCS_standalone-compile | 18.05 | 5.31 | 0.47 |
I always took the total time. I needed to modify the files a little bit, since the char_token_matcher
changed in the newest version. Theses files a compatible with the current regex branch.
cleanup_01_remove_UFCS_standalone.cpp.txt
cleanup_00_base_standalone.cpp.txt
base_03_cpp2_struct.cpp.txt
I did not include the runtime tests. If you want these I can also upload them.
Thanks, I will try to cleanup the MR next week, so that we can start merging it.
Since this is a very long standing MR and has a large history: Do you think it makes sense to squash all the commits?
Yes, I think squashing makes sense.
Also, does rebasing on main
make sense? While testing this branch I noticed it was still using the slow g_d_o.
Also, while trying out this branch to experiment myself with variations of the UFCS macros, I was having trouble compiling it with Clang (a missing typename
qualifier in the generated Cpp1 code that I should work around/fix) and MSVC (ICEs, sigh). I did build with GCC and generate a Cpp1 file from the new tests, but on trying to compile that with GCC it gave me "unknown escape sequence" errors (e.g., for \$
and a few similar ones) that I haven't been able to manually edit to get past.
Today I'll work on doing more UFCS performance experiments by just using main
and generating my own UFCS stress test cases by hand, but at some point I'd like to be able to build this branch again, maybe a short pair-programming session would do it.
Thanks again!
Maybe we can convince compiler implementors to add __builtin_hsutter_ufcs()
, or we can contribute it ourselves if its not extraordinarily complex to do so 😛
Also, does rebasing on main make sense? While testing this branch I noticed it was still using the slow g_d_o.
Ok, I will also do the rebase.
Also, while trying out this branch to experiment myself with variations of the UFCS macros, I was having trouble compiling it with Clang (a missing typename qualifier in the generated Cpp1 code that I should work around/fix) and MSVC (ICEs, sigh). I did build with GCC and generate a Cpp1 file from the new tests, but on trying to compile that with GCC it gave me "unknown escape sequence" errors (e.g., for \$ and a few similar ones) that I haven't been able to manually edit to get past.
Sorry for that, after the refactor that was no very high on my todo list. I will try to make it warning free and also do a few iterations with the test suite. This should then take care of most of the errors.
I will be away until sunday, so after that I will start working on it.
One other question: The regex regression test currently consists of all the perl tests I have enabled. This are 600 regular expressions and takes about 4 minutes to compile. One option would be to split it into several smaller tests, the other option would be to make a selection and test most features, but not every corner case.
What do you think?
or we can contribute it ourselves if its not extraordinarily complex to do so 😛
Interesting suggestion! I hadn't thought about that, but presumably a __builtin
would be simpler and faster because it could Do the Right Thing directly from the AST and not need the macro nest and if constexpr( requires{
indirect machinery.
Are you saying you're interested in trying? 👼
The regex regression test currently consists of all the perl tests I have enabled. This are 600 regular expressions and takes about 4 minutes to compile. One option would be to split it into several smaller tests, the other option would be to make a selection and test most features, but not every corner case.
My gut reaction is that I think having all those tests is quite valuable, and a prod for me to improve compile times. The only reason I made the -partial
subset in this branch was to disable the ones I wasn't able to get to compile at the time.
The transpile time of cppfront is no longer an issue it is about 10 seconds. The main compile time of 4 minutes comes from gcc (or other compilers). As I stated, this comes not from the template instantiation but from the optimization of the code. So in order to reduce this, we would have to generate assembler code ;-).
I would suggest, to build two groups of tests for the regex. One smaller test of eg 50 or so regexes which runs in the regular regression test. The remainder would be run with a special option and always evaluated in the regression tests on github. I will put them in multiple files, so if we can run the tests in parallel, there would not be one big long running test.
The transpile time of cppfront is no longer an issue it is about 10 seconds. The main compile time of 4 minutes comes from gcc (or other compilers).
Right, that's what I meant ... my understanding is that now the "hot spot" is the UFCS macros in the Cpp1 compiler. That's what I'm taking a look at now as the next bottleneck. Thanks for pointing it out!
Right, that's what I meant ... my understanding is that now the "hot spot" is the UFCS macros in the Cpp1 compiler. That's what I'm taking a look at now as the next bottleneck. Thanks for pointing it out!
Ok, keep me posted on the benchmarks. Since I have spent already some time on this, I might give you some insight if they are sound. Because, as you know, proper bench-marking is an art. ;)
[EDITED TO ADD: Case 4, and Case 4 note] [EDITED TO ADD: Case 5]
I have some initial data... I tried three cases:
Case 1: No UFCS, just call x.y(z);
Case 2: "Partial" UFCS, call a stripped-down version of what UFCS generates, [](auto&& o, auto&& ...p) { o.y(std::forward<decltype(p)>(p)...); } (x, z);
Case 3: Call actual CPP2_UFCS(y)(x, z);
1&3 Mix: Call a 50/50 mix of case 1 and case 3.
And to compare against a non-UFCS baseline:
Case 4: A sample short piece of 'regular' code, { std::vector<int> v(100); for(auto i:v){} }
Case 5: A call to a templated invoker function as possible alternative to using IILE, invoke(x, z);
I ran a test that does 10,000 function calls:
average of 3 runs (including cppfront time)
on the same notebook (Linuxes via WSL2)
Case GCC 10 GCC 14 Clang 12 MSVC 19.40
(Ubuntu) (Fedora) (Ubuntu) (Windows)
---- ---------- ---------- ---------- ----------
1 1,090 1,230 760 1,050
2 10,700 6,500 1,940 3,830
3 11,700 15,000 4,900 11,100
1&3 4,100 4,930 2,630 5,140
mix
4 9,200 9,900 3,200 7,800
5 1,150 1,430 790 1,070
Command lines used:
g++-10 -std=c++20 -pthread -Wold-style-cast -Wunused-parameter
g++-14 -std=c++20 -pthread -Wold-style-cast -Wunused-parameter
clang++-12 -std=c++20 -pthread -Wold-style-cast -Wunused-parameter
cl -W3 -std:c++latest -MD -EHsc
Notes:
The overhead is proportional to the number of UFCS calls. So the recent update to add ..
to suppress UFCS ought to enable code to fully opt out of the overhead?
A lot of the UFCS extra cost comes from immediately invoked lambda expressions (IILEs): ~90% for GCC 10, to ~35% for Clang and MSVC.
I don't know how much of the non-IILE cost would go away if IILEs went away (i.e., is amplified by IILEs being present).
Since so much of the cost is IILEs, it seems like if cppfront wanted to improve this the best thing would be a significant surgery to have cppfront emit directly what currently is being handled as an IILE.
That's what I have so far...
[ADDED] Case 4 note: On all compilers, it looks like the time to compile a UFCS call is ~1 millisecond (on my machine), which is ~50% more than compiling a simple two-line block (declaring a vector<int>
and range-for
over it). That doesn't seem to be inordinately expensive in itself, but it does add up as 1,000 UFCS calls take about 1 second to compile (on my machine) which can become significant... so I want to watch this effect as we progress and gather more data with ever-larger files.
Thanks again @MaxSagebaum for the stress test with the higher-volume generated code, and pointing out the UFCS cost as something worth investigating and worth keeping an eye on!
Interesting suggestion! I hadn't thought about that ...
Many standard utilities are implemented like that in compilers, for instance: __builtin_bit_cast
... I would expect a proper extension that implements such a built-in to perform as good as how the theoretical adopted paper would.
Are you saying you're interested in trying? 👼
Not closed to the idea of trying, but certainly not now. I've got bigger fish to fry with the reflection API and metafunctions at the moment 😄
I've updated my previous comment in-place with an additional Case 4 and a conclusion.
Summarizing the delta here:
To compare against a non-UFCS baseline, I added:
{ std::vector<int> v(100); for(auto i:v){} }
Adding in a line for Case 4, in the time to compile 10,000 copies of each case:
Case GCC 10 GCC 14 Clang 12 MSVC 19.40
(Ubuntu) (Fedora) (Ubuntu) (Windows)
---- ---------- ---------- ---------- ----------
1 1,090 1,230 760 1,050
2 10,700 6,500 1,940 3,830
3 11,700 15,000 4,900 11,100
1&3 4,100 4,930 2,630 5,140
mix
4 9,200 9,900 3,200 7,800
An additional note is:
Case 4 note: On all compilers, it looks like the time to compile a UFCS call is ~1 millisecond (on my machine), which is ~50% more than compiling a simple two-line block (declaring a vector<int>
and range-for
over it). That doesn't seem to be inordinately expensive in itself, but it does add up as 1,000 UFCS calls take about 1 second to compile (on my machine) which can become significant... so I want to watch this effect as we progress and gather more data with ever-larger files.
Here's the test code I used (for anyone else who wants to run experiments, and so I can find the code here again for my own use in the future):
Should we move the UFCS performance stuff to its own issue, for easier tracking and referencing? That way it might have more visibility as well and other people who are knowledgeable on the topic could chip in, also avoids hijacking this PR further. Thoughts @hsutter ?
Should we move the UFCS performance stuff to its own issue, for easier tracking and referencing?
I maybe should have done that, but I think we've bottomed out for now and I haven't been able to repro it apart from this PR's regex compile time specifically. So let's open a new issue if it comes up again, which can reference this. And in the meantime I'm interested in helping regex performance.
@hsutter I had a look at the results and your conclusions are sound. Just a few questions.
auto invoke(auto&& o, auto&& ...p) {
return o.y(std::forward<decltype(p)>(p)...);
}
-ftime-report
. This could show if the time comes from optimization or instantiation.I pushed the current version to get the reports from the pipeline. @hsutter sorry I force pushed you commit away. Thanks for the changes. I fixed them already during my roundup, so they were already included.
On my platform there is only one warning left. The first fix for that one did not work. So I want to see what the other environments say.
- In the compiling options you did not specify an optimization level.
Right, I left it at the default optimization level. I could re-test at higher optimization levels.
- How about a 5-th case which does not use an IILE but instead uses a regular template function? E.g.:
auto invoke(auto&& o, auto&& ...p) { return o.y(std::forward<decltype(p)>(p)...); }
Good idea, thanks! Added, above. The results are about the same as an ordinary member function call, which is good. What I haven't tried, however, is replacing the IILE with such an invoke
with all the other UFCS machinery. That is an experiment I will try. (But for the purposes of this PR, ..
should be a workaround to eliminate this overhead.)
- Did you had a look at the breakdown of the compilation time? For gcc this is
-ftime-report
. This could show if the time comes from optimization or instantiation.
For the full-UFCS case 3, GCC 10 reports these as the main costs (trimming everything below 10%):
Time variable usr sys wall GGC
phase parsing : 11.19 ( 90%) 1.92 ( 90%) 13.40 ( 90%) 1064838 kB ( 83%)
phase opt and generate : 1.25 ( 10%) 0.21 ( 10%) 1.47 ( 10%) 201055 kB ( 16%)
|name lookup : 7.23 ( 58%) 0.55 ( 26%) 8.10 ( 54%) 15069 kB ( 1%)
|overload resolution : 7.70 ( 62%) 0.55 ( 26%) 8.38 ( 56%) 170012 kB ( 13%)
parser function body : 2.53 ( 20%) 0.28 ( 13%) 2.68 ( 18%) 444954 kB ( 35%)
template instantiation : 7.60 ( 61%) 0.56 ( 26%) 8.28 ( 55%) 179780 kB ( 14%)
@hsutter The branch should now be ready for review.
The remaining failures in the regression tests have two categories:
match_return
fails with clang-15. I created a bug report for this. https://github.com/hsutter/cppfront/issues/1146 I can program around it or we fix this issue first.There are a few TODOS
left in the regex code base. These are mostly reminders on issues, missing cpp2 features, or for myself what I can improve in the future.
Thanks, Max!
- Clang 15 construction: Constructing a
match_return
fails with clang-15. I created a bug report for this. [BUG] Wrong generation of construction for @struct and clang-15 #1146 I can program around it or we fix this issue first.
I just pushed a commit that should fix this -- please let me know if it does the job for you. Thanks for reporting it and suggesting the fix approach.
Thanks for fixing this. I updated the headers. Lets see what the pipeline says.
I will update this overview such that it is easy to grasp the status of the implementation.
Example file:
example.cpp2
Current status and planned on doing
Modifiers
Escape sequences (Complete)
Quantifiers (Complete)
Character Classes and other Special Escapes (Complete)
Assertions
Capture groups (Complete)
Quoting metacharacters (Complete)
Extended Patterns
Lookaround Assertions
Special Backtracking Control Verbs
Not planned (Mainly because of Unicode or perl specifics)
Modifiers
Escape sequences
Character Classes and other Special Escapes
Assertions
Extended Patterns
Script runs