marxin / cvise

Super-parallel Python port of the C-Reduce

Understanding C-Vise Performance with multiple cores and Comparison with C-Reduce #123

Open pramodk opened 8 months ago

pramodk commented 8 months ago

Hello Martin,

Firstly, big thanks for putting together and keeping up with this invaluable tool 🙌 -- it's been very helpful for us, and I'm sure for many others, when it comes to sorting out compiler bugs in codebases!

This is not a bug/issue but just reaching out to get a better handle on C-Vise and make sure we're getting the most out of it.

Just a bit of background: we started with C-Reduce to tackle some compiler bugs in our project. But as some reductions were taking hours and hours, we tried throwing multiple threads/cores at them to speed things up, and the execution time didn't show major improvements. Then we stumbled onto C-Vise, wondering if it would run faster. But nope, similar behaviour: no big drop in execution time even with multiple cores.

I've read through the discussions in #41 and #114, as well as John Regehr's blog post about Parallel Test-Case Reduction in C-Reduce. I get the gist of the constraints imposed by the algorithm/parallelization strategy, but I thought I could still ask for more clarity on the points below:

  1. Both C-Reduce and C-Vise use clang_delta, with C-Vise's top layer in Python and C-Reduce's in Perl. All good! I am curious how the parallelization scheme in C-Vise differs from C-Reduce's. What makes (or is supposed to make) C-Vise more efficient than C-Reduce? (i.e. what is the "Super-parallel" aspect of C-Vise?)

  2. To make sure my setup is right, I wanted to see if I could hit the performance baseline in the README (instead of using our internal test case):

    Test-case   Size     C-Vise Reduction   C-Reduce Reduction   Speed Up
    PR92516     6.5 MB   35m                77m                  220%

So I did the following:

$ cat bug.sh

#!/bin/bash
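# Interestingness test: exit 0 (interesting) only if the compile still hits the ICE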

compile_output=$(~/install/gcc-r278246/bin/g++ -O3 bug567.cc -std=c++1z 2>&1)

if [[ $compile_output == *"internal compiler error: in vect_schedule_slp_instance"* ]]; then
    echo "Still fails for vectorisation with -O3!"
    exit 0
else
    exit 1
fi

$ cat driver.sh

for ncores in 16 12 8 6 4 2 1;
do
    start_time=$SECONDS
    taskset -c 0-$((ncores - 1)) cvise --timestamp --n $ncores bug.sh bug567.cc    # taskset or numactl to pin the threads 
    elapsed_time=$((SECONDS - start_time))
    echo "==>  --n $ncores took: $elapsed_time seconds"
done

and a similar one for C-Reduce.

#threads    C-Reduce (seconds)    C-Vise (seconds)
       1                  5161               3916
       2                  4680               4348
       4                  3557               3624
       6                  3387               3446
       8                  3834               3406
      12                  4798               3424
      16                  5785               3579

[plot compare-creduce-cvise: reduction time vs. number of threads for C-Reduce and C-Vise]

i.e. neither tool scales well beyond a handful of cores: C-Vise bottoms out at around 6-8 threads, and C-Reduce actually gets slower past that point.

I'm not deep into these tools yet, especially the internal implementation details. I'm just posting my observations here, hoping you can tell whether this is more or less expected or whether there's something off in my analysis.

Thanks again for all your work and support!

pramodk commented 7 months ago

I was playing a bit more with another example from our internal project using the NVHPC compiler. I used benchmarking scripts similar to the ones mentioned above. Simply looking at htop, all CPUs are in use as expected:

[screenshot: htop showing all cores busy during the reduction]

And the comparison looks like this:

[plot: C-Reduce vs C-Vise reduction times for this test case]

i.e. the same pattern as before: all cores are busy, but adding more of them doesn't bring the overall reduction time down by much.

marxin commented 7 months ago

Hey!

Firstly, big thanks for putting together and keeping up with this invaluable tool 🙌 -- it's been very helpful for us, and I'm sure for many others, when it comes to sorting out compiler bugs in codebases!

Thanks, I appreciate every happy user of my tool. To be honest, since I left SUSE some time ago (where I used C-Vise on a daily basis as a GCC developer), my interest in the project is not as high as it used to be. But I'm happy to help you and hopefully improve your use-cases.

What makes (or is supposed to make) C-Vise more efficient than C-Reduce? (i.e. what is a Super-parallel aspect in C-Vise?)

Well, the biggest difference, and the one that helped me a lot, is that C-Vise can run clang_delta in parallel (unlike C-Reduce), which can lead to a reduction speed-up in cases where clang_delta takes a significant amount of time (modern C++ code snippets). Speaking about PR 92516, the time is dominated by the compiler: for the unreduced test-case it takes ~7s, while clang_delta takes only 1s. Apart from that, I recently added to C-Vise the behaviour that the initial 2 passes flip between each other after N=30 successful transformations:

00:00:29 INFO ===< ClangBinarySearchPass::replace-function-def-with-decl (30 T) >===
00:00:33 INFO using C++ standard: c++2b with 11886 transformation opportunities
00:01:02 INFO (10.1%, 6126558 bytes, 140780 lines)
...
00:04:42 INFO skipping after 30 successful transformations

Looking again at the aforementioned test-case, it takes 2800s with 12 threads, while the single-threaded mode takes 3800s, so yes, the speed-up is negligible.

I've made a quick analysis and I think it's caused by the algorithm itself: it finds the first successful state (a BinaryState in the case of the ClangDelta passes) that reduces the test-case. It then waits for all previously issued states to finish and, once they are done, commits the reduction based on that state. However, it seems the number of failing states in between these successful states is quite small (in the later passes), and they typically fail fast (a compilation error plus -Wfatal-errors), so there is little parallel work to overlap.
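To make the scheme concrete, here is a minimal sketch, not C-Vise's actual code, of the speculative strategy described above: a batch of candidate transformations is tested in parallel and the earliest success (in submission order) is the one that gets committed. The names is_interesting and apply_candidate are hypothetical stand-ins for the user's interestingness test and the pass's transformation. If successes arrive after only a handful of cheap failures, as happens in the later passes here, most of the batch is wasted and extra cores buy little.

from concurrent.futures import ProcessPoolExecutor

def is_interesting(candidate):
    # Stand-in for running the interestingness test (e.g. bug.sh) against the
    # test-case produced by this candidate transformation.
    return candidate % 7 == 0

def apply_candidate(state, candidate):
    # Stand-in for committing the candidate's output as the new current test-case.
    return state + [candidate]

def reduce_step(state, candidates, jobs=8):
    # Speculatively test up to `jobs` candidates at once and commit the earliest
    # success in submission order; remaining in-flight jobs are simply left to
    # finish here (a real reducer would cancel or kill them).
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        futures = [(c, pool.submit(is_interesting, c)) for c in candidates[:jobs]]
        for candidate, future in futures:       # iterate in submission order
            if future.result():                 # earlier candidates already decided
                return apply_candidate(state, candidate)
    return None                                 # nothing in this batch was interesting

if __name__ == "__main__":
    print(reduce_step([], list(range(1, 20))))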

The second problem (related to the one I addressed with the N=30 successful-transformations limit): some passes only make small improvements, and it would be better to move on to another pass (try the 'S' keystroke during the run to intervene manually):

00:13:33 INFO (70.3%, 2024657 bytes, 62746 lines)
00:13:35 INFO (70.3%, 2024240 bytes, 62741 lines)
00:13:36 INFO (70.3%, 2024004 bytes, 62736 lines)
...
00:18:45 INFO (70.8%, 1986991 bytes, 61990 lines)
00:18:46 INFO (70.8%, 1986960 bytes, 61989 lines)
00:18:48 INFO (70.8%, 1986934 bytes, 61988 lines)
00:18:49 INFO (70.8%, 1986864 bytes, 61986 lines)
00:18:49 INFO ===< LinesPass::0 >===
00:18:52 INFO (70.9%, 1979778 bytes, 3329 lines)
00:18:54 INFO (71.0%, 1975236 bytes, 3219 lines)
00:18:55 INFO (71.4%, 1946537 bytes, 3109 lines)
00:18:56 INFO (71.5%, 1941087 bytes, 2999 lines)
00:18:57 INFO (71.6%, 1936804 bytes, 2944 lines)
00:18:58 INFO (71.7%, 1930422 bytes, 2889 lines)
00:19:00 INFO (71.7%, 1924734 bytes, 2834 lines)
00:19:01 INFO (76.9%, 1570311 bytes, 2779 lines)
00:19:02 INFO (77.0%, 1567687 bytes, 2724 lines)
00:19:04 INFO (77.0%, 1566062 bytes, 2697 lines)
00:19:05 INFO (77.0%, 1565166 bytes, 2670 lines)
00:19:06 INFO (77.0%, 1564674 bytes, 2643 lines)

As seen, only a 0.5% improvement was achieved in 5 minutes, while the next pass then makes a much bigger leap in a few seconds. One might implement a cleverer algorithm that switches to another pass once the rate of improvement becomes low, as sketched below.
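Just to illustrate that idea, here is a hypothetical sketch of such a rate-based switch; nothing like this exists in C-Vise today, and run_one_transformation is an invented callback that applies one transformation and returns the new size in bytes (or None when the pass has no opportunities left):

import time

def run_pass_with_rate_limit(run_one_transformation, current_size,
                             min_bytes_per_sec=100.0, window=30.0):
    # Keep applying transformations, but abandon the pass as soon as the
    # byte-reduction rate over the last `window` seconds drops below the
    # threshold, so the reducer can move on to a more productive pass.
    window_start = time.monotonic()
    size_at_window_start = current_size
    while True:
        new_size = run_one_transformation()
        if new_size is None:                    # pass ran out of opportunities
            return current_size
        current_size = new_size
        elapsed = time.monotonic() - window_start
        if elapsed >= window:
            rate = (size_at_window_start - current_size) / elapsed
            if rate < min_bytes_per_sec:
                return current_size             # improving too slowly; switch passes
            window_start = time.monotonic()
            size_at_window_start = current_size

For scale, the log above shows roughly 120 bytes per second before the switch and tens of kilobytes per second once LinesPass starts, so even a crude threshold would trigger early.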

One might also experiment with the pass order a bit; I can imagine running LinesPass for a limited number of iterations in between the initial clang_delta passes.

Anyway, would you be willing to invest some time in the C-Vise project and experiment with ideas that could potentially improve your HPC test-cases? Any chance you can upload a real unreduced test-case I can play with? And roughly how many test-cases do you reduce over a given period of time?

Again, appreciate your interest and really valuable feedback for the project.

marxin commented 7 months ago

Log files for PR 92516: threads-1.txt threads-24.txt

marxin commented 6 months ago

@pramodk Just a friendly reminder. I spent quite some time working with C-Vise and C-Reduce, so I would like to leverage your very useful experience.

dufferzafar commented 5 months ago

@marxin Like Pramod, I've been looking into the performance characteristics of C-Vise and C-Reduce as well.

But my use case is --not-c, so I'm trying it out with other languages and my own interesting.sh script.

Could you shed some light on how these tools perform in such cases? Should I expect C-Vise's parallelism to perform better?

marxin commented 5 months ago

other languages

Which languages are you talking about?

Generally speaking, you can just run cvise-delta.py, which performs line-by-line removal; it might be faster than what C-Reduce does. Please test it.
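For reference, here is a rough, self-contained sketch of what a line-removal delta loop does; this is not the actual cvise-delta.py implementation, and the chunk-halving strategy is just one common variant. It rewrites the file in place, keeps any removal that still satisfies the interestingness script, and restores the best version at the end.

import subprocess
import sys

def is_interesting(lines, test_script, path):
    # Write the candidate to disk and run the interestingness test against it;
    # exit code 0 means the bug still reproduces.
    with open(path, "w") as f:
        f.writelines(lines)
    return subprocess.run(["bash", test_script]).returncode == 0

def delta_lines(path, test_script):
    with open(path) as f:
        lines = f.readlines()
    chunk = len(lines) // 2 or 1
    while chunk >= 1:
        removed_any = False
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]   # drop `chunk` lines at i
            if candidate and is_interesting(candidate, test_script, path):
                lines = candidate                        # keep the removal
                removed_any = True
            else:
                i += chunk                               # skip past this chunk
        if not removed_any:
            chunk //= 2                                  # nothing removable: refine
    with open(path, "w") as f:                           # restore the best version
        f.writelines(lines)

if __name__ == "__main__":
    delta_lines(sys.argv[1], sys.argv[2])   # e.g. delta_lines("bug567.cc", "bug.sh")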

lanza commented 4 months ago

For an extra data point: running with -n166, cvise grinds to a halt, with htop showing nothing happening in any cvise process for long periods of time. Dropping it down to 6 threads drastically speeds things up. Very strange behavior.