Parallelism doesn't do much

fiesh commented 1 year ago

Sorry if this is a vague description, I can supply additional files to create some proper testcase. However, I felt explaining what's happening might be best.

I'm reducing an interestingness test that basically looks like this:

ulimit -t 30
g++ -c a.ii && ! g++ -c -fimplicit-constexpr a.ii

The point is to find a bug that triggers indefinite runtime and resource exhaustion.

Now I've started this reduction with 64 threads and noticed that every step in at least all the initial passes (haven't gotten past this point yet) take 30 seconds plus a little bit, so basically one execution of the interestingness test. In each iteration, one line is removed. So it seems only one thread wins and the others don't do much. Reducing the invocation of cvise to 4 threads appears to make no difference in the reduction speed, I still get one line every 30 seconds.

Am I fundamentally misunderstanding something, or is this intended behavior and the threads just race?

marxin commented 1 year ago

Hey, thanks for the report. First, please paste here a reduction log so that I can see what happens. How big is your a.ii file? Note the most progress should be achieved by clang_delta passes remove-unused-function and replace-function-decl which run in parallel. It runs multiple int. test runs, and once one successfully finishes, it waits for the others to start before the "winner". Since then, no new tasks are run and the first successful is taken.

Another potential speed-up can be achieved by measuring time for g++ -c a.ii (that should succeed) and then run timeout X g++ -c a.ii -fimplicit-constexpr with X equal to 10 x time of the first command execution.

Generally speaking, there should not be any problem with your interestingness test.

fiesh commented 1 year ago

Ah thanks, I didn't know about timeout.

This is what the reduction log looks like:

% nice -n 20 cvise -n 64 --clang-delta-std c++20 test.sh a.ii
00:00:30 INFO ===< 8414 >===
00:00:30 INFO running 64 interestingness tests in parallel
00:00:30 INFO INITIAL PASSES
00:00:30 INFO ===< IncludesPass >===
00:00:30 INFO ===< UnIfDefPass >===
00:00:31 INFO ===< CommentsPass >===
00:00:31 INFO ===< IfPass >===
00:00:31 INFO ===< LineMarkersPass >===
00:00:31 INFO ===< BlankPass >===
00:00:31 INFO ===< ClangBinarySearchPass::replace-function-def-with-decl (30 T) >===
00:01:03 INFO (1.0%, 139823 bytes, 1789 lines)
00:01:34 INFO (1.1%, 139730 bytes, 1786 lines)
00:02:06 INFO (1.3%, 139433 bytes, 1786 lines)
00:02:37 INFO (1.6%, 138994 bytes, 1786 lines)
00:03:08 INFO (2.5%, 137627 bytes, 1786 lines)
00:03:39 INFO (2.6%, 137535 bytes, 1784 lines)
00:04:10 INFO (2.6%, 137529 bytes, 1784 lines)
00:04:41 INFO (2.7%, 137409 bytes, 1781 lines)
00:05:12 INFO (2.9%, 137173 bytes, 1777 lines)
00:05:43 INFO (2.9%, 137167 bytes, 1777 lines)
00:05:57 INFO Exiting now ...
% mv a.ii.orig a.ii
% nice -n 20 cvise -n 4 --clang-delta-std c++20 test.sh a.ii
00:00:30 INFO ===< 54808 >===
00:00:30 INFO running 4 interestingness tests in parallel
00:00:30 INFO INITIAL PASSES
00:00:30 INFO ===< IncludesPass >===
00:00:30 INFO ===< UnIfDefPass >===
00:00:30 INFO ===< CommentsPass >===
00:00:30 INFO ===< IfPass >===
00:00:30 INFO ===< LineMarkersPass >===
00:00:30 INFO ===< BlankPass >===
00:00:31 INFO ===< ClangBinarySearchPass::replace-function-def-with-decl (30 T) >===
00:01:04 INFO (1.0%, 139823 bytes, 1789 lines)
00:01:35 INFO (1.1%, 139730 bytes, 1786 lines)
00:02:07 INFO (1.3%, 139433 bytes, 1786 lines)
00:02:39 INFO (1.6%, 138994 bytes, 1786 lines)
00:03:10 INFO (2.5%, 137627 bytes, 1786 lines)
00:03:41 INFO (2.6%, 137535 bytes, 1784 lines)
00:04:11 INFO (2.6%, 137529 bytes, 1784 lines)
00:04:42 INFO (2.7%, 137409 bytes, 1781 lines)
00:05:13 INFO (2.9%, 137173 bytes, 1777 lines)
00:05:46 INFO (2.9%, 137167 bytes, 1777 lines)
00:06:04 INFO Exiting now ...

I should say that in the first case, it appears as if 32 threads run simultaneously, while in the second there appear to be 4 threads.

testcase.tar.gz

marxin commented 1 year ago

OK, so the speed of both -n X is pretty much the same, which is kind-of-expected. Please use htop or something similar to see the system utilization. Or use --debug argument and you'll see all the BinaryState modifications that are tried with the int. test.

Looking at your test-case, the g++ invocation w/o the -fimplicit-constexpr finishes in 0.2s, thus I'm running: timeout 0.4 g++ a.ii -c -std=c++20 -w && ! timeout 4 g++ a.ii -c -fimplicit-constexpr -std=c++20'

and the reduction runs pretty smoothly with the following system utilization: usage

fiesh commented 1 year ago

OK, so the speed of both -n X is pretty much the same, which is kind-of-expected.

I guess I have some fundamental misunderstanding here... why is this expected?

marxin commented 1 year ago

Because the fastest step (in your case) can be achieved in 30s and it seems that the tests (reduced by C-Vise) either fail fast (due to a compilation error of the first g++ invocation), or really reach the timeout of 30s. That's why you can't see a difference between 4 and 32 threads.

Does it help?

fiesh commented 1 year ago

Because the fastest step (in your case) can be achieved in 30s and it seems that the tests (reduced by C-Vise) either fail fast (due to a compilation error of the first g++ invocation), or really reach the timeout of 30s. That's why you can't see a difference between 4 and 32 threads.

Does it help?

So they effectively just race? That's my misunderstanding then I suppose, I was assuming some more proper parallelization. If they all race, this behavior makes perfect sense, but using many threads will more often than not be entirely pointless. Do all passes just race? It might be worth explaining that in the documentation?

marxin commented 1 year ago

Well, partially, because the timeout 30 is slowing that down rapidly. What a BinaryPass in C-Vise does is that it attempts to make the reduction of [0-n/2, n/2-n, 0-n/4, n/4-n/2, ...] and once a first interval finishes, it cancels all parallel runs and it waits for all intervals before the one which succeeded. If the success rate is pretty high (your test case), then each step basically takes 30 seconds before a new test case is again split into intervals.

Note, in most cases, when a int. test finishes fast, most of the time is spent in either clang_delta and thus it does not play a role.

Overall speaking, having a long time out can't make the reduction much slower as the nature of C-Vise tool is to reduce a test case incrementally.

fiesh commented 1 year ago

I see, thanks for the explanation. I guess an alternative would be to, instead of taking the first success and waiting for potentially better ones, to combine the successful intervals? Like if line a and line b can be removed individually, there's a high chance they can also be removed both simultaneously?

Anyway, I suppose this is a misunderstanding on my side. I don't know if there's a way to improve the behavior like I suggested or the documentation can be made clearer, but basically there's a large racing component to the parallelism here that just really reduces the usefulness of many threads in some scenarios. So I think this can be closed since it's just my misunderstanding and no wrong behavior.

Thank you for your explanations!

marxin commented 1 year ago

I see, thanks for the explanation. I guess an alternative would be to, instead of taking the first success and waiting for potentially better ones, to combine the successful intervals? Like if line a and line b can be removed individually, there's a high chance they can also be removed both simultaneously?

Yes, there's a high chance that both intervals (or more) can be combined. However, it would make the entire reduction algorithm much more complicated.

If you want, feel free to make a pull request with either documentation improvement, or with an attempt to more complicated merging of the intervals. Patches are welcome!

marxin / cvise

Parallelism doesn't do much #114