marxin / cvise

Super-parallel Python port of the C-Reduce

unintentional synchronization points? #41

Open nickdesaulniers opened 3 years ago

nickdesaulniers commented 3 years ago

When I run cvise, it sometimes seems to stall and stop making progress. When I check htop, it doesn't look like my cores are doing any work related to cvise. It looks like

00:00:03 INFO ===< LinesPass::0 >===

is what's running. I'm running cvise with --n 71. Are there synchronization points where all threads need to wait on one to make progress? Bisection will eventually converge, but it seems to take quite some time to make forward progress.

Playing with ps -ef, it looks like there are 71 instances of the python interpreter running, with 0 instances of clang, clang_delta, or my reproducer shell script.

nickdesaulniers commented 3 years ago

FWIW: attaching gdb and grabbing a py-bt looks like:

(gdb) py-bt
Traceback (most recent call first):
  <built-in method acquire of _multiprocessing.SemLock object at remote 0x7fc951da5470>
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/channel.py", line 140, in unix_acquire
    return (self.reader_mutex.acquire(timeout=LOCK_TIMEOUT) and
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/channel.py", line 130, in __enter__
    if self.acquire():
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/channel.py", line 110, in lock
    with self.mutex:
  <built-in method next of module object at remote 0x7fc9523fd0e0>
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 440, in task_transaction
    with channel.lock:
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 433, in fetch_task
    return task_transaction(channel)
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 419, in worker_get_next_task
    yield fetch_task(channel)
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 404, in worker_process
    for task in worker_get_next_task(channel, params.max_tasks):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/process.py", line 571, in _bootstrap
  File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 75, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.8/multiprocessing/process.py", line 633, in start
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/common.py", line 141, in launch_process
    process.start()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 632, in new_worker
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 364, in create_workers
    self.new_worker()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 192, in start
    self.worker_manager.create_workers()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 63, in _start_pool
    self._pool_manager.start()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/base_pool.py", line 100, in _update_pool_state
    self._start_pool()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/base_pool.py", line 91, in _check_pool_state
    self._update_pool_state()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.8/site-packages/pebble/pool/process.py", line 90, in schedule
    self._check_pool_state()
  File "/android1/cvise/build/cvise/utils/testing.py", line 693, in run_parallel_tests
  File "/android1/cvise/build/cvise/utils/testing.py", line 764, in run_pass
  File "/android1/cvise/build/cvise/cvise.py", line 161, in _run_additional_passes
    self.test_manager.run_pass(p)
  File "/android1/cvise/build/cvise/cvise.py", line 394, in reduce
  File "/android1/cvise/build/cvise.py", line 1051, in <module>
marxin commented 3 years ago

When I run cvise, it sometimes seems to stall and stop making progress. When I check htop, it doesn't look like my cores are doing any work related to cvise. It looks like

00:00:03 INFO ===< LinesPass::0 >===

is what's running. I'm running cvise with --n 71.

Interesting. How long does such a stall take? I quite often run with -n 64 on a machine and it seems fine.

Are there synchronization points where all threads need to wait on one to make progress? Bisection will eventually converge, but it seems to take quite some time to make forward progress.

What I do is start N jobs (if we have enough "states") and wait for the first successful one. Once that happens, I wait for all jobs started before the first successful one; the rest are cancelled.

Playing with ps -ef, it looks like there are 71 instances of the python interpreter running, with 0 instances of clang, clang_delta, or my reproducer shell script.

Anyway, I think it's a stall in the Pebble library and it would be worth creating a reproducer for it and reporting it upstream.

marehr commented 3 years ago

I'm not sure if this is related, but for me the "UnIfDefPass" step sometimes stalls and makes no progress, even though that step has nothing to do. After re-running the script several times, it will eventually skip that step.

The timeout does not seem to interrupt it.

I'm calling cvise like this:

cvise --n 40 --skip-key-off --timeout 20 --clang-delta-std=c++20 ./creduce_probe.sh ice.cpp

I'm not on the newest version right now (cvise 2.2.0 (4c7f4cd)), so I'll first update and reevaluate the situation.

marxin commented 3 years ago

I'm not sure if this is related, but for me the "UnIfDefPass" step sometimes stalls and makes no progress, even though that step has nothing to do. After re-running the script several times, it will eventually skip that step.

It's likely something different. Can you please paste the output of cvise with --verbose?

The timeout does not seem to interrupt it.

I'm calling cvise like this:

cvise --n 40 --skip-key-off --timeout 20 --clang-delta-std=c++20 ./creduce_probe.sh ice.cpp

I'm not on the newest version right now (cvise 2.2.0 (4c7f4cd)), so I'll first update and reevaluate the situation.

There's likely no change since 2.2.0 that would resolve it. Would it be possible to also attach a reproducer?

marxin commented 3 years ago

Would it be possible to also attach a reproducer?

And please open a new issue for it.

nickdesaulniers commented 2 years ago

What I do is start N jobs (if we have enough "states") and wait for the first successful one. Once that happens, I wait for all jobs started before the first successful one; the rest are cancelled.

Right, the call to pool.join() seems to be taking a long time.

https://github.com/noxdafox/pebble/issues/50 looks related, though it sounds like that was fixed in 4.5.1.

pip3 list | grep Pebble
Pebble                      4.6.1

Otherwise https://github.com/noxdafox/pebble/issues/62 and https://github.com/noxdafox/pebble/issues/33 look potentially related.

One thing I found interesting when grepping for the parent pid:

$ ps -ef | grep 1052950
ndesaul+ 1052950 3884808  2 14:44 pts/4    00:02:06 python3 /android1/cvise/build/cvise.py --n 71 ./repro.sh extable.i
ndesaul+ 1088933 1052950  2 14:46 pts/4    00:02:00 python3 /android1/cvise/build/cvise.py --n 71 ./repro.sh extable.i
ndesaul+ 1368039 1052950  0 15:55 pts/4    00:00:00 python3 /android1/cvise/build/cvise.py --n 71 ./repro.sh extable.i

makes it look like one child has been running for 2 minutes and the rest for 0?

nickdesaulniers commented 2 years ago

If I drop --n to 4 from 71, it's much faster.

nickdesaulniers commented 2 years ago

moving from 1a9f47f53037 to 21c4e265556bba062898cbb673982a6fa1444d1f seems to have helped.

no --n:

00:17:04 INFO ===================== done ====================

(previously, reducing the same input with the older version of cvise took over 1hr10min)

--n 71 (long pauses were observed while running LinesPass::0 where cores look idle):

00:24:50 INFO (97.9%, 54600 bytes, 429 lines)
(killed, still running)

here's the Python trace of the long-running child (--n 71):

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/lib/python3.9/threading.py", line 574, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.9/multiprocessing/managers.py", line 176, in serve_forever
    self.stop_event.wait(1)
  File "/usr/lib/python3.9/multiprocessing/managers.py", line 591, in _run_server
    server.serve_forever()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.9/multiprocessing/process.py", line 571, in _bootstrap
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.9/multiprocessing/process.py", line 633, in start
  File "/usr/lib/python3.9/multiprocessing/managers.py", line 1322, in start
    # launching the manager process, so that concurrent
  File "/usr/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/android1/cvise/build/cvise/utils/testing.py", line 472, in run_pass
    m = Manager()
  File "/android1/cvise/build/cvise/cvise.py", line 166, in _run_additional_passes
    self.test_manager.run_pass(p)
  File "/android1/cvise/build/cvise/cvise.py", line 399, in reduce
  File "/android1/cvise/build/cvise.py", line 1073, in <module>

and an idle child:

(gdb) py-bt
Traceback (most recent call first):
  <built-in method acquire of _multiprocessing.SemLock object at remote 0x7fbf369e1630>
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/channel.py", line 146, in unix_acquire
    return (self.reader_mutex.acquire(timeout=LOCK_TIMEOUT) and
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/channel.py", line 136, in __enter__
    if self.acquire():
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/channel.py", line 111, in lock
    with self.mutex:
  <built-in method next of module object at remote 0x7fbf375228b0>
  File "/usr/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 460, in task_transaction
    with channel.lock:
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 453, in fetch_task
    return task_transaction(channel)
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 439, in worker_get_next_task
    yield fetch_task(channel)
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 423, in worker_process
    for task in worker_get_next_task(channel, params.max_tasks):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.9/multiprocessing/process.py", line 571, in _bootstrap
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.9/multiprocessing/process.py", line 633, in start
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/common.py", line 397, in launch_process
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 648, in new_worker
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 380, in create_workers
    self.new_worker()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 205, in start
    self.worker_manager.create_workers()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 69, in _start_pool
    self._pool_manager.start()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/base_pool.py", line 100, in _update_pool_state
    self._start_pool()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/base_pool.py", line 91, in _check_pool_state
    self._update_pool_state()
  File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 102, in schedule
    self._check_pool_state()
  File "/android1/cvise/build/cvise/utils/testing.py", line 704, in run_parallel_tests
  File "/android1/cvise/build/cvise/utils/testing.py", line 776, in run_pass
  File "/android1/cvise/build/cvise/cvise.py", line 166, in _run_additional_passes
    self.test_manager.run_pass(p)
  File "/android1/cvise/build/cvise/cvise.py", line 399, in reduce
  File "/android1/cvise/build/cvise.py", line 1073, in <module>

Am I understanding correctly that pebble is blocking for 60s if it fails to acquire a lock? https://github.com/noxdafox/pebble/blob/b7bde8c5c7f44c0fd526c1db33bc972913778765/pebble/pool/channel.py#L156 https://github.com/noxdafox/pebble/blob/b7bde8c5c7f44c0fd526c1db33bc972913778765/pebble/pool/channel.py#L198
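
To make the effect concrete, here is a minimal standalone sketch (my own illustration, not Pebble's actual code) of how a long acquire timeout on a contended multiprocessing lock turns into a stall with idle cores:

# Minimal sketch, not Pebble's code: a worker retrying a contended lock with a
# long timeout sits idle (no CPU work) for the whole timeout window.
import multiprocessing
import time

LOCK_TIMEOUT = 60  # the value Pebble's channel.py appears to use

lock = multiprocessing.Lock()
lock.acquire()  # simulate another worker holding the channel mutex

start = time.monotonic()
got_it = lock.acquire(timeout=LOCK_TIMEOUT)  # blocks for the full minute
print(got_it, round(time.monotonic() - start, 1))  # False, ~60.0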

nickdesaulniers commented 2 years ago

File "/usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/process.py", line 460, in task_transaction with channel.lock:

https://docs.python.org/3/library/threading.html#using-locks-conditions-and-semaphores-in-the-with-statement

so no timeout is specified?

nickdesaulniers commented 2 years ago

no --n: 00:17:04 INFO ===================== done ====================

^ so 36 parallel tests

--n 4:

00:28:47 INFO ===================== done ====================

--n 1:

01:30:51 INFO ===================== done ====================

I should mention this is a dual-socket system; I wonder if synchronization overhead is much, much worse due to NUMA? Perhaps setting affinity to one socket would help? Perhaps if I play with numactl...

$ numactl --cpunodebind=0 /android1/cvise/build/cvise.py ./repro.sh extable.i
00:15:20 INFO ===================== done ====================

so that is slightly faster, though these results are probably non-deterministic by nature. I didn't observe any pauses during that run.

If I double the number of threads (--n 72) AND use numactl, then I swear I keep seeing one-minute pauses where no progress is made and my cores look idle.

00:01:11 INFO (74.3%, 669465 bytes, 17621 lines)
<pause>
00:02:12 INFO (74.3%, 669367 bytes, 17619 lines)
...
00:02:28 INFO (88.2%, 307085 bytes, 2521 lines)
<pause>
00:03:29 INFO (88.8%, 292190 bytes, 2473 lines)
...
00:06:30 INFO (98.6%, 35650 bytes, 255 lines)
<pause>
00:07:30 INFO (98.6%, 35591 bytes, 254 lines)
...
00:07:31 INFO (98.7%, 35035 bytes, 252 lines)
<pause>
00:08:31 INFO (98.7%, 34907 bytes, 251 lines)
...
00:08:53 INFO (99.0%, 25463 bytes, 473 lines)
<pause>
00:09:53 INFO (99.0%, 25043 bytes, 456 lines)
...
00:09:54 INFO (99.1%, 24252 bytes, 424 lines)
<pause>
00:10:55 INFO (99.1%, 24070 bytes, 416 lines)
...
00:10:57 INFO (99.1%, 23072 bytes, 380 lines)
<pause>
00:11:57 INFO (99.1%, 22932 bytes, 376 lines)
# etc

I definitely think pebble's 60s LOCK_TIMEOUT is related. If I hack up /usr/local/google/home/ndesaulniers/.local/lib/python3.9/site-packages/pebble/pool/channel.py locally and set it to 10, then oversubscribe my system again, I do still see idle pauses, but they drop down to ~10s.

# LOCK_TIMEOUT = 1
$ numactl --cpunodebind=0 /android1/cvise/build/cvise.py --n 72 ./repro.sh extable.i
00:20:33 INFO ===================== done ====================

wait_for_first_success might throw an exception though if I lower it too far and the interestingness test doesn't complete in time.

diff --git a/cvise/utils/testing.py b/cvise/utils/testing.py
index 5deb23e7d60b..47edd950206c 100644
--- a/cvise/utils/testing.py
+++ b/cvise/utils/testing.py
@@ -427,6 +427,7 @@ class TestManager:
     def run_parallel_tests(self):
         assert not self.futures
         assert not self.temporary_folders
+        pebble.process.channel.LOCK_TIMEOUT = 0.1
         with pebble.ProcessPool(max_workers=self.parallel_tests) as pool:
             order = 1
             self.timeout_count = 0
$ /android1/cvise/build/cvise.py ./repro.sh extable.i
00:17:16 INFO ===================== done ====================

actually, this is probably about the fastest I can run:

diff --git a/cvise/utils/testing.py b/cvise/utils/testing.py
index 5deb23e7d60b..e697bd54b0bf 100644
--- a/cvise/utils/testing.py
+++ b/cvise/utils/testing.py
@@ -421,8 +421,11 @@ class TestManager:

     @classmethod
     def terminate_all(cls, pool):
+        temp = pebble.process.channel.LOCK_TIMEOUT
+        pebble.process.channel.LOCK_TIMEOUT = 0.001
         pool.stop()
         pool.join()
+        pebble.process.channel.LOCK_TIMEOUT = temp

     def run_parallel_tests(self):
         assert not self.futures
$ numactl --cpunodebind=0 /android1/cvise/build/cvise.py ./repro.sh extable.i
00:15:26 INFO ===================== done ====================

so there's probably 2 things we can do immediately:

  1. don't wait 1 minute sometimes when simply trying to join threads (above diff)
  2. detect+warn users of NUMA systems that there's too much synchronization overhead and that it might be faster if they run under numactl (a rough sketch of such a check is below).
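
For (2), a rough sketch of what such a check could look like (my own hypothetical code, not part of cvise; it just counts Linux NUMA nodes under /sys):

# Hypothetical sketch, not cvise code: warn when the machine has more than one
# NUMA node, since oversubscription across sockets seemed to hurt badly here.
import glob

def numa_node_count():
    # Linux exposes one directory per NUMA node under /sys.
    return len(glob.glob('/sys/devices/system/node/node[0-9]*')) or 1

if numa_node_count() > 1:
    print('Warning: NUMA system detected; if you see long idle pauses, '
          'consider pinning cvise to one node, e.g. "numactl --cpunodebind=0".')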

I'm curious. It seems like cvise starts N workers, then terminates them all once the first has found whether its mutated input is interesting? So it seems like we get killed in synchronization overhead for that and don't make much forward progress the larger N gets. It also seems like it probably throws out a lot of forward progress each thread may have made individually.

I wonder if you could instead have each thread generate a diff when it found something interesting, then during synchronization apply all of the diffs all of the workers may have found in a round?
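
As a rough sketch of that idea (a hypothetical helper, not cvise's API; it assumes each worker reports the set of line indices it managed to remove):

# Hypothetical sketch of merging per-worker results; not cvise's actual API.
def merge_candidates(original_lines, per_worker_removed_sets, is_interesting):
    # Union of every line index some worker proved removable on its own.
    removed = set().union(*per_worker_removed_sets)
    merged = [line for i, line in enumerate(original_lines) if i not in removed]
    # Individually interesting removals may conflict once combined, so the
    # merged candidate still has to pass the interestingness test.
    return merged if is_interesting(merged) else None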

I'm also curious if it makes more sense to dynamically scale the number of workers based on how many are finding inputs interesting over time? If most mutations are generally interesting, then having more than 1 worker is probably overkill. If most mutations are generally uninteresting, having as many workers as possible working on candidates is probably profitable.
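
A hedged sketch of that adaptive idea (made-up names, nothing here exists in cvise): shrink the pool while most candidates succeed and grow it while most fail.

# Hypothetical sketch: adapt the number of parallel tests to the success ratio
# of recent interestingness tests. Not cvise code.
def next_worker_count(current, recent_successes, lo=1, hi=72):
    if not recent_successes:
        return current
    ratio = sum(recent_successes) / len(recent_successes)
    if ratio > 0.5:    # most mutations are interesting: parallelism is wasted
        return max(lo, current // 2)
    if ratio < 0.1:    # almost everything fails: more workers are profitable
        return min(hi, current * 2)
    return current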

I'm also curious if we're hitting a priority inversion between the parent and child threads?

marxin commented 2 years ago

Thank you very much for the nice investigation!

actually, this is probably about the fastest I can run:

diff --git a/cvise/utils/testing.py b/cvise/utils/testing.py
index 5deb23e7d60b..e697bd54b0bf 100644
--- a/cvise/utils/testing.py
+++ b/cvise/utils/testing.py
@@ -421,8 +421,11 @@ class TestManager:

     @classmethod
     def terminate_all(cls, pool):
+        temp = pebble.process.channel.LOCK_TIMEOUT
+        pebble.process.channel.LOCK_TIMEOUT = 0.001
         pool.stop()
         pool.join()
+        pebble.process.channel.LOCK_TIMEOUT = temp

     def run_parallel_tests(self):
         assert not self.futures

I do welcome the patch; please make a separate pull request for it. Moreover, we can likely notify the Pebble project that this can really cause trouble.

2. detect+warn users of NUMA systems that there's too much synchronization overhead and that it might be faster if they run under numactl.

I support that as well.

marxin commented 2 years ago

I'm curious. It seems like cvise starts N workers, then terminates them all once the first has found whether its mutated input is interesting?

Once the first one finds something interesting, we wait for all jobs that started before it and kill all that started after it. That way we get deterministic output.
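
In sketch form, this is my reading of that policy (using concurrent.futures and a hypothetical .interesting attribute on results; not the actual cvise/Pebble code):

# Sketch of the policy described above; `futures` is in job-start order.
from concurrent.futures import wait, FIRST_COMPLETED

def pick_deterministic_winner(futures):
    pending = set(futures)
    while pending:
        finished, pending = wait(pending, return_when=FIRST_COMPLETED)
        if any(f.result().interesting for f in finished):
            break
    else:
        return None  # no job produced an interesting result
    # Earliest job (in start order) currently known to be interesting.
    first = min(i for i, f in enumerate(futures) if f.done() and f.result().interesting)
    # Wait for every job started before it; an earlier success takes priority.
    wait(futures[:first])
    first = min(i for i, f in enumerate(futures[:first + 1]) if f.result().interesting)
    # Everything started after the winner gets cancelled.
    for f in futures[first + 1:]:
        f.cancel()
    return futures[first].result()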

So it seems like we get killed in synchronization overhead for that and don't make much forward progress the larger N gets. It also seems like it probably throws out a lot of forward progress each thread may have made individually.

Yes, that can possibly happen.

I wonder if you could instead have each thread generate a diff when it found something interesting, then during synchronization apply all of the diffs all of the workers may have found in a round?

I like the idea; it's basically about combining all results and running the interestingness test on the combination.

I'm also curious if it makes more sense to dynamically scale the number of workers based on how many are finding inputs interesting over time? If most mutations are generally interesting, then having more than 1 worker is probably overkill. If most mutations are generally uninteresting, having as many workers as possible working on candidates is probably profitable.

Yep, we can dynamically calculate the success ratio and adapt the number of workers. Or we can at least add '+'/'-' keyboard shortcuts one could use to change the number of workers on the fly?

I'm also curious if we're hitting a priority inversion between the parent and child threads?

I don't understand that one; can you please explain it?

Anyway, great observations.

nickdesaulniers commented 2 years ago

I need to play with my patch more. I thought I had it solved last night, but if I crank --n up to nproc times 2, then I see minute-long pauses again where my CPU cores are idle.

nickdesaulniers commented 2 years ago

Perhaps the Python bindings to libuv might help.

nickdesaulniers commented 2 years ago

heh, running cvise on my dual-core hyperthreaded laptop is faster than on my 72-core Xeon workstation... same reproducer, same inputs, same compiler under test...

workstation:

01:04:12 INFO ===================== done ====================

laptop:

00:35:53 INFO ===================== done ====================
marxin commented 2 years ago

Perhaps the Python bindings to libuv might help.

Do you mean using it in the Pebble project?

nickdesaulniers commented 2 years ago

Or swap Pebble for libuv. I wonder if we could speed things up by sending SIGTERM to processes we want to terminate.
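
A minimal sketch of the SIGTERM idea (an assumed approach, not what cvise currently does; the reproducer name is just the one from this thread): start the interestingness test in its own process group and signal the whole group on cancellation.

# Hedged sketch: run the interestingness test in its own process group so the
# whole tree (shell script, compiler, ...) can be torn down with one SIGTERM.
import os
import signal
import subprocess

proc = subprocess.Popen(['./repro.sh', 'extable.i'], start_new_session=True)
try:
    proc.wait(timeout=20)
except subprocess.TimeoutExpired:
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # terminate the whole group
    proc.wait()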

marxin commented 2 years ago

Or swap Pebble for libuv.

But does it provide a process pool? I can't find one.

About SIGTERM: can you please investigate which signal Pebble uses when a future is cancelled?

marxin commented 2 years ago

Hi.

Note there's a recent Pebble 5.0 release: https://github.com/noxdafox/pebble/releases/tag/5.0.0

Does the problem still persist? If so, please provide a reproducer I can play with.

nickdesaulniers commented 2 years ago

I've recently moved to a newer workstation, one that's not dual-socket, so I might no longer be able to reproduce this. I'll report back once I've had another opportunity to run cvise.

Besides, I suspect a reproducer would need to be run on a NUMA machine...

marxin commented 2 years ago

All right, thanks!