Parallelization limit causes unintuitive timing benchmarks (was: Multi-thread support is unclear)

vincentelfving commented 5 years ago

I am running the SDK QVM version 1.3.2 and ran into a very puzzling issue:

As far as I understand, the QVM supports multiple workers living on different threads. This is consistent with what I see with qvm --benchmark, which by default runs a 26-qubit experiment. I'm running on a 12-core AMD 1920X, which supports 24 threads, and during the benchmark I observe that all CPU threads are active.

However, I am observing a strange deviation in run-time, in addition to CPU usage jumping from 100% to ~1500%, when I increase the qubit number from 18 to 19.

Minimal working example (I ran the program below through pyquil rather than directly in Lisp; further down I show another example using the command-line QVM):

from pyquil.quil import Program
from pyquil.gates import H, CNOT, MEASURE
from pyquil.api import get_qc
import time

def test_multi_thread(N):
    prog = Program()
    ro = prog.declare('ro', memory_type='BIT', memory_size=N)

    prog += H(0)
    for j in range(N-1):
        prog += CNOT(j, j+1)

    for j in range(N):
        prog += MEASURE(j, ro[j])

    prog.wrap_in_numshots_loop(40)
    qc = get_qc(str(N)+'q-qvm')
    binary = qc.compile(prog)

    t = time.time()
    bitstrings = qc.run(binary)
    return time.time()-t

for n in [16, 17, 18, 19, 20, 21, 22]:
    print(str(n)+'-qubit experiment took ' + str(test_multi_thread(n))[:5] + ' seconds')

Then, the output for my particular machine is:

16-qubit experiment took 5.752 seconds
17-qubit experiment took 12.02 seconds
18-qubit experiment took 26.18 seconds
19-qubit experiment took 8.229 seconds
20-qubit experiment took 14.66 seconds
21-qubit experiment took 29.66 seconds
22-qubit experiment took 53.15 seconds

Clearly, the 19-qubit experiment took less time than the 18-qubit and even the 17-qubit case. The system monitor shows activity on a single thread up to 18 qubits and on multiple threads from 19 qubits onward, so my hypothesis is: is multi-threading enabled only past 18 qubits?
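The pattern is easier to see as ratios between consecutive runs. A small sketch, using only the timings reported above (the function and variable names are illustrative):

```python
# Wall-clock timings in seconds, copied from the output reported above.
timings = {16: 5.752, 17: 12.02, 18: 26.18, 19: 8.229,
           20: 14.66, 21: 29.66, 22: 53.15}

def step_ratios(times):
    """Return {n: t(n) / t(previous n)} for consecutive qubit counts."""
    ns = sorted(times)
    return {n: times[n] / times[prev] for prev, n in zip(ns, ns[1:])}

ratios = step_ratios(timings)
for n, r in ratios.items():
    print(f"{n - 1} -> {n} qubits: x{r:.2f}")
```

Each added qubit roughly doubles the run time, except for the step from 18 to 19, where the time drops to about a third.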

Also, I tested it by running the benchmark and it gives me this for 18 qubits:

qvm --benchmark 18
******************************
* Welcome to the Rigetti QVM *
******************************
Copyright (c) 2016-2019 Rigetti Computing.

This is a part of the Forest SDK. By using this program
you agree to the End User License Agreement (EULA) supplied
with this program. If you did not receive the EULA, please
contact <support@rigetti.com>.

(Configured with 10240 MiB of workspace and 24 workers.)

<134>1 2019-02-20T21:12:36Z vincentelfving-linux qvm 21685 - - Selected simulation method: pure-state
<134>1 2019-02-20T21:12:36Z vincentelfving-linux qvm 21685 - - Computing baseline serial norm timing...
<134>1 2019-02-20T21:12:36Z vincentelfving-linux qvm 21685 - - Baseline serial norm timing: 0 ms
<134>1 2019-02-20T21:12:36Z vincentelfving-linux qvm 21685 - - Starting "bell" benchmark with 18 qubits...

Evaluation took:
  0.134 seconds of real time
  0.134324 seconds of total run time (0.134324 user, 0.000000 system)
  100.00% CPU

which shows 100.00% CPU for 18 qubits, while if I select a benchmark with 19 qubits:

qvm --benchmark 19
******************************
* Welcome to the Rigetti QVM *
******************************
Copyright (c) 2016-2019 Rigetti Computing.

This is a part of the Forest SDK. By using this program
you agree to the End User License Agreement (EULA) supplied
with this program. If you did not receive the EULA, please
contact <support@rigetti.com>.

(Configured with 10240 MiB of workspace and 24 workers.)

<134>1 2019-02-20T21:12:31Z vincentelfving-linux qvm 21660 - - Selected simulation method: pure-state
<134>1 2019-02-20T21:12:31Z vincentelfving-linux qvm 21660 - - Computing baseline serial norm timing...
<134>1 2019-02-20T21:12:31Z vincentelfving-linux qvm 21660 - - Baseline serial norm timing: 1 ms
<134>1 2019-02-20T21:12:31Z vincentelfving-linux qvm 21660 - - Starting "bell" benchmark with 19 qubits...

Evaluation took:
  0.087 seconds of real time
  1.281765 seconds of total run time (1.079780 user, 0.201985 system)
  1473.56% CPU

it shows 1473.56% CPU... another indicator. Note that I find the same results with the option -w 24 added (which makes sense, it already defaulted to my system max of 24).
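The CPU percentage in these reports is CPU time divided by wall-clock time, so it can be read as the average number of cores kept busy. A quick check with the figures above (the helper name is illustrative; the result differs slightly from the reported percentage because the real time shown is rounded):

```python
# Figures copied from the two "Evaluation took" reports above.
runs = {
    18: {"real": 0.134, "total": 0.134324},
    19: {"real": 0.087, "total": 1.281765},
}

def busy_cores(real_s, total_s):
    """Average number of cores kept busy: CPU time / wall-clock time."""
    return total_s / real_s

for n, r in runs.items():
    print(f"{n} qubits: ~{busy_cores(r['real'], r['total']):.1f} cores busy")
```

So the 18-qubit run keeps about one core busy, while the 19-qubit run averages between 14 and 15 of the 24 available workers.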

Is this behaviour reproduced on your side? If so, is it intentional?

stylewarning commented 5 years ago

Hey @vincentelfving, thanks for the comment.

It is an open issue to define the proper "parallelization limit". This is statically defined at 19 qubits currently here, but it should be calibrated on a per-machine basis. The number 19 was chosen because that's what it was for one model of laptop I was using. This issue is described here. This would be a wonderful contribution to determine where the crossing point is.
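Why such a crossing point exists can be sketched with a toy cost model. Everything here is an assumption for illustration, not how the QVM measures or sets its limit: serial time is taken as proportional to the 2^n amplitudes, and parallel time as the same work split evenly over the workers plus a fixed dispatch overhead. The constants are hypothetical placeholders, not measurements.

```python
def serial_time(n, per_amp):
    """Toy model: time proportional to the 2**n amplitudes touched."""
    return per_amp * 2 ** n

def parallel_time(n, per_amp, workers, overhead):
    """Toy model: work split evenly across workers, plus a fixed dispatch cost."""
    return overhead + per_amp * 2 ** n / workers

def crossover_qubits(per_amp, workers, overhead, max_qubits=40):
    """Smallest qubit count at which the parallel model beats the serial one."""
    for n in range(1, max_qubits + 1):
        if parallel_time(n, per_amp, workers, overhead) < serial_time(n, per_amp):
            return n
    return None  # parallel never wins within max_qubits

# Hypothetical constants: 1 ns per amplitude, 24 workers, 0.2 ms overhead.
print(crossover_qubits(per_amp=1e-9, workers=24, overhead=2e-4))
```

Changing either constant moves the crossover, which is the sense in which a single hard-coded limit cannot fit every machine.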

As a side note, as you increase the number of cores, for some qubit numbers, adding a new core doesn't actually give you a speedup. For instance, on my machine for a 32q benchmark, 10 cores vs 20 doesn't provide anything. This is something I intend to investigate to see if we can eke out more parallelization speedup, and if not, determine why (e.g., memory bandwidth, blowing the cache, etc.).
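For a sense of scale on the memory side, the state vector for n qubits holds 2^n complex amplitudes. The sketch below assumes 16 bytes per amplitude (two 64-bit floats); that figure is an assumption here, not something stated in this thread.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: one complex number as two 64-bit floats

def state_vector_mib(n_qubits):
    """Size in MiB of a dense state vector with 2**n_qubits amplitudes."""
    return (2 ** n_qubits) * BYTES_PER_AMPLITUDE / 2 ** 20

for n in (18, 19, 26, 32):
    print(f"{n} qubits: {state_vector_mib(n):g} MiB")
```

Under that assumption a 32-qubit state occupies 64 GiB, far more than any CPU cache, so each gate application has to stream through main memory. That would be consistent with memory bandwidth, rather than core count, being a limiting factor at that size.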

vincentelfving commented 5 years ago

@stylewarning alright, that makes sense! I realize that parallelization speedup is actually highly non-trivial.... I guess there is not 1 magic number. I would, however, love to see if it is possible to gain more speedup for low qubit numbers like N=4-18, for testing in cases where a huge number of (variational) circuits and iterations are necessary. In that case even every microsecond per circuit adds up significantly to the total time.

stylewarning commented 5 years ago

@vincentelfving I just made a PR #31 to allow this parameter to at least be controllable by you, though it won't be calculated automagically. It should be merged by EOD, if you care to build from source. If not, it'll be in the next release of the QVM.

vincentelfving commented 5 years ago

@stylewarning excellent, thanks a lot!

stylewarning commented 5 years ago

@vincentelfving Merged. Also, I notice you don't seem to be aware of the --compile or -c option. Try doing:

qvm --verbose --benchmark

and then

qvm --verbose --benchmark -c

It's not always the right thing to do, especially if you have a very small number of qubits, but it can bring great wins otherwise.

quil-lang / qvm

Parallelization limit causes unintuitive timing benchmarks (was: Multi-thread support is unclear) #29