stavros11 closed this issue 2 years ago.
Here are some numbers using the compare.py script for Qibo (default qibojit backend), qiskit and qulacs, all on the qibo machine CPU. Note that unlike in our first paper, Qiskit uses all threads and its performance is particularly good. I confirmed in several cases that the correct wavefunction is returned, so the simulation is not skipped. I am not sure if they do some kind of circuit simplification to achieve that performance.
@scarrazza you can confirm Qiskit's performance by running something simple, e.g. a QFT for 30 qubits: python compare.py --nqubits 30 --circuit qft --library qiskit. On the qibo machine this takes 37sec with Qiskit, 50sec with Qibo and 80sec with Qulacs.
EDIT: Added qibotf times.
Thanks for these numbers, do you have similar numbers for qibotf? For some circuits, like hs and qv, the difference is too large; are you sure that qiskit is using the CPU instead of the GPU? What is the average total program execution time? Maybe qiskit is precomputing objects during the circuit definition. Is the final state vector the same for all backends?
Btw, how many threads is qiskit using? It might be possible that this value is different from our default; e.g. limiting the number of threads might have an impact.
Btw 2, is qiskit really using double precision? If I set qibo to single, I get numbers which are quite close to qiskit's...
Thanks for the response and the questions. Some quick answers:
Thanks for these numbers, do you have similar numbers for qibotf?
I added the possibility to use qibotf in the same script in the latest push; I will update the above tables once I have the numbers. I don't expect much difference from qibojit, and it will certainly not be much closer to Qiskit.
For some circuits like hs and qv the difference is too large, are you sure that qiskit is using CPU instead of GPU?
I haven't checked htop explicitly during all benchmarks, but all the Qiskit runs I checked used the CPU. I think Qiskit only uses the GPU when the appropriate simulator is used. I also used export CUDA_VISIBLE_DEVICES="" before running all these benchmarks.
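For reference, hiding the GPUs this way is just the standard CUDA environment variable; any CUDA-aware backend then sees no devices and falls back to CPU:

```shell
# With CUDA_VISIBLE_DEVICES set to an empty string, GPU-capable libraries
# see no CUDA devices and fall back to their CPU code path.
export CUDA_VISIBLE_DEVICES=""
echo "CUDA_VISIBLE_DEVICES='${CUDA_VISIBLE_DEVICES}'"
```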
What is the average total program execution time, maybe qiskit is precomputing objects during the circuit definition?
The benchmark script logs the circuit creation time too, which in this case corresponds to transforming the OpenQASM circuit into the library circuit. Here are the numbers from the above benchmarks:
Indeed, Qiskit has slightly higher creation time in all cases but still wins when considering the sum of creation + execution.
Is the final state vector the same for all backends?
This is exactly what is tested in the new test_libraries.py for all circuits, except qv due to the U3 convention issue. I will try to do a check using the benchmark script too, but from a quick look it seems that Qiskit returns the expected states; that's why I wrote that I don't think something strange like skipping the simulation happens.
Btw, how many threads is qiskit using? It might be possible that this value is different from our default, e.g. limiting the number of threads might have an impact.
Qiskit and Qulacs use all available threads while Qibo uses half of them. This may cause some of the difference but I don't think it explains all of it. In past Qibo benchmarks, using all threads made minimal difference in performance.
Btw 2, is qiskit really using double precision? If I set qibo to single, I get numbers which are quite close to qiskit...
I am not sure exactly what happens during simulation, but if I check result.dtype on the state returned by Qiskit I get complex128. Also, according to their docs, double precision is used by default.
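As a side note on the precision question: a factor of roughly 2x from single precision is consistent with the statevector simply being half the size, since gate kernels are largely memory-bound. A numpy-only sketch (not Qiskit's internals):

```python
# Memory footprint of a statevector at the two precisions: the complex64
# state is exactly half the size of the complex128 one, so memory-bound
# gate kernels can run roughly twice as fast on it.
import numpy as np

nqubits = 20
state_single = np.zeros(2**nqubits, dtype=np.complex64)
state_double = np.zeros(2**nqubits, dtype=np.complex128)

print(state_single.nbytes)  # 8 bytes per amplitude
print(state_double.nbytes)  # 16 bytes per amplitude
```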
@stavros11 thanks for the comments. I have tested and indeed qiskit is 2x faster when using single precision. Starting from the QFT, if I keep only the first layer of H gates, qiskit is 1s faster than qibo. At this point we should revisit each gate; if single gates have similar performance, then I agree that some extra parallelization is performed by qiskit.
In particular, if I yield just 1 Hadamard, the qibo performance is better than qiskit's; however, as soon as I include 5 Hadamards, one per qubit, the qiskit performance is better, so this sounds like circuit fusion/block parallelization.
Following their docs I think this latest version of qiskit:
Last comment about that: if I set self.simulator.set_options(fusion_enable=False) and use all threads in qibojit, I get almost the same performance for qiskit and qibo. So it is the fusion that accelerates the computation.
We should look into that and check if qibo can support it.
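For intuition, the effect of fusion can be sketched in plain numpy (a toy illustration, not Qiskit's actual implementation): consecutive small gate matrices are multiplied together first, so the full statevector is traversed once instead of once per gate.

```python
import numpy as np

# Two single-qubit gates acting on a one-qubit state.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
RZ = np.diag([1.0, np.exp(1j * 0.3)])          # a phase rotation

state = np.array([1.0 + 0j, 0.0 + 0j])

# Unfused: two sweeps over the state, one per gate.
unfused = RZ @ (H @ state)

# Fused: combine the 2x2 gate matrices first, then a single sweep.
fused = (RZ @ H) @ state

print(np.allclose(unfused, fused))  # True: same result, fewer state sweeps
```

For large statevectors the matrix-matrix product of the small gates is negligible, while each avoided sweep over the 2^n amplitudes is a real saving.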
Last comment about that, if I set self.simulator.set_options(fusion_enable=False) and use all threads in qibojit, I get almost the same performance for qiskit and qibo. So, it is the fusion that accelerates the computation.
I have been doing the benchmark using the same option and can confirm that performance is the same as Qibo's. Here are the results for all circuits:
So results are pretty much similar, with the exception of dry run times for small qubit numbers. I am not sure if this can be improved if we disable parallelization for nqubits < 14, as Qiskit does by default.
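The cutoff idea could look like the following dispatch sketch (the threshold of 14 mirrors the Qiskit default mentioned above; `apply_elementwise` and its kernels are hypothetical stand-ins, not Qibo code):

```python
from concurrent.futures import ThreadPoolExecutor

def apply_elementwise(state, update, nqubits, parallel_threshold=14):
    # Hypothetical sketch: below the threshold the thread overhead
    # outweighs the work, so run serially; above it, split the sweep
    # over the state across threads.
    if nqubits < parallel_threshold:
        return [update(x) for x in state]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(update, state))

# Same result either way; only the execution strategy changes.
small = apply_elementwise([1, 2, 3], lambda x: x * 2, nqubits=4)
large = apply_elementwise([1, 2, 3], lambda x: x * 2, nqubits=20)
print(small, large)  # [2, 4, 6] [2, 4, 6]
```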
We should look into that and check if qibo can support it.
I agree we should revisit gate fusion in Qibo, and if performance is improved so much for the most common circuits we could consider making it the default, with some cut-off in the number of qubits. We should open an issue about that in Qibo.
By the way, I added the option to use Qiskit without fusion in the benchmark script (via --library qiskit-nofusion) and also GPU support (--library qiskit-gpu and --library qulacs-gpu). I noticed that when using Qiskit GPU the final state returned is wrong: tests do not pass, and if I print the final state from the benchmark it is different from the other backends (including Qiskit CPU).
I'm not yet sure if this is a bug with Qiskit or a problem in our code but will investigate it further (just noting it in case you try to run something in the meantime).
@stavros11 thank you very much for these numbers and confirmation. I agree concerning fusion and the possibility to set threads automatically, as you have posted in the issue. I will try the new GPU implementations tomorrow.
@stavros11 2 points:
Quick response before I take off for Abu Dhabi:
- which tests are failing for you with qiskit-gpu?
I was checking this thoroughly yesterday and, interestingly, the problem exists only on my local machine. I tried both the DGX and the qibo machine and qiskit-gpu works well there. On my machine I get errors even when using simple qiskit circuits, without all the benchmark code we have here. I'll give a simple script later. I'm not sure if it is related to the CUDA version or something is wrong in my configuration. I followed the same installation procedure everywhere (just pip install qiskit-aer-gpu).
So I believe the code here is okay to try GPU benchmarks as it is. We just need to expand by adding QCGPU and HyQuas.
- the qiskit-gpu performance does not change with the fusion_enable flag, does this happen also for you?
I haven’t checked how fusion affects GPU yet.
@stavros11, tests are passing on my pc; however, if I print the result during the dry run and the simulation run (like a manual --transfer with nrep=1) I get:
[0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j]
[benchmarks|INFO|2021-08-16 21:43:11]: dry_run_transfer_time: 0.0006430149078369141
[0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j]
[1.+0.j 0.+0.j 0.+0.j ... 0.+0.j 0.+0.j 0.+0.j]
[benchmarks|INFO|2021-08-16 21:42:39]: dry_run_transfer_time: 0.0003075599670410156
[1. +0.j 0.0625+0.j 0.0625+0.j ... 0. +0.j 0. +0.j 0. +0.j]
Does this happen for you?
@stavros11, tests are passing on my pc
Note that the tests that are uploaded on GitHub do not test the GPU backends. In order to test these you have to include "qiskit-gpu" and "qulacs-gpu" in the LIBRARIES list in conftest.py.
- for qibojit CPU/GPU and qiskit CPU I get sensible results:
[0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j] [benchmarks|INFO|2021-08-16 21:43:11]: dry_run_transfer_time: 0.0006430149078369141 [0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j]
- however for qiskit-gpu, the performance is quite strange (~4x faster than qibojit), and I get these wrong prints:
[1.+0.j 0.+0.j 0.+0.j ... 0.+0.j 0.+0.j 0.+0.j] [benchmarks|INFO|2021-08-16 21:42:39]: dry_run_transfer_time: 0.0003075599670410156 [1. +0.j 0.0625+0.j 0.0625+0.j ... 0. +0.j 0. +0.j 0. +0.j]
Does this happen for you?
Yes, I observe some strange behavior from qiskit-gpu on all machines. If I add "qiskit-gpu" to the tests, they fail on my machine but pass on the Qibo machine. However, when I print the state during the benchmark as in your example, I get wrong results on all machines. Also, the final state changes if I run the same script more than once, even though nothing random is involved.
Here is a simple script that reproduces these issues:
```python
import qiskit
from qiskit.providers.aer import StatevectorSimulator


def main(nqubits, nreps, gpu, transpile):
    for _ in range(nreps):
        circuit = qiskit.QuantumCircuit(nqubits)
        for i in range(nqubits):
            circuit.h(i)
        if gpu:
            simulator = StatevectorSimulator(method="statevector_gpu")
        else:
            simulator = StatevectorSimulator()
        if transpile:
            circuit = qiskit.transpile(circuit, simulator)
        print("nqubits:", nqubits)
        print("nreps:", nreps)
        print("gpu:", gpu)
        print("transpile:", transpile)
        result = simulator.run(circuit).result()
        print(result.get_statevector(circuit))
        print()
```
@scarrazza, if you try to run this with gpu = True and nreps > 1, it is very likely that you will get different states between repetitions even though the same circuit is simulated. If you run the same script more than once you may also get different states in each run. Currently on the qibo machine the problem appears only when nqubits >= 10; however, on my local machine I get it even for two qubits.
@stavros11 I confirm all your points. I was monitoring the GPU usage on different systems while running pytest and I realized that only on the qibomachine it doesn't seem to use any GPU during the tests, so maybe it is falling back to the CPU (I think qiskit provides some get_device method to check whether the backend is using the CPU or the GPU).
Did you try the qft using qiskit.*.library.QFT directly?
Did you try the qft using qiskit.*.library.QFT directly?
If I replace the circuit creation with qiskit.circuit.library.QFT in the above script, the problem remains for GPU. Note that for the built-in QFT I have to use the transpile option, otherwise I get a different error when attempting to get the statevector on both CPU and GPU:
qiskit.exceptions.QiskitError: 'Data for experiment "QFT" could not be found.'
@stavros11 I just monitored the pytest performance on test_libraries for 5, 10, 15, 26 qubits. Tests are failing for 15 and 26; for these tests I can see high GPU usage and low CPU usage, however for <= 10 the CPU usage is very high and the GPU usage is low. So I assume they have some fallback mechanism which selects the appropriate hardware.
As discussed today, let me suggest completing the other libraries listed in the first post, and making a final decision afterwards.
@stavros11 concerning qiskit, I have opened this issue https://github.com/Qiskit/qiskit-aer/issues/1319, and they have proposed a fix in this PR https://github.com/Qiskit/qiskit-aer/pull/1325. So it is a qiskit bug.
@stavros11 I have installed the aer master locally and indeed the GPU problem is fixed. On the other hand, their performance is about 2x slower than qibojit's.
@scarrazza here are some plots using the circuits and libraries we have so far for CPU:
It seems that creation time is the main bottleneck for some libraries and circuits. This is the time required to convert the circuit from Qasm to the library's format. For Qulacs I do this conversion manually, as I could not find a qasm parser in their docs, while for Qiskit I am using QuantumCircuit.from_qasm_str. Anyway, this time is logged separately from the simulation and dry run times, so we may choose not to include it in the plots if we wish, even though it will appear when simulating in practice as the circuit needs to be created.
Other than that, I will try to run some single precision benchmarks with Qibo, Cirq, Qiskit and TFQ because I could not find how to switch TFQ to double and also some GPU benchmarks with Qibo, QCGPU, Qulacs and Qiskit (if their GPU simulator is fixed). Let me know what other configurations and plots would be interesting.
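For illustration, the manual Qasm conversion route can be sketched with a toy parser for a tiny gate subset (the real parser in Qibo handles far more; `parse_simple_qasm` is a hypothetical name):

```python
# Toy parser for a tiny OpenQASM subset: it extracts (gate, qubits) tuples,
# which could then be mapped to any library's gate constructors.
def parse_simple_qasm(qasm):
    gates = []
    for raw in qasm.strip().splitlines():
        line = raw.strip().rstrip(";")
        if line.startswith("h "):
            gates.append(("h", int(line.split("[")[1].rstrip("]"))))
        elif line.startswith("cx "):
            qubits = [int(t.split("[")[1].rstrip("]")) for t in line[3:].split(",")]
            gates.append(("cx", qubits[0], qubits[1]))
    return gates

qasm = """
h q[0];
cx q[0],q[1];
"""
print(parse_simple_qasm(qasm))  # [('h', 0), ('cx', 0, 1)]
```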
Cool, thanks for these interesting results.
I think we should have a look at the dry run; I have the suspicion that our initialization overhead is not 100% due to *jit, but maybe due to the object allocations (gate matrices, etc...).
I think we should have a look at the dry run; I have the suspicion that our initialization overhead is not 100% due to *jit, but maybe due to the object allocations (gate matrices, etc...).
That is a good point and makes sense because some elements such as gate matrices are allocated during the first execution (which is the dry run) and cached for subsequent runs. However I tried executing the benchmark by recreating a new circuit object before every execution (dry run and simulation) and the difference between dry run and simulation remains. Here are some numbers:
Thanks for checking; this sounds like some for-loop overhead. At some point, after completing the codes/libs for this exercise, one should go step by step, profile the function calls and identify where we lose performance.
Thanks for checking; this sounds like some for-loop overhead. At some point, after completing the codes/libs for this exercise, one should go step by step, profile the function calls and identify where we lose performance.
I am not sure if this helps, but I tried profiling the benchmark script using cProfile and I noticed that the difference between the logged dry run time and simulation time is similar to the cumulative time of numba's Dispatcher.compile, which is logged in the profiling result file. So I tried profiling for multiple qubit numbers and circuit configurations and it appears that there is some kind of agreement:
Here dry run and simulation are logged by the benchmark script, while the numba compile is read from the cProfile output as the cumulative time of the Dispatcher.compile calls. It appears that this function explains a large part of the dry run overhead. The only thing that I cannot really explain is the negative difference that appears in two cases, for 25 qubits qft and variational.
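The profiling setup is essentially the standard cProfile/pstats pattern, shown here on a stand-in function rather than the actual numba kernels:

```python
import cProfile
import io
import pstats

def kernel(n):
    # Stand-in for a simulation kernel; in the real benchmark this is where
    # numba's Dispatcher.compile shows up on the first call.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
kernel(100_000)
profiler.disable()

# Dump cumulative times per function, sorted, as read off for the tables above.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```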
Here are some single precision plots including tfq:
Here dry run and simulation are logged by the benchmark script, while the numba compile is read from the cProfile output as the cumulative time of the Dispatcher.compile calls. It appears that this function explains a large part of the dry run overhead. The only thing that I cannot really explain is the negative difference that appears in two cases for 25 qubits qft and variational.
Thanks @stavros11, could you please rerun one of these examples after removing the cache=True flag? I find it quite strange that we have ~0.23s for loading; maybe this cache flag is not working?
Concerning the plots, do you understand why qiskit dry run is faster than simulation for 28 qubits qft?
Concerning the plots, do you understand why qiskit dry run is faster than simulation for 28 qubits qft?
I believe for qiskit there is no big difference between dry run and simulation for high number of qubits, here are the absolute numbers used in these plots for qiskit:
Note that in the bar plots I am normalizing all times with respect to qibo, that's why qibo is always 1.
Thanks @stavros11, could you please rerun one of these examples after removing the cache=True flag? I find it quite strange that we have ~0.23s for loading; maybe this cache flag is not working?
Thanks for this proposal. I removed the cache=True flag from all numba kernels and repeated the same measurement:
So the cache option seems to work, as it reduces the compilation time from 2sec to 0.2sec. Also, the observation that (dry run) - (simulation) - (numba compile) is almost 0, which we saw above, still holds when not using the cache.
Following our discussion, I started producing some plots that compare the times required for each part separately, as well as a total script (= import + circuit creation + dry run) comparison. During this I noticed that Qibo's import time is significantly longer than the other libraries', so I tabulated the import time for various configurations on the Qibo machine:
library | import time (sec) |
---|---|
qulacs | 0.00479 |
qiskit | 0.64582 |
cirq | 1.07991 |
qibo (numpy) | 0.27475 |
qibo (numpy + qibojit) | 0.48673 |
qibo (numpy + qibojit + tensorflow) | 1.41568 |
qibo (numpy + qibojit + tensorflow + qibotf) | 1.66606 |
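Import times like the ones above can be measured along these lines (stdlib-only sketch; `import_time` is a hypothetical helper, and the result is only meaningful for modules not already loaded in the process):

```python
import importlib
import time

def import_time(module_name):
    # Time a fresh import; modules already in sys.modules return almost
    # instantly, so run this in a fresh interpreter per library.
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

print(f"json: {import_time('json'):.5f} sec")
```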
It appears that our import time is very long because during import qibo we initialize all backends that are available. So if tensorflow is installed we will import it even if we don't use it. If we fix this, the above numbers show that our import time with qibojit as default will fall to about 0.15sec below qiskit's import time, which will counterbalance the 0.2sec loss from the dry run.
I will open a PR in Qibo that updates the backend initialization procedure so that only the default backend is created and tensorflow is imported only if the user switches to the corresponding backend, and then I will repeat the above comparison. Then, I believe our numbers will be competitive even when considering the dry run; it will just be a matter of whether we want to keep the overhead in the dry run or move it to import. @scarrazza let me know if you agree.
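The proposed fix can be sketched as a lazy backend registry (hypothetical names, not Qibo's actual API): heavy frameworks are imported only when the corresponding backend is first requested.

```python
class BackendRegistry:
    """Create backends on first use, so that importing the package does
    not pay for frameworks the user never touches."""

    def __init__(self):
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            if name == "numpy":
                import numpy
                self._cache[name] = numpy
            elif name == "tensorflow":
                # Only imported if the user explicitly switches to it.
                import tensorflow
                self._cache[name] = tensorflow
            else:
                raise KeyError(f"unknown backend {name!r}")
        return self._cache[name]

registry = BackendRegistry()
backend = registry.get("numpy")  # tensorflow is never imported here
print(backend.__name__)
```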
@stavros11 thank you very much for these numbers. Indeed, as expected, tf slows down the initialization a lot. So yes, please go ahead with the tf import cleanup.
it will just be a matter of whether we want to keep the overhead in dry run or move it to import. @scarrazza let me know if you agree.
I think the tracing idea has some advantages, in particular the fact that simulation performance is reproducible without a dry run stage. I have a recollection that if you set the signature of numba functions explicitly, the function is compiled/loaded at import time automatically, but I am not sure this is the case with the latest numba version.
Here are some results on GPU after the import time fix. I checked, and the latest version of qiskit-aer seems to give correct results on GPU.
In the bar plots three different times (import + creation + simulation/dry run) are stacked on top of each other, so the total bar refers to the total time the user will get when running a script. There is no normalization, so in the 26 qubit plot I removed "bc" as the corresponding bars were much taller, hiding the rest of the circuits.
@stavros11 thanks for these plots. I am surprised by the difference between dry run and simulation, if I remember our initial tests, the cupy compilation was quite fast, say in the millisecond range, while here we see an extra 0.5s. Maybe I am missing something?
@stavros11 thanks for these plots. I am surprised by the difference between dry run and simulation, if I remember our initial tests, the cupy compilation was quite fast, say in the millisecond range, while here we see an extra 0.5s. Maybe I am missing something?
I believe this difference has always existed. For example, if you look at the original benchmarks from when we first implemented the cupy custom operators here, there is a difference of at least 0.5sec between dry run and execution.
Given that we have OneQubitGate and TwoQubitGate circuits, shall we add an equivalent circuit with multi-qubit gates?
Adds a template and script for benchmarking external quantum simulation libraries other than Qibo (fixes #10). We should cover at least the libraries included in the HyQuas benchmark paper. Here is a list of required libraries:
Python:
These benchmarks can be executed using the new compare.py script, and the library is selected using the --library flag.

The supported libraries are defined under benchmarks/libaries and the goal is to support all circuits included in the Qibo benchmark for all libraries. This works by defining every circuit using OpenQASM and then building each library's circuit from this. This is straightforward for libraries that have built-in Qasm loaders, such as Qiskit and Qibo, while for the rest (e.g. Qulacs) I use the Qasm parser we have in Qibo, modified to add the gates from the corresponding library. All circuits we have here can be written in the Qasm format we support in Qibo, except perhaps QAOA, which contains some RZZ gates that we do not have built-in in Qibo.

Next steps for this PR:
(--library qibo we should have --library qibojit, etc.)

Note: I noticed that Qibo's U2 and U3 gates follow a different parameter convention compared to Qiskit and other libraries. For example, check our docs vs Qiskit's docs. This should not affect performance, which is what we mainly care about here, but it may confuse users that use these gates for other applications, as it will change results. The main issue is that, for example, parsing u3(0.1,0.2,0.3) q[0]; from Qasm will create a different gate in Qibo than in Qiskit (and others). I guess Qiskit should be the reference for such conventions given that Qasm is developed by IBM.
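For concreteness, here is the u3 matrix in the OpenQASM 2 / Qiskit convention (a hedged sketch: other libraries may differ in parameter ordering or by a global phase, which is exactly the mismatch described above; `u3_qiskit` is a hypothetical helper name):

```python
import numpy as np

def u3_qiskit(theta, phi, lam):
    # u3 in the OpenQASM 2 / Qiskit convention; other conventions may add a
    # global phase or reorder the parameters.
    return np.array([
        [np.cos(theta / 2), -np.exp(1j * lam) * np.sin(theta / 2)],
        [np.exp(1j * phi) * np.sin(theta / 2),
         np.exp(1j * (phi + lam)) * np.cos(theta / 2)],
    ])

gate = u3_qiskit(0.1, 0.2, 0.3)
print(np.allclose(gate.conj().T @ gate, np.eye(2)))  # True: the matrix is unitary
```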