@stavros11 many thanks for these results. I don't think there is a substantial difference between the OMP branch and the master (or say 0.1.1).
From our side, it may be useful to check why Qulacs performs much better when the number of threads is reduced (particularly for a single thread). We already knew from our paper benchmarks that Qulacs is faster for small circuits (<20 qubits); however, I now find that it remains faster for larger circuits as well if we force single-thread execution. This result is new because in our paper we benchmarked Qibo on a single thread but not Qulacs.
> I don't think there is a substantial difference between the OMP branch and the master (or say 0.1.1).
Regarding the multi-threading results (using all threads), all Qibo versions (master/stable/OMP branch) should have identical performance, so I cannot really say what went wrong in Fig. 9, where Qibo appears slower than all the other libraries. I don't think it is related to the Tensorflow installation, because I get performance similar to Qulacs on my local machine, where I installed everything with pip without any special instructions. Both Qulacs and Qibo use double precision and utilize all threads by default, so it is very strange.
Here are some additional results using the master branch that does not have OMP. I am using a circuit of depth=10. For the 24-thread case, I use `OMP_NUM_THREADS` for Qulacs and `taskset` for Qibo.
Thanks for this check. If we use the OMP implementation for this latest setup (24 threads) do we get the same numbers as master?
Some additional Qibo comparisons to answer these questions.
Comparing Qibo master with Qibo OMP. I use `taskset -c 1-24` to restrict Qibo master to using 24 threads without making any changes to the Tensorflow thread configuration.
Comparing using the environment variable vs using taskset to restrict the Qibo OMP threads, that is `export OMP_NUM_THREADS=24 && python main.py` vs `taskset -c 1-24 python main.py`.
Comparing using a custom initial state or not on Qibo master. By "default initial state" I mean calling `circuit()`, while "np.array" is the same initial state but created as a numpy array first, e.g.:

```Python
state = np.zeros(2 ** nqubits, dtype=np.complex128)
state[0] = 1
state = circuit(state)
```
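For reference, here is a self-contained sketch of the two invocations being compared; the circuit below is just a placeholder layer of Hadamards (not the benchmark circuit), so only the calling convention matters:

```Python
import numpy as np
from qibo import gates, models

nqubits = 20  # example size
circuit = models.Circuit(nqubits)
circuit.add(gates.H(q) for q in range(nqubits))  # placeholder gates

# Default initial state: circuit() starts from |00...0>
state_default = circuit()

# Same initial state, but created explicitly as a numpy array
state = np.zeros(2 ** nqubits, dtype=np.complex128)
state[0] = 1
state_custom = circuit(state)
```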
Here are some additional comparisons between Qibo master and Qulacs on the QFT circuit that we used in our paper:
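For anyone reproducing these QFT runs, here is a minimal sketch of how the Qibo side can be timed; it assumes the `qibo.models.QFT` circuit model and an arbitrary example size:

```Python
import time
from qibo import models

nqubits = 25  # example size
circuit = models.QFT(nqubits)

start = time.time()
final_state = circuit()  # simulate the QFT starting from |00...0>
print("QFT simulation time (sec):", time.time() - start)
```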
Here are the benchmarks using the updated DGX environment with the latest versions of Qibo and Qulacs.
Note that the latest Qibo master with the OMP version of custom operators was used, so the results may be slightly different compared to using the latest pip release that doesn't have OMP. The `OMP_NUM_THREADS` environment variable was used to control the number of threads for both Qibo and Qulacs (no taskset in either case).
Hi, I'm a developer of Qulacs; we discussed this at qulacs/qulacs#271. Thanks for the comments on our benchmark.
This weekend, we plan to update our manuscript on arXiv, so we would like to update our benchmark results for Qibo at that time. In my understanding, when the number of threads is limited, Qibo with the other version (the OMP version) shows better performance than the master branch. So we would like to ask whether we can use the OMP version and, if so, how to install it. If we can use it, we would like to update our benchmark with it. Note that our current results and benchmark code are in the following branch: https://github.com/qulacs/benchmark-qulacs/tree/update/qibo_benchmark
@corryvrequan thanks for your message. Please use the latest Qibo 0.1.2 version (available with pip); this version uses OpenMP instead of the default tensorflow thread pool implemented in Qibo 0.1.1, so you can control the number of threads with the `OMP_NUM_THREADS` environment variable or with the `qibo.set_threads()` method (ref. docs).
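For example, a minimal sketch of the two options (the thread count 24 is just an example value; `qibo.set_threads` is the method referenced in the docs):

```Python
# Option 1: set the environment variable before launching the script, e.g.
#   OMP_NUM_THREADS=24 python main.py
# Option 2: set the thread number from Python before executing any circuit:
import qibo

qibo.set_threads(24)  # example value, matching the 24-thread benchmarks above
```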
The expected performance when compared to Qulacs should be similar to the last values quoted by @stavros11 in the post above https://github.com/Quantum-TII/qibo/issues/289#issuecomment-739554998.
Thanks. We have updated the benchmark results of Qibo with ver 0.1.2. Here are the benchmark results in our environment (our CPUs are two Xeon E5-2687W v4 @ 3.00GHz). All the data are also pushed to this branch.
Benchmark times (24 threads, seconds)

|nqubits|Qibo v0.1.1|Qibo v0.1.2|Qulacs|
|:--|:--|:--|:--|
|4|0.00294871001096908|0.0027376540238037705|1.0509975254535675e-05|
|5|0.0036766499979421496|0.0033816860523074865|1.4855992048978806e-05|
|6|0.009322002995759249|0.0040870599914342165|2.3118220269680023e-05|
|7|0.013216937994002365|0.004897958948276937|3.84538434445858e-05|
|8|0.01674796300358139|0.005490804091095924|7.06082209944725e-05|
|9|0.018895703993621282|0.006259039975702763|0.00013793492689728737|
|10|0.02061667099769693|0.0070181930204853415|0.0002818950451910496|
|11|0.023312031000386924|0.007923852070234716|0.0006292872130870819|
|12|0.029612026002723724|0.008946776040829718|0.0030071567744016647|
|13|0.02867704600794241|0.010560372029431164|0.002446949016302824|
|14|0.03236748500785325|0.012549196020700037|0.0028049112297594547|
|15|0.035263252997538075|0.014552418026141822|0.003594842739403248|
|16|0.044671963012660854|0.018974922015331686|0.005314643960446119|
|17|0.06596061500022188|0.02753454993944615|0.008638964034616947|
|18|0.08711693700752221|0.04581278597470373|0.018452315125614405|
|19|0.1314466109906789|0.08231676905415952|0.044635020196437836|
|20|0.24258337399805896|0.15649215097073466|0.07033680565655231|
|21|0.4574658079945948|0.31501268700230867|0.1270608389750123|
|22|1.3137697930069407|0.8413914180127904|0.6339822108857334|
|23|3.6993631730001653|3.5771943610161543|2.7251458917744458|
|24|7.757577718992252|7.636475885985419|5.679187349975109|
|25|16.240989337005885|15.301909440895543|11.548032825812697|
Benchmark times (single thread, seconds)

|nqubits|Qibo v0.1.1|Qibo v0.1.2|Qulacs|
|:--|:--|:--|:--|
|4|0.001868082006694749|0.002185870078392327|1.0704156011343002e-05|
|5|0.0023380700004054233|0.0027288090204820037|1.523829996585846e-05|
|6|0.0027933770033996552|0.003281013108789921|2.267397940158844e-05|
|7|0.003302065990283154|0.003922017989680171|3.820937126874924e-05|
|8|0.003932461011572741|0.004479419905692339|7.163267582654953e-05|
|9|0.00481199000205379|0.005421926965937018|0.0001392900012433529|
|10|0.006057663005776703|0.006833374034613371|0.00028484174981713295|
|11|0.008483532001264393|0.00921010400634259|0.0007405141368508339|
|12|0.01310959599504713|0.013714506989344954|0.0014588776975870132|
|13|0.022640611990937032|0.023018789011985064|0.003120433073490858|
|14|0.041417306012590416|0.040195634006522596|0.006873656064271927|
|15|0.07997396499558818|0.07567521801684052|0.018945712130516768|
|16|0.16005669200967532|0.14965874701738358|0.04034543223679066|
|17|0.329957704001572|0.30495723499916494|0.08554954174906015|
|18|0.6934798519941978|0.6383119520032778|0.18068483378738165|
|19|1.511641694989521|1.398788389051333|0.3808698789216578|
|20|3.3096117110108025|3.066347092972137|0.8014027252793312|
|21|7.4784328359965|6.933786940993741|2.4606924918480217|
|22|16.475088345003314|15.311792865977623|6.603768824134022|
|23|35.12120026400953|32.69817558093928|13.78743761125952|
|24|74.56345413799863|69.34633297298569|28.79572774004191|
|25|157.52755488200637|146.7243079530308|60.05149538908154|
We observed a performance improvement at all numbers of qubits with the Qibo update. I don't know why, but the results of the single-thread benchmark also improved.
On the other hand, we still observe a gap of about 1.3x at n=25 in the 24-thread benchmark, while the gap is negligible in the results shown by @stavros11. Since the gap is larger in the single-thread benchmarks, and since we expect multi-threading with many cores to reduce it, our CPUs may simply not be powerful enough to close the gap.
@corryvrequan thank you very much for looking into this issue and sharing your results. I tried using the pytest-benchmark scripts from the update/qibo_benchmark branch on our DGX machine. I used two separate environments, one based on Python 3.7 (same as your benchmarks) and one on Python 3.8 which is what I used in my last post above. Here are the results:
*Note that when `OMP_NUM_THREADS` is not set, Qibo uses half of the available threads (20 in our case) while Qulacs uses all available threads (40). I do not observe significant performance differences from this.
There is a large performance drop in Qulacs single-thread when going from 3.7 to 3.8. @corryvrequan have you done any tests of using Qulacs with Python 3.8 that show something similar?
> have you done any tests of using Qulacs with Python 3.8 that show something similar?
No, we have performed the pytest benchmark only with Python 3.7. We've checked that Qulacs works with Python 3.8 and passes the tests, but we didn't check its performance. I expected the performance to be the same, since Qulacs is written in C++ and exports its functions to Python with pybind11, so this is unexpected behavior for me.
We checked the performance in our environment by switching between Python 3.7 and 3.8 with pyenv, and found no difference in their performance. However, when we install the qulacs library from PyPI, it shows about a 3x degradation.
For example, at n=23, we observed the following times in the single-thread benchmark:
- Python 3.7, source build: 13.8 sec
- Python 3.7, PyPI install: 34.1 sec
- Python 3.8, source build: 13.5 sec
- Python 3.8, PyPI install: 34.3 sec
I guess this problem happens because the qulacs binary uploaded to PyPI is built in an environment that does not support AVX2. This is possible since, starting with ver 0.2.0, we changed the service that builds and uploads binaries to PyPI from Travis CI to GitHub Actions. We would like to fix this problem as soon as possible.
Note that this difference would disappear in the multi-thread benchmark, since multi-thread performance is limited by memory bandwidth and SIMD optimization does not affect it.
If my guess is correct, I think the difference in your benchmark will disappear when you install qulacs from source: `pip install git+https://github.com/qulacs/qulacs`. This command requires gcc, git, and cmake to be installed.
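As a quick, Linux-only sanity check of the AVX2 hypothesis above (an illustrative sketch, not part of our benchmark scripts), one can verify that the host CPU itself advertises AVX2, so that any slowdown of the PyPI wheel can be attributed to the wheel's build rather than to missing hardware support:

```Python
def host_supports_avx2(cpuinfo_path="/proc/cpuinfo"):
    # The "flags" lines of /proc/cpuinfo list the instruction set extensions
    # (avx2, fma, ...) supported by the CPU.
    with open(cpuinfo_path) as f:
        return any("avx2" in line for line in f if line.startswith("flags"))

print("AVX2 available on this host:", host_supports_avx2())
```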
Anyway, thanks a lot for reporting this problem.
Following the publication from Qulacs, I performed some benchmarks of the circuit they use (Section IV.E). The following (expandable) sections contain the benchmark code I used and the results from the DGX CPU. I used the `OMP_NUM_THREADS` environment variable to control the number of threads, which works both for Qulacs and Qibo (on the OpenMP branch). I have not used any circuit optimizations.

Benchmark script
```Python
import argparse
import os
import time
from datetime import datetime

import numpy as np
import qibo
import qulacs

parser = argparse.ArgumentParser()
parser.add_argument("--nqubits", default=15, type=int)
parser.add_argument("--depth", default=5, type=int)
parser.add_argument("--nreps", default=1, type=int)
parser.add_argument("--backend", default="qibo", type=str)
parser.add_argument("--filename", default=None, type=str)  # not used below


def qulacs_circuit(nqubits, depth):
    """Builds the benchmark circuit (Qulacs paper, Section IV.E) for Qulacs."""
    np.random.seed(123)
    qc = qulacs.QuantumCircuit(nqubits)
    for layer_count in range(depth + 1):
        for index in range(nqubits):
            angle1 = np.random.rand() * np.pi * 2
            angle2 = np.random.rand() * np.pi * 2
            angle3 = np.random.rand() * np.pi * 2
            qc.add_gate(qulacs.gate.RZ(index, angle1))
            qc.add_gate(qulacs.gate.RX(index, angle2))
            qc.add_gate(qulacs.gate.RZ(index, angle3))
        if layer_count == depth:
            break
        for index in range(layer_count % 2, nqubits - 1, 2):
            qc.add_gate(qulacs.gate.CZ(index, index + 1))
    return qc


def qibo_circuit(nqubits, depth):
    """Builds the same circuit for Qibo (note the flipped angle signs relative to the Qulacs version)."""
    np.random.seed(123)
    qc = qibo.models.Circuit(nqubits)
    for layer_count in range(depth + 1):
        for index in range(nqubits):
            angle1 = np.random.rand() * np.pi * 2
            angle2 = np.random.rand() * np.pi * 2
            angle3 = np.random.rand() * np.pi * 2
            qc.add(qibo.gates.RZ(index, -angle1))
            qc.add(qibo.gates.RX(index, -angle2))
            qc.add(qibo.gates.RZ(index, -angle3))
        if layer_count == depth:
            break
        for index in range(layer_count % 2, nqubits - 1, 2):
            qc.add(qibo.gates.CZ(index, index + 1))
    return qc


def main(nqubits, depth, nreps, backend):
    logs = {"date": datetime.now().strftime("%d/%m/%Y %H:%M:%S"),
            "nqubits": nqubits, "depth": depth, "precision": "double",
            "nthreads": os.environ.get("OMP_NUM_THREADS"),
            "backend": backend, "device": qibo.get_device()}

    # Circuit creation time
    start_time = time.time()
    if backend == "qibo":
        circuit = qibo_circuit(nqubits, depth)
    elif backend == "qulacs":
        circuit = qulacs_circuit(nqubits, depth)
    else:
        raise ValueError
    logs["creation_time"] = time.time() - start_time

    # State vector simulation time
    logs["simulation_time"] = []
    for _ in range(nreps):
        start_time = time.time()
        if backend == "qibo":
            state = circuit()
        elif backend == "qulacs":
            state = qulacs.StateVector(nqubits)
            circuit.update_quantum_state(state)
        logs["simulation_time"].append(time.time() - start_time)

    print("\nnqubits:", nqubits)
    print("Depth:", depth)
    print("Backend:", logs["backend"])
    print("Device:", logs["device"])
    print("Threads:", logs["nthreads"])
    print("Creation time:", logs["creation_time"])
    print("Simulation time:", logs["simulation_time"])


if __name__ == "__main__":
    args = parser.parse_args()
    main(args.nqubits, args.depth, args.nreps, args.backend)
```

Benchmark times (40 threads)
nqubits | Qibo (sec) | Qulacs (sec)
-- | -- | --
9 | 0.0135467529296875 | 0.00012760162353515626
10 | 0.017101359367370606 | 0.0002271413803100586
11 | 0.01705753803253174 | 0.00048758983612060545
12 | 0.018274259567260743 | 0.006802892684936524
13 | 0.019514012336730956 | 0.009729647636413574
14 | 0.022945117950439454 | 0.012136435508728028
15 | 0.028107666969299318 | 0.004403305053710937
16 | 0.035989022254943846 | 0.0067020893096923825
17 | 0.045259404182434085 | 0.018850946426391603
18 | 0.07259712219238282 | 0.022517704963684083
19 | 0.11706840991973877 | 0.036876249313354495
20 | 0.20575692653656005 | 0.05715298652648926
21 | 0.4618668556213379 | 0.12476301193237305
22 | 1.3467235565185547 | 1.0517313480377197
23 | 3.680530309677124 | 3.5245158672332764
24 | 8.359740257263184 | 8.195613145828247
25 | 17.498148202896118 | 17.48403835296631
26 | 36.69878268241882 | 36.62548017501831
27 | 76.27025890350342 | 77.02987861633301
28 | 162.7022590637207 | 162.7703776359558
29 | 336.8175663948059 | 341.18127703666687
30 | 718.2868790626526 | 718.561452627182

Benchmark times (10 threads)
nqubits | Qibo (sec) | Qulacs (sec)
-- | -- | --
9 | 0.0053038358688354496 | 0.00011837482452392578
10 | 0.007528448104858398 | 0.0002496004104614258
11 | 0.008059883117675781 | 0.0004878997802734375
12 | 0.010110282897949218 | 0.002512073516845703
13 | 0.012357401847839355 | 0.002159738540649414
14 | 0.01457993984222412 | 0.003020954132080078
15 | 0.02031111717224121 | 0.004488325119018555
16 | 0.030800771713256837 | 0.00736851692199707
17 | 0.05073738098144531 | 0.013488507270812989
18 | 0.09186375141143799 | 0.023573827743530274
19 | 0.17257814407348632 | 0.04287335872650146
20 | 0.347631573677063 | 0.08412048816680909
21 | 0.7447774410247803 | 0.20898890495300293
22 | 1.7739992141723633 | 1.069232702255249
23 | 4.268770456314087 | 3.4069290161132812
24 | 9.217254877090454 | 7.891895771026611
25 | 19.415823936462402 | 16.847090482711792
25 | 19.415823936462402 | 16.866304874420166
25 | 19.284732341766357 | 16.847090482711792
25 | 19.284732341766357 | 16.866304874420166
26 | 41.242594480514526 | 35.75888395309448
27 | 86.13949799537659 | 75.2289686203003
28 | 184.00737118721008 | 159.055438041687
29 | 383.02724742889404 | 333.3895092010498

Benchmark times (1 thread)
nqubits | Qibo (sec) | Qulacs (sec)
-- | -- | --
9 | 0.006098055839538574 | 0.00012655258178710936
10 | 0.007689523696899414 | 0.0002488136291503906
11 | 0.01061232089996338 | 0.000488448143005371
12 | 0.015590453147888183 | 0.0023544788360595702
13 | 0.025315690040588378 | 0.0046138763427734375
14 | 0.041448569297790526 | 0.007953190803527832
15 | 0.07861685752868652 | 0.017885589599609376
16 | 0.15699667930603028 | 0.03993203639984131
17 | 0.3205324411392212 | 0.07798905372619629
18 | 0.6877103328704834 | 0.16630618572235106
19 | 1.463872456550598 | 0.3511500835418701
20 | 3.1644903421401978 | 0.7611939191818238
21 | 6.828137159347534 | 1.7662203311920166
22 | 15.616122245788574 | 5.688528537750244
23 | 33.99449372291565 | 13.873237609863281
24 | 73.18410444259644 | 29.78008246421814
25 | 154.74078154563904 | 63.10087466239929

Indeed, Qulacs is faster when using a single thread; however, when using parallelization our performance is comparable, in disagreement with Fig. 9 from the paper.