@stavros11 many thanks for these results. I don't think there is a substantial difference between the OMP branch and the master (or say 0.1.1).
From our side, it may be useful to check why Qulacs performs much better when the number of threads is reduced (particularly for a single thread). We already knew from our paper benchmarks that Qulacs is faster for small circuits (<20 qubits); however, I now find that it remains faster for larger circuits as well if we force single-thread execution. This result is new because in our paper we benchmarked Qibo on a single thread but not Qulacs.
> I don't think there is a substantial difference between the OMP branch and the master (or say 0.1.1).
Regarding the multi-threading results (using all threads), all Qibo versions (master/stable/OMP branch) should have identical performance, so I cannot really say what went wrong in Fig. 9, where Qibo appears slower than all the other libraries. I don't think it is related to the Tensorflow installation, because I get performance similar to Qulacs on my local machine, where I installed everything with pip without any special instructions. Both Qulacs and Qibo use double precision and utilize all threads by default, so it is very strange.
Here are some additional results using the master branch that does not have OMP. I am using a circuit of depth=10. For the 24-thread case, I use `OMP_NUM_THREADS` for Qulacs and `taskset` for Qibo.
Thanks for this check. If we use the OMP implementation for this latest setup (24 threads) do we get the same numbers as master?
Some additional Qibo comparisons to answer these questions.
Comparing Qibo master with Qibo OMP. I use `taskset -c 1-24` to restrict Qibo master to using 24 threads without making any changes to the Tensorflow thread configuration.
Comparing using the environment variable vs using taskset to restrict the Qibo OMP threads, that is `export OMP_NUM_THREADS=24 && python main.py` vs `taskset -c 1-24 python main.py`.
Comparing using a custom initial state or not on Qibo master. By "default initial state" I mean calling `circuit()`, while "np.array" is the same initial state but created as a numpy array first, e.g.:

```Python
state = np.zeros(2 ** nqubits, dtype=np.complex128)
state[0] = 1
state = circuit(state)
```
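For reference, here is a self-contained sketch of the two invocations being compared; the circuit below is just a placeholder layer of Hadamards (not the benchmark circuit), so only the calling convention matters:

```Python
import numpy as np
from qibo import gates, models

nqubits = 20  # example size
circuit = models.Circuit(nqubits)
circuit.add(gates.H(q) for q in range(nqubits))  # placeholder gates

# Default initial state: circuit() starts from |00...0>
state_default = circuit()

# Same initial state, but created explicitly as a numpy array
state = np.zeros(2 ** nqubits, dtype=np.complex128)
state[0] = 1
state_custom = circuit(state)
```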
Here are some additional comparisons between Qibo master and Qulacs on the QFT circuit that we used in our paper:
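For anyone reproducing these QFT runs, here is a minimal sketch of how the Qibo side can be timed; it assumes the `qibo.models.QFT` circuit model and an arbitrary example size:

```Python
import time
from qibo import models

nqubits = 25  # example size
circuit = models.QFT(nqubits)

start = time.time()
final_state = circuit()  # simulate the QFT starting from |00...0>
print("QFT simulation time (sec):", time.time() - start)
```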
Here are the benchmarks using the updated DGX environment with the latest versions of Qibo and Qulacs.
Note that the latest Qibo master with the OMP version of custom operators was used, so the results may be slightly different compared to using the latest pip release that doesn't have OMP. The `OMP_NUM_THREADS` environment variable was used to control the number of threads for both Qibo and Qulacs (no taskset in either case).
Hi, I'm a developer of Qulacs; we discussed this at qulacs/qulacs#271. Thanks for the comments on our benchmark.
This weekend, we plan to update our manuscript on arXiv, so we would like to update our benchmark results for Qibo at that time. In my understanding, when the number of threads is limited, Qibo with the other version (the OMP version) shows better performance than the master branch. So we would like to ask whether we can use the OMP version and, if so, how to install it. If we can use it, we would like to update our benchmark with it. Note that our current results and benchmark code are in the following branch: https://github.com/qulacs/benchmark-qulacs/tree/update/qibo_benchmark
@corryvrequan thanks for your message. Please use the latest Qibo 0.1.2 version (available with pip); this version uses OpenMP instead of the default tensorflow thread pool implemented in Qibo 0.1.1, so you can control the number of threads with the `OMP_NUM_THREADS` environment variable or with the `qibo.set_threads()` method (ref. docs).
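For example, a minimal sketch of the two options (the thread count 24 is just an example value; `qibo.set_threads` is the method referenced in the docs):

```Python
# Option 1: set the environment variable before launching the script, e.g.
#   OMP_NUM_THREADS=24 python main.py
# Option 2: set the thread number from Python before executing any circuit:
import qibo

qibo.set_threads(24)  # example value, matching the 24-thread benchmarks above
```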
The expected performance when compared to Qulacs should be similar to the last values quoted by @stavros11 in the post above https://github.com/Quantum-TII/qibo/issues/289#issuecomment-739554998.
Thanks. We have updated the benchmark results of Qibo with ver 0.1.2. Here are the benchmark results in our environment (our CPUs are two Xeon E5-2687W v4 @ 3.00GHz). All the data are also pushed to this branch.
Benchmark times (24 threads, seconds)

|nqubits|Qibo v0.1.1|Qibo v0.1.2|Qulacs|
|:--|:--|:--|:--|
|4|0.00294871001096908|0.0027376540238037705|1.0509975254535675e-05|
|5|0.0036766499979421496|0.0033816860523074865|1.4855992048978806e-05|
|6|0.009322002995759249|0.0040870599914342165|2.3118220269680023e-05|
|7|0.013216937994002365|0.004897958948276937|3.84538434445858e-05|
|8|0.01674796300358139|0.005490804091095924|7.06082209944725e-05|
|9|0.018895703993621282|0.006259039975702763|0.00013793492689728737|
|10|0.02061667099769693|0.0070181930204853415|0.0002818950451910496|
|11|0.023312031000386924|0.007923852070234716|0.0006292872130870819|
|12|0.029612026002723724|0.008946776040829718|0.0030071567744016647|
|13|0.02867704600794241|0.010560372029431164|0.002446949016302824|
|14|0.03236748500785325|0.012549196020700037|0.0028049112297594547|
|15|0.035263252997538075|0.014552418026141822|0.003594842739403248|
|16|0.044671963012660854|0.018974922015331686|0.005314643960446119|
|17|0.06596061500022188|0.02753454993944615|0.008638964034616947|
|18|0.08711693700752221|0.04581278597470373|0.018452315125614405|
|19|0.1314466109906789|0.08231676905415952|0.044635020196437836|
|20|0.24258337399805896|0.15649215097073466|0.07033680565655231|
|21|0.4574658079945948|0.31501268700230867|0.1270608389750123|
|22|1.3137697930069407|0.8413914180127904|0.6339822108857334|
|23|3.6993631730001653|3.5771943610161543|2.7251458917744458|
|24|7.757577718992252|7.636475885985419|5.679187349975109|
|25|16.240989337005885|15.301909440895543|11.548032825812697|
Benchmark times (single thread, seconds)

|nqubits|Qibo v0.1.1|Qibo v0.1.2|Qulacs|
|:--|:--|:--|:--|
|4|0.001868082006694749|0.002185870078392327|1.0704156011343002e-05|
|5|0.0023380700004054233|0.0027288090204820037|1.523829996585846e-05|
|6|0.0027933770033996552|0.003281013108789921|2.267397940158844e-05|
|7|0.003302065990283154|0.003922017989680171|3.820937126874924e-05|
|8|0.003932461011572741|0.004479419905692339|7.163267582654953e-05|
|9|0.00481199000205379|0.005421926965937018|0.0001392900012433529|
|10|0.006057663005776703|0.006833374034613371|0.00028484174981713295|
|11|0.008483532001264393|0.00921010400634259|0.0007405141368508339|
|12|0.01310959599504713|0.013714506989344954|0.0014588776975870132|
|13|0.022640611990937032|0.023018789011985064|0.003120433073490858|
|14|0.041417306012590416|0.040195634006522596|0.006873656064271927|
|15|0.07997396499558818|0.07567521801684052|0.018945712130516768|
|16|0.16005669200967532|0.14965874701738358|0.04034543223679066|
|17|0.329957704001572|0.30495723499916494|0.08554954174906015|
|18|0.6934798519941978|0.6383119520032778|0.18068483378738165|
|19|1.511641694989521|1.398788389051333|0.3808698789216578|
|20|3.3096117110108025|3.066347092972137|0.8014027252793312|
|21|7.4784328359965|6.933786940993741|2.4606924918480217|
|22|16.475088345003314|15.311792865977623|6.603768824134022|
|23|35.12120026400953|32.69817558093928|13.78743761125952|
|24|74.56345413799863|69.34633297298569|28.79572774004191|
|25|157.52755488200637|146.7243079530308|60.05149538908154|
We observed a performance improvement at all numbers of qubits with the Qibo update. I don't know why, but the results of the single-thread benchmark also improved.
On the other hand, we still observe a gap of about 1.3x at n=25 in the 24-thread benchmark, while the gap is negligible in the results shown by @stavros11. Since the gap is larger in the single-thread benchmarks, and since we expect multi-threading with many cores to reduce it, our CPUs may simply not be powerful enough to close the gap.
@corryvrequan thank you very much for looking into this issue and sharing your results. I tried using the pytest-benchmark scripts from the update/qibo_benchmark branch on our DGX machine. I used two separate environments, one based on Python 3.7 (same as your benchmarks) and one on Python 3.8 which is what I used in my last post above. Here are the results:
*Note that when `OMP_NUM_THREADS` is not set, Qibo uses half of the available threads (20 in our case) while Qulacs uses all available threads (40). I do not observe significant performance differences from this.
There is a large performance drop in Qulacs single-thread when going from 3.7 to 3.8. @corryvrequan have you done any tests of using Qulacs with Python 3.8 that show something similar?
> have you done any tests of using Qulacs with Python 3.8 that show something similar?
No, we have performed the pytest benchmark only with Python 3.7. We've checked that Qulacs works with Python 3.8 and passes the tests, but we didn't check its performance. I expected the performance to be the same, since Qulacs is written in C++ and exports its functions to Python with pybind11, so this is unexpected behavior for me.
We checked the performance in our environment by switching between Python 3.7 and 3.8 with pyenv, and found no difference in their performance. However, when we install the qulacs library from PyPI, it shows about a 3x degradation.
For example, at n=23, we observed the following times in the single-thread benchmark:
- Python 3.7, source build: 13.8 sec
- Python 3.7, PyPI install: 34.1 sec
- Python 3.8, source build: 13.5 sec
- Python 3.8, PyPI install: 34.3 sec
I guess this problem happens because the qulacs binary uploaded to PyPI is built in an environment that does not support AVX2. This is possible since, starting with ver 0.2.0, we changed the service that builds and uploads binaries to PyPI from Travis CI to GitHub Actions. We would like to fix this problem as soon as possible.
Note that this difference would disappear in the multi-thread benchmark, since multi-thread performance is limited by memory bandwidth and SIMD optimization does not affect it.
If my guess is correct, I think the difference in your benchmark will disappear when you install qulacs from source: `pip install git+https://github.com/qulacs/qulacs`. This command requires gcc, git, and cmake to be installed.
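As a quick, Linux-only sanity check of the AVX2 hypothesis above (an illustrative sketch, not part of our benchmark scripts), one can verify that the host CPU itself advertises AVX2, so that any slowdown of the PyPI wheel can be attributed to the wheel's build rather than to missing hardware support:

```Python
def host_supports_avx2(cpuinfo_path="/proc/cpuinfo"):
    # The "flags" lines of /proc/cpuinfo list the instruction set extensions
    # (avx2, fma, ...) supported by the CPU.
    with open(cpuinfo_path) as f:
        return any("avx2" in line for line in f if line.startswith("flags"))

print("AVX2 available on this host:", host_supports_avx2())
```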
Anyway, thanks a lot for reporting this problem.
Following the publication from Qulacs, I performed some benchmarks of the circuit they use (Section IV.E). The following (expandable) sections contain the benchmark code I used and the results from the DGX CPU. I used the `OMP_NUM_THREADS` environment variable to control the number of threads, which works both for Qulacs and Qibo (on the OpenMP branch). I have not used any circuit optimizations.

Benchmark script
```Python
import argparse
import os
import time
from datetime import datetime

import numpy as np
import qibo
import qulacs

parser = argparse.ArgumentParser()
parser.add_argument("--nqubits", default=15, type=int)
parser.add_argument("--depth", default=5, type=int)
parser.add_argument("--nreps", default=1, type=int)
parser.add_argument("--backend", default="qibo", type=str)
parser.add_argument("--filename", default=None, type=str)  # not used below


def qulacs_circuit(nqubits, depth):
    """Builds the benchmark circuit (Qulacs paper, Section IV.E) for Qulacs."""
    np.random.seed(123)
    qc = qulacs.QuantumCircuit(nqubits)
    for layer_count in range(depth + 1):
        for index in range(nqubits):
            angle1 = np.random.rand() * np.pi * 2
            angle2 = np.random.rand() * np.pi * 2
            angle3 = np.random.rand() * np.pi * 2
            qc.add_gate(qulacs.gate.RZ(index, angle1))
            qc.add_gate(qulacs.gate.RX(index, angle2))
            qc.add_gate(qulacs.gate.RZ(index, angle3))
        if layer_count == depth:
            break
        for index in range(layer_count % 2, nqubits - 1, 2):
            qc.add_gate(qulacs.gate.CZ(index, index + 1))
    return qc


def qibo_circuit(nqubits, depth):
    """Builds the same circuit for Qibo (note the flipped angle signs relative to the Qulacs version)."""
    np.random.seed(123)
    qc = qibo.models.Circuit(nqubits)
    for layer_count in range(depth + 1):
        for index in range(nqubits):
            angle1 = np.random.rand() * np.pi * 2
            angle2 = np.random.rand() * np.pi * 2
            angle3 = np.random.rand() * np.pi * 2
            qc.add(qibo.gates.RZ(index, -angle1))
            qc.add(qibo.gates.RX(index, -angle2))
            qc.add(qibo.gates.RZ(index, -angle3))
        if layer_count == depth:
            break
        for index in range(layer_count % 2, nqubits - 1, 2):
            qc.add(qibo.gates.CZ(index, index + 1))
    return qc


def main(nqubits, depth, nreps, backend):
    logs = {"date": datetime.now().strftime("%d/%m/%Y %H:%M:%S"),
            "nqubits": nqubits, "depth": depth, "precision": "double",
            "nthreads": os.environ.get("OMP_NUM_THREADS"),
            "backend": backend, "device": qibo.get_device()}

    # Circuit creation time
    start_time = time.time()
    if backend == "qibo":
        circuit = qibo_circuit(nqubits, depth)
    elif backend == "qulacs":
        circuit = qulacs_circuit(nqubits, depth)
    else:
        raise ValueError
    logs["creation_time"] = time.time() - start_time

    # State vector simulation time
    logs["simulation_time"] = []
    for _ in range(nreps):
        start_time = time.time()
        if backend == "qibo":
            state = circuit()
        elif backend == "qulacs":
            state = qulacs.StateVector(nqubits)
            circuit.update_quantum_state(state)
        logs["simulation_time"].append(time.time() - start_time)

    print("\nnqubits:", nqubits)
    print("Depth:", depth)
    print("Backend:", logs["backend"])
    print("Device:", logs["device"])
    print("Threads:", logs["nthreads"])
    print("Creation time:", logs["creation_time"])
    print("Simulation time:", logs["simulation_time"])


if __name__ == "__main__":
    args = parser.parse_args()
    main(args.nqubits, args.depth, args.nreps, args.backend)
```

Benchmark times (40 threads)
nqubits | Qibo (sec) | Qulacs (sec)
-- | -- | --
9 | 0.0135467529296875 | 0.00012760162353515626
10 | 0.017101359367370606 | 0.0002271413803100586
11 | 0.01705753803253174 | 0.00048758983612060545
12 | 0.018274259567260743 | 0.006802892684936524
13 | 0.019514012336730956 | 0.009729647636413574
14 | 0.022945117950439454 | 0.012136435508728028
15 | 0.028107666969299318 | 0.004403305053710937
16 | 0.035989022254943846 | 0.0067020893096923825
17 | 0.045259404182434085 | 0.018850946426391603
18 | 0.07259712219238282 | 0.022517704963684083
19 | 0.11706840991973877 | 0.036876249313354495
20 | 0.20575692653656005 | 0.05715298652648926
21 | 0.4618668556213379 | 0.12476301193237305
22 | 1.3467235565185547 | 1.0517313480377197
23 | 3.680530309677124 | 3.5245158672332764
24 | 8.359740257263184 | 8.195613145828247
25 | 17.498148202896118 | 17.48403835296631
26 | 36.69878268241882 | 36.62548017501831
27 | 76.27025890350342 | 77.02987861633301
28 | 162.7022590637207 | 162.7703776359558
29 | 336.8175663948059 | 341.18127703666687
30 | 718.2868790626526 | 718.561452627182

Benchmark times (10 threads)
nqubits | Qibo (sec) | Qulacs (sec)
-- | -- | --
9 | 0.0053038358688354496 | 0.00011837482452392578
10 | 0.007528448104858398 | 0.0002496004104614258
11 | 0.008059883117675781 | 0.0004878997802734375
12 | 0.010110282897949218 | 0.002512073516845703
13 | 0.012357401847839355 | 0.002159738540649414
14 | 0.01457993984222412 | 0.003020954132080078
15 | 0.02031111717224121 | 0.004488325119018555
16 | 0.030800771713256837 | 0.00736851692199707
17 | 0.05073738098144531 | 0.013488507270812989
18 | 0.09186375141143799 | 0.023573827743530274
19 | 0.17257814407348632 | 0.04287335872650146
20 | 0.347631573677063 | 0.08412048816680909
21 | 0.7447774410247803 | 0.20898890495300293
22 | 1.7739992141723633 | 1.069232702255249
23 | 4.268770456314087 | 3.4069290161132812
24 | 9.217254877090454 | 7.891895771026611
25 | 19.415823936462402 | 16.847090482711792
25 | 19.415823936462402 | 16.866304874420166
25 | 19.284732341766357 | 16.847090482711792
25 | 19.284732341766357 | 16.866304874420166
26 | 41.242594480514526 | 35.75888395309448
27 | 86.13949799537659 | 75.2289686203003
28 | 184.00737118721008 | 159.055438041687
29 | 383.02724742889404 | 333.3895092010498

Benchmark times (1 thread)
nqubits | Qibo (sec) | Qulacs (sec)
-- | -- | --
9 | 0.006098055839538574 | 0.00012655258178710936
10 | 0.007689523696899414 | 0.0002488136291503906
11 | 0.01061232089996338 | 0.000488448143005371
12 | 0.015590453147888183 | 0.0023544788360595702
13 | 0.025315690040588378 | 0.0046138763427734375
14 | 0.041448569297790526 | 0.007953190803527832
15 | 0.07861685752868652 | 0.017885589599609376
16 | 0.15699667930603028 | 0.03993203639984131
17 | 0.3205324411392212 | 0.07798905372619629
18 | 0.6877103328704834 | 0.16630618572235106
19 | 1.463872456550598 | 0.3511500835418701
20 | 3.1644903421401978 | 0.7611939191818238
21 | 6.828137159347534 | 1.7662203311920166
22 | 15.616122245788574 | 5.688528537750244
23 | 33.99449372291565 | 13.873237609863281
24 | 73.18410444259644 | 29.78008246421814
25 | 154.74078154563904 | 63.10087466239929

Indeed, Qulacs is faster when using a single thread; however, when using parallelization our performance is comparable, in disagreement with Fig. 9 from the paper.