stavros11 closed this issue 2 years ago.
Here are some numbers using the compare.py script for Qibo (default qibojit backend), qiskit and qulacs, all on the qibo machine CPU. Note that unlike in our first paper, Qiskit uses all threads and its performance is particularly good. I confirmed in several cases that the correct wavefunction is returned, so the simulation is not skipped. I am not sure if they do some kind of circuit simplification to achieve that performance.
@scarrazza you can confirm Qiskit's performance by running something simple, e.g. a QFT for 30 qubits: python compare.py --nqubits 30 --circuit qft --library qiskit. On the qibo machine this takes 37sec with Qiskit, 50sec with Qibo and 80sec with Qulacs.
EDIT: Added qibotf times.
Thanks for these numbers, do you have similar numbers for qibotf? For some circuits, like hs and qv, the difference is too large; are you sure that qiskit is using the CPU instead of the GPU? What is the average total program execution time? Maybe qiskit is precomputing objects during the circuit definition. Is the final state vector the same for all backends?
Btw, how many threads is qiskit using? It might be possible that this value is different from our default; e.g. limiting the number of threads might have an impact.
Btw 2, is qiskit really using double precision? If I set qibo to single, I get numbers which are quite close to qiskit's...
Thanks for the response and the questions. Some quick answers:
Thanks for these numbers, do you have similar numbers for qibotf?
I added the possibility to use qibotf in the same script in the latest push; I will update the above tables once I have the numbers. I don't expect much difference from qibojit, and it will certainly not be much closer to Qiskit.
For some circuits like hs and qv the difference is too large, are you sure that qiskit is using CPU instead of GPU?
I haven't checked htop explicitly during all benchmarks, but all the Qiskit runs I checked used the CPU. I think Qiskit only uses the GPU when the appropriate simulator is used. I also used export CUDA_VISIBLE_DEVICES="" before running all these benchmarks.
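For reference, hiding the GPUs this way is just the standard CUDA environment variable; any CUDA-aware backend then sees no devices and falls back to CPU:

```shell
# With CUDA_VISIBLE_DEVICES set to an empty string, GPU-capable libraries
# see no CUDA devices and fall back to their CPU code path.
export CUDA_VISIBLE_DEVICES=""
echo "CUDA_VISIBLE_DEVICES='${CUDA_VISIBLE_DEVICES}'"
```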
What is the average total program execution time, maybe qiskit is precomputing objects during the circuit definition?
The benchmark script logs the circuit creation time too, which in this case corresponds to transforming the OpenQASM circuit into the library circuit. Here are the numbers from the above benchmarks:
Indeed, Qiskit has slightly higher creation time in all cases but still wins when considering the sum of creation + execution.
Is the final state vector the same for all backends?
This is exactly what is tested in the new test_libraries.py for all circuits, except qv due to the U3 convention issue. I will try to do a check using the benchmark script too, but from a quick look it seems that Qiskit returns the expected states; that's why I wrote that I don't think something strange like skipping the simulation happens.
Btw, how many threads is qiskit using? It might be possible that this value is different from our default, e.g. limiting the number of threads might have an impact.
Qiskit and Qulacs use all available threads while Qibo uses half of them. This may cause some of the difference but I don't think it explains all of it. In past Qibo benchmarks, using all threads made minimal difference in performance.
Btw 2, is qiskit really using double precision? If I set qibo to single, I get numbers which are quite close to qiskit...
I am not sure exactly what happens during simulation, but if I check result.dtype on the state returned by Qiskit I get complex128. Also, according to their docs, double precision is used by default.
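As a side note on the precision question: a factor of roughly 2x from single precision is consistent with the statevector simply being half the size, since gate kernels are largely memory-bound. A numpy-only sketch (not Qiskit's internals):

```python
# Memory footprint of a statevector at the two precisions: the complex64
# state is exactly half the size of the complex128 one, so memory-bound
# gate kernels can run roughly twice as fast on it.
import numpy as np

nqubits = 20
state_single = np.zeros(2**nqubits, dtype=np.complex64)
state_double = np.zeros(2**nqubits, dtype=np.complex128)

print(state_single.nbytes)  # 8 bytes per amplitude
print(state_double.nbytes)  # 16 bytes per amplitude
```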
@stavros11 thanks for the comments. I have tested and indeed qiskit is 2x faster when using single precision. Starting from the QFT, if I keep only the first layer of H gates, qiskit is 1s faster than qibo. At this point we should revisit each gate; if single gates have similar performance, then I agree that some extra parallelization is performed by qiskit.
In particular, if I yield just 1 Hadamard, the qibo performance is better than qiskit's; however, as soon as I include 5 Hadamards, one per qubit, the qiskit performance is better, so this sounds like circuit fusion/block parallelization.
Following their docs I think this latest version of qiskit:
Last comment about that: if I set self.simulator.set_options(fusion_enable=False) and use all threads in qibojit, I get almost the same performance for qiskit and qibo. So it is the fusion that accelerates the computation.
We should look into that and check if qibo can support it.
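For intuition, the effect of fusion can be sketched in plain numpy (a toy illustration, not Qiskit's actual implementation): consecutive small gate matrices are multiplied together first, so the full statevector is traversed once instead of once per gate.

```python
import numpy as np

# Two single-qubit gates acting on a one-qubit state.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
RZ = np.diag([1.0, np.exp(1j * 0.3)])          # a phase rotation

state = np.array([1.0 + 0j, 0.0 + 0j])

# Unfused: two sweeps over the state, one per gate.
unfused = RZ @ (H @ state)

# Fused: combine the 2x2 gate matrices first, then a single sweep.
fused = (RZ @ H) @ state

print(np.allclose(unfused, fused))  # True: same result, fewer state sweeps
```

For large statevectors the matrix-matrix product of the small gates is negligible, while each avoided sweep over the 2^n amplitudes is a real saving.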
Last comment about that, if I set self.simulator.set_options(fusion_enable=False) and use all threads in qibojit, I get almost the same performance for qiskit and qibo. So, it is the fusion that accelerates the computation.
I have been doing the benchmark using the same option and can confirm that performance is the same as Qibo's. Here are the results for all circuits:
So results are pretty much similar, with the exception of dry run times for small qubit numbers. I am not sure if this can be improved if we disable parallelization for nqubits < 14, as Qiskit does by default.
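The cutoff idea could look like the following dispatch sketch (the threshold of 14 mirrors the Qiskit default mentioned above; `apply_elementwise` and its kernels are hypothetical stand-ins, not Qibo code):

```python
from concurrent.futures import ThreadPoolExecutor

def apply_elementwise(state, update, nqubits, parallel_threshold=14):
    # Hypothetical sketch: below the threshold the thread overhead
    # outweighs the work, so run serially; above it, split the sweep
    # over the state across threads.
    if nqubits < parallel_threshold:
        return [update(x) for x in state]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(update, state))

# Same result either way; only the execution strategy changes.
small = apply_elementwise([1, 2, 3], lambda x: x * 2, nqubits=4)
large = apply_elementwise([1, 2, 3], lambda x: x * 2, nqubits=20)
print(small, large)  # [2, 4, 6] [2, 4, 6]
```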
We should look into that and check if qibo can support it.
I agree we should revisit gate fusion in Qibo, and if performance is improved so much for the most common circuits we could consider making it the default, with some cut-off in the number of qubits. We should open an issue about that in Qibo.
By the way, I added the option to use Qiskit without fusion in the benchmark script (via --library qiskit-nofusion) and also GPU support (--library qiskit-gpu and --library qulacs-gpu). I noticed that when using Qiskit GPU the final state returned is wrong: tests do not pass, and if I print the final state from the benchmark it is different from the other backends (including Qiskit CPU).
I'm not yet sure if this is a bug with Qiskit or a problem in our code but will investigate it further (just noting it in case you try to run something in the meantime).
@stavros11 thank you very much for these numbers and confirmation. I agree concerning fusion and the possibility to set threads automatically, as you have posted in the issue. I will try the new GPU implementations tomorrow.
@stavros11 2 points:
Quick response before I take off for Abu Dhabi:
- which tests are failing for you with qiskit-gpu?
I was checking this thoroughly yesterday and, interestingly, the problem exists only on my local machine. I tried both the DGX and the qibo machine and qiskit-gpu works well there. On my machine I get errors even when using simple qiskit circuits, without all the benchmark code we have here. I'll give a simple script later. I'm not sure if it is related to the CUDA version or something is wrong in my configuration. I followed the same installation procedure everywhere (just pip install qiskit-aer-gpu).
So I believe the code here is okay to try GPU benchmarks as it is. We just need to expand by adding QCGPU and HyQuas.
- the qiskit-gpu performance does not change with the fusion_enable flag, does this happen also for you?
I haven’t checked how fusion affects GPU yet.
@stavros11, tests are passing on my pc; however, if I print the result during the dry run and the simulation run (like a manual --transfer with nrep=1) I get:
[0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j]
[benchmarks|INFO|2021-08-16 21:43:11]: dry_run_transfer_time: 0.0006430149078369141
[0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j]
[1.+0.j 0.+0.j 0.+0.j ... 0.+0.j 0.+0.j 0.+0.j]
[benchmarks|INFO|2021-08-16 21:42:39]: dry_run_transfer_time: 0.0003075599670410156
[1. +0.j 0.0625+0.j 0.0625+0.j ... 0. +0.j 0. +0.j 0. +0.j]
Does this happen for you?
@stavros11, tests are passing on my pc
Note that the tests that are uploaded on GitHub do not test the GPU backends. In order to test these you have to include "qiskit-gpu" and "qulacs-gpu" in the LIBRARIES list in conftest.py.
- for qibojit CPU/GPU and qiskit CPU I get sensible results:
[0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j] [benchmarks|INFO|2021-08-16 21:43:11]: dry_run_transfer_time: 0.0006430149078369141 [0.00012207+0.j 0.00012207+0.j 0.00012207+0.j ... 0.00012207+0.j 0.00012207+0.j 0.00012207+0.j]
- however for qiskit-gpu, the performance is quite strange (~4x faster than qibojit), and I get these wrong prints:
[1.+0.j 0.+0.j 0.+0.j ... 0.+0.j 0.+0.j 0.+0.j] [benchmarks|INFO|2021-08-16 21:42:39]: dry_run_transfer_time: 0.0003075599670410156 [1. +0.j 0.0625+0.j 0.0625+0.j ... 0. +0.j 0. +0.j 0. +0.j]
Does this happen for you?
Yes, I observe some strange behavior from qiskit-gpu on all machines. If I add "qiskit-gpu" to the tests, they fail on my machine but pass on the Qibo machine. However, when I print the state during the benchmark as in your example, I get wrong results on all machines. Also, the final state changes if I run the same script more than once, even though nothing random is involved.
Here is a simple script that reproduces these issues:
```python
import qiskit
from qiskit.providers.aer import StatevectorSimulator


def main(nqubits, nreps, gpu, transpile):
    for _ in range(nreps):
        circuit = qiskit.QuantumCircuit(nqubits)
        for i in range(nqubits):
            circuit.h(i)
        if gpu:
            simulator = StatevectorSimulator(method="statevector_gpu")
        else:
            simulator = StatevectorSimulator()
        if transpile:
            circuit = qiskit.transpile(circuit, simulator)
        print("nqubits:", nqubits)
        print("nreps:", nreps)
        print("gpu:", gpu)
        print("transpile:", transpile)
        result = simulator.run(circuit).result()
        print(result.get_statevector(circuit))
        print()
```
@scarrazza, if you try to run this with gpu = True and nreps > 1, it is very likely that you will get different states between repetitions even though the same circuit is simulated. If you run the same script more than once you may also get different states in each run. Currently on the qibo machine the problem appears only when nqubits >= 10; however, on my local machine I get it even for two qubits.
@stavros11 I confirm all your points. I was monitoring the GPU usage on different systems while running pytest and I realized that only on the qibomachine it doesn't seem to use any GPU during the tests, so maybe it is falling back to the CPU (I think qiskit provides some get_device method to check whether the backend is using the CPU or the GPU).
Did you try the qft using qiskit.*.library.QFT directly?
Did you try the qft using qiskit.*.library.QFT directly?
If I replace the circuit creation with qiskit.circuit.library.QFT in the above script, the problem remains for GPU. Note that for the built-in QFT I have to use the transpile option, otherwise I get a different error when attempting to get the statevector on both CPU and GPU:
qiskit.exceptions.QiskitError: 'Data for experiment "QFT" could not be found.'
@stavros11 I just monitored the pytest performance on test_libraries for 5, 10, 15, 26 qubits. Tests are failing for 15 and 26; for these tests I can see high GPU usage and low CPU usage, however for <= 10 the CPU usage is very high and the GPU usage is low. So I assume they have some fallback mechanism which selects the appropriate hardware.
As discussed today, let me suggest completing the other libraries listed in the first post, and making a final decision afterwards.
@stavros11 concerning qiskit, I have opened this issue https://github.com/Qiskit/qiskit-aer/issues/1319, and they have proposed a fix in this PR https://github.com/Qiskit/qiskit-aer/pull/1325. So it is a qiskit bug.
@stavros11 I have installed the aer master locally and indeed the GPU problem is fixed. On the other hand, their performance is about 2x slower than qibojit's.
@scarrazza here are some plots using the circuits and libraries we have so far for CPU:
It seems that creation time is the main bottleneck for some libraries and circuits. This is the time required to convert the circuit from Qasm to the library's format. For Qulacs I do this conversion manually, as I could not find a qasm parser in their docs, while for Qiskit I am using QuantumCircuit.from_qasm_str. Anyway, this time is logged separately from the simulation and dry run times, so we may choose not to include it in the plots if we wish, even though it will appear when simulating in practice as the circuit needs to be created.
Other than that, I will try to run some single precision benchmarks with Qibo, Cirq, Qiskit and TFQ because I could not find how to switch TFQ to double and also some GPU benchmarks with Qibo, QCGPU, Qulacs and Qiskit (if their GPU simulator is fixed). Let me know what other configurations and plots would be interesting.
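For illustration, the manual Qasm conversion route can be sketched with a toy parser for a tiny gate subset (the real parser in Qibo handles far more; `parse_simple_qasm` is a hypothetical name):

```python
# Toy parser for a tiny OpenQASM subset: it extracts (gate, qubits) tuples,
# which could then be mapped to any library's gate constructors.
def parse_simple_qasm(qasm):
    gates = []
    for raw in qasm.strip().splitlines():
        line = raw.strip().rstrip(";")
        if line.startswith("h "):
            gates.append(("h", int(line.split("[")[1].rstrip("]"))))
        elif line.startswith("cx "):
            qubits = [int(t.split("[")[1].rstrip("]")) for t in line[3:].split(",")]
            gates.append(("cx", qubits[0], qubits[1]))
    return gates

qasm = """
h q[0];
cx q[0],q[1];
"""
print(parse_simple_qasm(qasm))  # [('h', 0), ('cx', 0, 1)]
```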
Cool, thanks for these interesting results.
I think we should have a look at the dry run; I have the suspicion that our initialization overhead is not 100% due to *jit, but maybe due to the object allocations (gate matrices, etc...).
I think we should have a look at the dry run; I have the suspicion that our initialization overhead is not 100% due to *jit, but maybe due to the object allocations (gate matrices, etc...).
That is a good point and makes sense because some elements such as gate matrices are allocated during the first execution (which is the dry run) and cached for subsequent runs. However I tried executing the benchmark by recreating a new circuit object before every execution (dry run and simulation) and the difference between dry run and simulation remains. Here are some numbers:
Thanks for checking; this sounds like some for-loop overhead. At some point, after completing the codes/libs for this exercise, one should go step by step, profile the function calls and identify where we lose performance.
Thanks for checking; this sounds like some for-loop overhead. At some point, after completing the codes/libs for this exercise, one should go step by step, profile the function calls and identify where we lose performance.
I am not sure if this helps, but I tried profiling the benchmark script using cProfile and I noticed that the difference between the logged dry run time and simulation time is similar to the cumulative time of numba's Dispatcher.compile, which is logged in the profiling result file. So I tried profiling for multiple qubit numbers and circuit configurations and it appears that there is some kind of agreement:
Here dry run and simulation are logged by the benchmark script, while the numba compile is read from the cProfile output as the cumulative time of the Dispatcher.compile calls. It appears that this function explains a large part of the dry run overhead. The only thing that I cannot really explain is the negative difference that appears in two cases, for 25 qubits qft and variational.
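The profiling setup is essentially the standard cProfile/pstats pattern, shown here on a stand-in function rather than the actual numba kernels:

```python
import cProfile
import io
import pstats

def kernel(n):
    # Stand-in for a simulation kernel; in the real benchmark this is where
    # numba's Dispatcher.compile shows up on the first call.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
kernel(100_000)
profiler.disable()

# Dump cumulative times per function, sorted, as read off for the tables above.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```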
Here are some single precision plots including tfq:
Here dry run and simulation are logged by the benchmark script, while the numba compile is read from the cProfile output as the cumulative time of the Dispatcher.compile calls. It appears that this function explains a large part of the dry run overhead. The only thing that I cannot really explain is the negative difference that appears in two cases for 25 qubits qft and variational.
Thanks @stavros11, could you please rerun one of these examples after removing the cache=True flag? I find it quite strange that we have ~0.23s for loading; maybe this cache flag is not working?
Concerning the plots, do you understand why qiskit dry run is faster than simulation for 28 qubits qft?
Concerning the plots, do you understand why qiskit dry run is faster than simulation for 28 qubits qft?
I believe for qiskit there is no big difference between dry run and simulation for high number of qubits, here are the absolute numbers used in these plots for qiskit:
Note that in the bar plots I am normalizing all times with respect to qibo, that's why qibo is always 1.
Thanks @stavros11, could you please rerun one of these examples after removing the cache=True flag? I find it quite strange that we have ~0.23s for loading; maybe this cache flag is not working?
Thanks for this proposal. I removed the cache=True flag from all numba kernels and repeated the same measurement:
So the cache option seems to work, as it reduces the compilation time from 2sec to 0.2sec. Also, the observation that (dry run) - (simulation) - (numba compile) is almost 0, which we saw above, still holds when not using the cache.
Following our discussion, I started producing some plots that compare the times required for each part separately, as well as a total script (= import + circuit creation + dry run) comparison. During this I noticed that Qibo's import time is significantly longer than the other libraries', so I tabulated the import time for various configurations on the Qibo machine:
library | import time (sec) |
---|---|
qulacs | 0.00479 |
qiskit | 0.64582 |
cirq | 1.07991 |
qibo (numpy) | 0.27475 |
qibo (numpy + qibojit) | 0.48673 |
qibo (numpy + qibojit + tensorflow) | 1.41568 |
qibo (numpy + qibojit + tensorflow + qibotf) | 1.66606 |
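Import times like the ones above can be measured along these lines (stdlib-only sketch; `import_time` is a hypothetical helper, and the result is only meaningful for modules not already loaded in the process):

```python
import importlib
import time

def import_time(module_name):
    # Time a fresh import; modules already in sys.modules return almost
    # instantly, so run this in a fresh interpreter per library.
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

print(f"json: {import_time('json'):.5f} sec")
```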
It appears that our import time is very long because during import qibo we initialize all backends that are available. So if tensorflow is installed we will import it even if we don't use it. If we fix this, the above numbers show that our import time with qibojit as default will fall to about 0.15sec below qiskit's import time, which will counterbalance the 0.2sec loss from the dry run.
I will open a PR in Qibo that updates the backend initialization procedure so that only the default backend is created and tensorflow is imported only if the user switches to the corresponding backend, and then I will repeat the above comparison. Then, I believe our numbers will be competitive even when considering the dry run; it will just be a matter of whether we want to keep the overhead in the dry run or move it to import. @scarrazza let me know if you agree.
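The proposed fix can be sketched as a lazy backend registry (hypothetical names, not Qibo's actual API): heavy frameworks are imported only when the corresponding backend is first requested.

```python
class BackendRegistry:
    """Create backends on first use, so that importing the package does
    not pay for frameworks the user never touches."""

    def __init__(self):
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            if name == "numpy":
                import numpy
                self._cache[name] = numpy
            elif name == "tensorflow":
                # Only imported if the user explicitly switches to it.
                import tensorflow
                self._cache[name] = tensorflow
            else:
                raise KeyError(f"unknown backend {name!r}")
        return self._cache[name]

registry = BackendRegistry()
backend = registry.get("numpy")  # tensorflow is never imported here
print(backend.__name__)
```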
@stavros11 thank you very much for these numbers. Indeed, as expected, tf slows down the initialization a lot. So yes, please go ahead with the tf import cleanup.
it will just be a matter of whether we want to keep the overhead in dry run or move it to import. @scarrazza let me know if you agree.
I think the tracing idea has some advantages, in particular the fact that simulation performance is reproducible without a dry run stage. I have a recollection that if you set the signature of numba functions explicitly, the function is compiled/loaded at import time automatically, but I am not sure this is the case with the latest numba version.
Here are some results on GPU after the import time fix. I checked, and the latest version of qiskit-aer seems to give correct results on GPU.
In the bar plots three different times (import + creation + simulation/dry run) are stacked on top of each other, so the total bar refers to the total time the user will get when running a script. There is no normalization, so in the 26 qubit plot I removed "bc" as the corresponding bars were much taller, hiding the rest of the circuits.
@stavros11 thanks for these plots. I am surprised by the difference between dry run and simulation, if I remember our initial tests, the cupy compilation was quite fast, say in the millisecond range, while here we see an extra 0.5s. Maybe I am missing something?
@stavros11 thanks for these plots. I am surprised by the difference between dry run and simulation, if I remember our initial tests, the cupy compilation was quite fast, say in the millisecond range, while here we see an extra 0.5s. Maybe I am missing something?
I believe this difference has always existed. For example, if you look at the original benchmarks from when we first implemented the cupy custom operators here, there is a difference of at least 0.5sec between dry run and execution.
Given that we have OneQubitGate and TwoQubitGate circuits, shall we add an equivalent circuit with multi-qubit gates?
Adds a template and script for benchmarking external quantum simulation libraries other than Qibo (fixes #10). We should cover at least the libraries included in the HyQuas benchmark paper. Here is a list of required libraries:
Python:
These benchmarks can be executed using the new compare.py script, and the library is selected using the --library flag.

The supported libraries are defined under benchmarks/libaries and the goal is to support all circuits included in the Qibo benchmark for all libraries. This works by defining every circuit using OpenQASM and then building each library's circuit from this. This is straightforward for libraries that have built-in Qasm loaders, such as Qiskit and Qibo, while for the rest (e.g. Qulacs) I use the Qasm parser we have in Qibo, modified to add the gates from the corresponding library. All circuits we have here can be written in the Qasm format we support in Qibo, except perhaps QAOA, which contains some RZZ gates that we do not have built-in in Qibo.

Next steps for this PR:
(--library qibo we should have --library qibojit, etc.)

Note: I noticed that Qibo's U2 and U3 gates follow a different parameter convention compared to Qiskit and other libraries. For example, check our docs vs Qiskit's docs. This should not affect performance, which is what we mainly care about here, but it may confuse users that use these gates for other applications, as it will change results. The main issue is that, for example, parsing u3(0.1,0.2,0.3) q[0]; from Qasm will create a different gate in Qibo than in Qiskit (and others). I guess Qiskit should be the reference for such conventions given that Qasm is developed by IBM.
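For concreteness, here is the u3 matrix in the OpenQASM 2 / Qiskit convention (a hedged sketch: other libraries may differ in parameter ordering or by a global phase, which is exactly the mismatch described above; `u3_qiskit` is a hypothetical helper name):

```python
import numpy as np

def u3_qiskit(theta, phi, lam):
    # u3 in the OpenQASM 2 / Qiskit convention; other conventions may add a
    # global phase or reorder the parameters.
    return np.array([
        [np.cos(theta / 2), -np.exp(1j * lam) * np.sin(theta / 2)],
        [np.exp(1j * phi) * np.sin(theta / 2),
         np.exp(1j * (phi + lam)) * np.cos(theta / 2)],
    ])

gate = u3_qiskit(0.1, 0.2, 0.3)
print(np.allclose(gate.conj().T @ gate, np.eye(2)))  # True: the matrix is unitary
```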