Closed: mlazzarin closed this 2 years ago
@mlazzarin thanks for these tests. The cuQuantum backend is using a single GPU device, correct?
Yes, I'm using a machine with a single NVIDIA RTX A6000. By the way, I'm not sure whether qsim supports multi-GPU.
Ok, thanks. Anyway, quite good to see that we are strong XD.
Here are the results for CPU. For qsim I'm using a number of threads equal to the number of logical cores, while for qibo I kept the default value, which is half of the logical cores. (I also tried with all logical cores and it's actually slower, for small circuits.)
Two comments:

- qsim is usually faster than qibo with large circuits, except for the QFT, while qibo seems competitive with smaller circuits.
- I didn't find a flag to disable gate fusion completely in qsim, so I simply set the `max_fused_gate_size` parameter to 0.

This really sounds like there is circuit fusion; maybe we should try to activate it from qibojit and see what happens.
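For reference, a hedged sketch of how these knobs are typically set through qsim's Python frontend; the field names below are my reading of `qsimcirq.QSimOptions` and should be checked against the installed version:

```python
# Hedged sketch: configuring fusion and threading in qsimcirq
# (field names assumed from qsimcirq.QSimOptions; verify locally).
import multiprocessing

try:
    import qsimcirq

    options = qsimcirq.QSimOptions(
        max_fused_gate_size=0,  # 0 used here in lieu of a "disable fusion" flag
        cpu_threads=multiprocessing.cpu_count(),  # all logical cores
    )
    simulator = qsimcirq.QSimSimulator(qsim_options=options)
except ImportError:
    pass  # qsimcirq not installed; shown for illustration only
```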
Ok, I'm on it.
Here are the results for CPU with gate fusion up to two-qubit gates and using all threads. Indeed, the situation is now different.
I re-ran the GPU benchmarks with gate fusion up to two-qubit gates, and now qibojit seems a bit faster.
Cool, however it would be great to understand if/how they are doing the gate fusion.
With qsim there is an option to set the maximum size of fused gates. In the last benchmarks that I posted I set that value to 2 (which is the default). I've not found a specific flag to disable gate fusion, so in the other benchmarks I simply set that value to 0, but I don't know whether that actually disables fusion or not.
Concerning how they do fusion, their approach is described here: https://arxiv.org/abs/2111.02396
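To make the idea concrete, here is a toy illustration of why fusion helps (not qsim's actual algorithm, which the paper above describes): two adjacent single-qubit gates on the same qubit can be replaced by one gate whose matrix is their product, so the simulator sweeps the state vector once instead of twice.

```python
# Toy illustration of gate fusion: applying U1 then U2 on the same qubit
# is equivalent to applying the single fused gate U2 @ U1, halving the
# number of passes over the state vector for this pair.

def matmul2(a, b):
    """Product of two 2x2 matrices (a @ b)."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]

X = [[0, 1], [1, 0]]   # Pauli-X
Z = [[1, 0], [0, -1]]  # Pauli-Z

# Applying X then Z equals applying the single fused gate Z @ X.
fused = matmul2(Z, X)
print(fused)  # [[0, 1], [-1, 0]]
```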
Ok, so these last plots are comparing like with like, good.
I double-checked and I believe that this implementation is the optimal one, so we may proceed with the review and then merge it into the `library` branch. I have only two comments left:
> `denormals_are_zeros`: if true, set flush-to-zero and denormals-are-zeros
> MXCSR control flags. This prevents rare cases of performance
> slowdown potentially at the cost of a tiny precision loss.
I'm not sure if we should use it in the benchmarks or not.
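For context on what that option trades away, a small stdlib-only demonstration of what a denormal (subnormal) value is; with FTZ/DAZ set in the MXCSR register, such values would instead be treated as exactly zero:

```python
import sys

# Subnormal (denormal) doubles fill the gap between 0 and the smallest
# normal float; hardware often handles them via slower microcode paths.
smallest_normal = sys.float_info.min   # ~2.2e-308, smallest normal double
subnormal = smallest_normal / 2        # still nonzero under IEEE 754

print(subnormal != 0.0)  # Python keeps denormals, so this is True
# With flush-to-zero / denormals-are-zero enabled, this value would be
# flushed to exactly 0.0: a tiny precision loss that avoids the slow path.
```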
I fixed some gates in Cirq; now the CI works fine. Once we fix the tests for the gates, we should review each library to ensure that everything is properly implemented.
Shall we merge this?
Yes, please go ahead and merge this and I will update `randomtests` to use the latest libraries so that we find any issues with gates.
In this PR I added `qsim` (cpu), `qsim-gpu` and `qsim-cuquantum`. For `qsim` (cpu) I set the number of threads to `multiprocessing.cpu_count()`. For all of them, I set the `max_fused_gate_size` to zero. EDIT: For `qibojit`, I disabled the compilation during import.

I also performed some benchmarks (cupy 9.6.0, CUDA toolkit 11.5) for GPU.
- `total_dry_time`: import + creation + dry run
- `total_simulation_time`: import + creation + simulation time

**qft**

![qft_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726342-ad1a9da0-66c8-42ec-b9af-ef782f1fef8e.jpg) ![qft_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726347-683db5d6-2044-4e5d-8bee-b1b8c0951ccc.jpg) ![qft_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726362-fefd9e82-a6a6-4552-9e65-70fb81b5a677.jpg) ![qft_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726364-20dcbf0a-dfff-44af-a81f-24d2bfff6ac0.jpg)

**variational**

![variational_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726601-3484edf1-8f5e-47ec-bb12-01f6cb7a5399.jpg) ![variational_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726602-552a3014-1afb-4046-a6b6-d028d912f5e5.jpg) ![variational_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726606-09ddcb9c-5d45-404d-93a4-4180efc8871d.jpg) ![variational_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726608-53731cfa-3510-46cf-85c1-d65ae42aa619.jpg)

**supremacy**

![supremacy_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726549-7731832a-3c7a-4fe2-b3dc-e23db9bb2275.jpg) ![supremacy_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726556-d754ff09-62c9-4344-85f9-a609480a64bb.jpg) ![supremacy_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726567-a00f18f7-c5ce-4bb6-a0dd-55a0da001edf.jpg) ![supremacy_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726575-bafb322f-fd9b-4c8f-8e68-a4677aa9db16.jpg)

**bv**

![bv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726309-2cf4d03f-aa1b-4422-b8bd-2c40bc9be5f4.jpg) ![bv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726321-2eb9badc-68a4-491f-918b-2be7fa9c8e67.jpg) ![bv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726326-f7416d0d-9ec1-47de-b45e-796c6ca3917a.jpg) ![bv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726330-b3165c13-2f96-4d85-ba8a-8d0e3c234989.jpg)

**qv**

![qv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726485-0c59a33a-3c82-4592-8648-101363e75631.jpg) ![qv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726493-d483a622-2713-4387-8a04-eb05554dce33.jpg) ![qv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726503-e27d1e98-23ba-44c2-8e4a-187db6ff7a59.jpg) ![qv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726521-cdc5053f-9982-4589-90cd-3fd636b48a98.jpg)

Some comments:
EDIT: I will also prepare some benchmarks with CPU.
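The distinction between dry-run time and mean simulation time above can be illustrated with a stdlib-only sketch; the workload below is hypothetical, standing in for a simulator whose first call pays a one-off compilation cost:

```python
import time

def timed(fn):
    """Return (seconds, result) for a single call."""
    start = time.perf_counter()
    result = fn()
    return time.perf_counter() - start, result

# Toy stand-in for a simulation with one-off setup cost: the first call
# pays a "compilation" penalty, later calls do not (hypothetical workload,
# mirroring why dry-run time exceeds the mean simulation time).
_compiled = {}

def simulate():
    if "kernel" not in _compiled:
        time.sleep(0.05)  # pretend JIT compilation on the first call
        _compiled["kernel"] = True
    return sum(i * i for i in range(10_000))

dry_time, _ = timed(simulate)                    # dry run: includes setup
run_times = [timed(simulate)[0] for _ in range(5)]
mean_time = sum(run_times) / len(run_times)      # steady-state average
print(dry_time > mean_time)  # the dry run carries the one-off cost
```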