qiboteam / qibojit-benchmarks

Benchmark code for qibojit performance assessment
Apache License 2.0

Add qsim, qsim-gpu and qsim-cuquantum #14

Closed mlazzarin closed 2 years ago

mlazzarin commented 2 years ago

In this PR I added qsim (CPU), qsim-gpu and qsim-cuquantum. For qsim (CPU) I set the number of threads to `multiprocessing.cpu_count()`. For all of them, I set `max_fused_gate_size` to zero. EDIT: For `qibojit`, I disabled the compilation during import.

I also performed some benchmarks (cupy 9.6.0, cuda toolkit 11.5) for gpu.

qft ![qft_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726342-ad1a9da0-66c8-42ec-b9af-ef782f1fef8e.jpg) ![qft_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726347-683db5d6-2044-4e5d-8bee-b1b8c0951ccc.jpg) ![qft_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726362-fefd9e82-a6a6-4552-9e65-70fb81b5a677.jpg) ![qft_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726364-20dcbf0a-dfff-44af-a81f-24d2bfff6ac0.jpg)
variational ![variational_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726601-3484edf1-8f5e-47ec-bb12-01f6cb7a5399.jpg) ![variational_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726602-552a3014-1afb-4046-a6b6-d028d912f5e5.jpg) ![variational_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726606-09ddcb9c-5d45-404d-93a4-4180efc8871d.jpg) ![variational_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726608-53731cfa-3510-46cf-85c1-d65ae42aa619.jpg)
supremacy ![supremacy_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726549-7731832a-3c7a-4fe2-b3dc-e23db9bb2275.jpg) ![supremacy_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726556-d754ff09-62c9-4344-85f9-a609480a64bb.jpg) ![supremacy_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726567-a00f18f7-c5ce-4bb6-a0dd-55a0da001edf.jpg) ![supremacy_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726575-bafb322f-fd9b-4c8f-8e68-a4677aa9db16.jpg)
bv ![bv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726309-2cf4d03f-aa1b-4422-b8bd-2c40bc9be5f4.jpg) ![bv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726321-2eb9badc-68a4-491f-918b-2be7fa9c8e67.jpg) ![bv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726326-f7416d0d-9ec1-47de-b45e-796c6ca3917a.jpg) ![bv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726330-b3165c13-2f96-4d85-ba8a-8d0e3c234989.jpg)
qv ![qv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726485-0c59a33a-3c82-4592-8648-101363e75631.jpg) ![qv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726493-d483a622-2713-4387-8a04-eb05554dce33.jpg) ![qv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726503-e27d1e98-23ba-44c2-8e4a-187db6ff7a59.jpg) ![qv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726521-cdc5053f-9982-4589-90cd-3fd636b48a98.jpg)

Some comments:

EDIT: I will also prepare some benchmarks with CPU.

scarrazza commented 2 years ago

@mlazzarin thanks for these tests. The cuQuantum is using a single GPU device, correct?

mlazzarin commented 2 years ago

> The cuQuantum is using a single GPU device, correct?

Yes, I'm using the machine with a single NVIDIA RTX A6000. By the way, I'm not sure if qsim supports multi-GPU.

scarrazza commented 2 years ago

Ok, thanks, anyway quite good to see that we are strong XD.

mlazzarin commented 2 years ago

Here are the results for CPU. For qsim I'm using a number of threads equal to the number of logical cores, while for qibo I kept the default value, which is half of the logical cores. (I also tried with all logical cores and it's actually slower for small circuits.)
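As a rough illustration of the thread settings described above, here is a minimal sketch that only computes the two thread counts. The function name `thread_counts` and the dictionary keys are hypothetical and are not taken from the benchmark code; how each library actually receives the value is not shown.

```python
import multiprocessing


def thread_counts(logical_cores=None):
    """Return the thread counts used for the CPU comparison.

    logical_cores: number of logical cores; detected automatically when omitted.
    The helper and its keys are illustrative only (not part of the benchmarks).
    """
    if logical_cores is None:
        logical_cores = multiprocessing.cpu_count()
    return {
        # qsim: one thread per logical core
        "qsim": logical_cores,
        # qibo: default described in the comment above, half of the logical cores
        "qibo": max(1, logical_cores // 2),
    }


if __name__ == "__main__":
    print(thread_counts())
```

On a machine with 16 logical cores this gives 16 threads for qsim and 8 for qibo.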

qft ![qft_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141776204-c5b73770-25e7-4f9d-beff-f8466829b253.jpg) ![qft_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141776215-318b50b4-35d2-434a-8061-8a81b8c489e3.jpg) ![qft_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141776232-963e62de-857b-40bd-89a1-9449730e4f32.jpg) ![qft_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141776244-864155d4-fd88-424e-bb45-6ab3d0f12939.jpg)
variational ![variational_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141776397-91a10d06-e462-41f2-a8af-49be42b7340f.jpg) ![variational_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141776403-70fe9adb-754b-4a77-8dce-d7152710f5a7.jpg) ![variational_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141776437-f6a50539-c19f-49b2-8a26-df8fe09c0b93.jpg) ![variational_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141776438-004e4cec-b87d-46ea-b034-643ceb695c57.jpg)
supremacy ![supremacy_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141776348-e476539a-7bcd-492e-a08f-428c081683e8.jpg) ![supremacy_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141776360-318cfb71-8c3d-4e2a-ba38-816c9b322bca.jpg) ![supremacy_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141776364-5f356fa2-aa2f-47e5-be19-41ec0ba310ef.jpg) ![supremacy_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141776370-a4911d08-2ed3-4235-ae25-2b640e7d5c33.jpg)
bv ![bv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141776136-e929da3b-9c59-48db-8bb2-876352cbf0c6.jpg) ![bv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141776141-ad42c221-5b64-4b73-8ba5-2d83ac7316d9.jpg) ![bv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141776157-4e9d4ae7-ecc2-45f2-9368-187d55e83049.jpg) ![bv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141776181-3d5721ee-52f4-4b53-b8b1-512b20b37802.jpg)
qv ![qv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141776289-9400b16a-24ec-4c72-a32b-39655260c0a1.jpg) ![qv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141776297-bc157e41-56cf-492b-a51d-f2cae5dd9272.jpg) ![qv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141776305-c3c6be72-cd5c-4bba-ae63-7973afecbc91.jpg) ![qv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141776315-6e8855c3-5ffd-48b9-8748-9991238e2cf2.jpg)

Two comments:

scarrazza commented 2 years ago

This sounds really like there is circuit fusion, maybe we should try to activate from qibojit and see what happens.

mlazzarin commented 2 years ago

Ok, I'm on it.

mlazzarin commented 2 years ago

Here are the results for CPU with gate fusion up to two-qubit gates and using all threads. Indeed, the situation is different now.

qft - CPU ![qft_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141986191-eb9820da-5da1-4ac5-b873-07bd475fdc98.jpg) ![qft_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141986198-533589f0-2b0c-428c-8d72-d76122c3b1c8.jpg) ![qft_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141986207-aaa44e1b-b7b4-4d23-a915-8ecdd8db4d24.jpg) ![qft_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141986210-134f9f83-fa48-4f2e-9875-e4c9424cac5e.jpg)
variational - CPU ![variational_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141986317-5c0b9425-f9a2-443b-ae77-79e5139c6d48.jpg) ![variational_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141986322-7a65e187-ad12-48b9-884f-69dfeb521ad8.jpg) ![variational_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141986328-81fd26cd-8ae5-4eef-bab7-99d6f53545ba.jpg) ![variational_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141986330-98ced9b8-ce18-43c1-95da-05b29ec16675.jpg)
supremacy - CPU ![supremacy_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141986288-f90e57ef-6fe7-4a01-bc1f-6222a8975366.jpg) ![supremacy_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141986295-72bb5a19-a98e-4e8e-b9df-d06bf7654ca8.jpg) ![supremacy_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141986300-c632eb9c-54a8-4abf-884e-71074d87b060.jpg) ![supremacy_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141986302-d0320cd2-d40a-4a18-9cb5-2bbb2d921b21.jpg)
bv - CPU ![bv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141986142-e0b2938f-6eb4-4632-83e0-f6ebb8ee76a4.jpg) ![bv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141986149-83b46202-dfda-451b-84e0-438af35a4bd4.jpg) ![bv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141986159-855065fd-d6fc-401f-afa7-eb529f10b13f.jpg) ![bv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141986170-79d9a785-01b3-4aa9-b26d-3d765295b81b.jpg)
qv - CPU ![qv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141986243-3e8eb637-a5b1-4f49-8663-abbe9d82cd1f.jpg) ![qv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141986245-336700de-1e9a-4bbe-a776-df17893893cc.jpg) ![qv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141986253-b7cdfb08-5f98-45d4-86a1-2fc6f2e303f7.jpg) ![qv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141986260-9a8846fd-9603-4452-be8a-40d06639362b.jpg)

I re-ran the GPU benchmarks with gate fusion up to two-qubit gates, and now qibojit seems a bit faster.

qft - GPU ![qft_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141986959-aecf4f37-7795-471d-9d2d-1177a5a4da56.jpg) ![qft_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141986966-1f0eb3fc-133b-4662-bcf3-48adad059969.jpg) ![qft_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141986986-8f1048a9-ff71-4874-b277-b2892f8380ad.jpg) ![qft_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141986992-fba8be98-2a00-4b6d-b051-d6fce1ec6778.jpg)
variational - GPU ![variational_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141987066-e8387b56-b679-4f0c-8b42-867573a783e0.jpg) ![variational_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141987068-23baecef-70d0-4f1e-8822-10e38bddb2ff.jpg) ![variational_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141987069-845190cd-4aef-4a7b-b44d-971186c6418a.jpg) ![variational_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141987073-74cb7497-2da1-485c-b3f2-71ab098f58e0.jpg)
supremacy - GPU ![supremacy_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141987045-381684b9-f8f7-4b51-8e5b-e4d9f9bb080a.jpg) ![supremacy_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141987048-55d357df-0f13-4d1d-934e-1b3dceda4fac.jpg) ![supremacy_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141987051-32b2e04f-d23e-43d6-b5ff-60c903b39756.jpg) ![supremacy_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141987056-c68b3f1c-f352-43ac-82cf-cdb55891b52c.jpg)
bv - GPU ![bv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141986905-8ae0930f-8265-49d4-a3ff-45986ed824f9.jpg) ![bv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141986914-bd0280de-d415-41aa-9449-9c9ca249e350.jpg) ![bv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141986938-3f1ff047-6dae-4c50-842c-e5418d3ab056.jpg) ![bv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141986947-652585c7-b782-4030-b80a-9936894a1bfd.jpg)
qv - GPU ![qv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141987016-0587d8cf-d90c-400b-9f99-6e637ec91b32.jpg) ![qv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141987021-531179d2-2844-4bd6-a5c2-9f6e4036eedf.jpg) ![qv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141987027-aa26a70b-f65f-406a-9e3c-fbba608f6814.jpg) ![qv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141987031-11b7aa34-cf1f-4a58-a757-a976dfc136f4.jpg)

scarrazza commented 2 years ago

Cool, however it would be great to understand if/how they are doing the gate fusion.

mlazzarin commented 2 years ago

> Cool, however would be great to understand if/how they are doing the gate fusion.

With qsim there is an option to set the maximum size of fused gates. In the last benchmarks that I posted I set that value to 2 (which is the default value). I haven't found a specific flag to disable gate fusion, so in the other benchmarks I simply set that value to 0, but I don't know whether that actually disables fusion. Concerning how they do fusion, their approach is described here: https://arxiv.org/abs/2111.02396 .
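For readers unfamiliar with the idea, the simplest form of gate fusion can be sketched in plain Python: consecutive single-qubit gates acting on the same qubit are multiplied into one matrix, so the state is updated once instead of once per gate. This is only a toy illustration under that assumption; it is not the algorithm used by qsim or qibojit, and the helper names (`matmul2`, `apply_gate`, `fuse`) are made up for this example.

```python
def matmul2(a, b):
    """Product a @ b of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def apply_gate(gate, state):
    """Apply a 2x2 gate to a single-qubit state [amp0, amp1]."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


def fuse(gates):
    """Fuse gates (listed in application order) into a single 2x2 matrix."""
    fused = [[1, 0], [0, 1]]
    for gate in gates:
        fused = matmul2(gate, fused)  # later gates multiply from the left
    return fused


s = 2 ** -0.5
H = [[s, s], [s, -s]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

if __name__ == "__main__":
    state = [0.6, 0.8]
    # Unfused: three separate state updates.
    unfused = state
    for gate in (H, X, H):
        unfused = apply_gate(gate, unfused)
    # Fused: one matrix (equal to Z up to rounding), one state update.
    print(unfused, apply_gate(fuse([H, X, H]), state))
```

Fusing up to two-qubit gates follows the same principle with 4x4 matrices, at the cost of extra bookkeeping to decide which gates can be merged.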

scarrazza commented 2 years ago

Ok, so these last plots are comparing like with like, good.

mlazzarin commented 2 years ago

I double-checked and I believe that this implementation is the optimal one, so we may proceed with the review and then merge it to the library branch. I have only two comments left:

mlazzarin commented 2 years ago

I fixed some gates in Cirq, now the CI works fine. Once we fix the tests for the gates, we should review each library to ensure that everything is properly implemented.

mlazzarin commented 2 years ago

Shall we merge this?

stavros11 commented 2 years ago

Yes, please go ahead and merge this and I will update randomtests to use the latest libraries so that we find any issues with gates.