Closed: mlazzarin closed this 2 years ago
@mlazzarin thanks for these tests. The cuQuantum backend is using a single GPU device, correct?
Yes, I'm using a machine with a single NVIDIA RTX A6000. By the way, I'm not sure whether qsim supports multi-GPU.
Ok, thanks. Anyway, quite good to see that we are strong XD.
Here are the results for CPU. For qsim I'm using a number of threads equal to the number of logical cores, while for qibo I kept the default value, which is half of the logical cores. (I also tried with all logical cores and it's actually slower, for small circuits.)
Two comments:

- qsim is usually faster than qibo with large circuits, except for the QFT, while qibo seems competitive with smaller circuits.
- I didn't find a flag to disable gate fusion completely in qsim, so I simply set the `max_fused_gate_size` parameter to 0.

This really sounds like there is circuit fusion; maybe we should try to activate it from qibojit and see what happens.
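For reference, a hedged sketch of how these knobs are typically set through qsim's Python frontend; the field names below are my reading of `qsimcirq.QSimOptions` and should be checked against the installed version:

```python
# Hedged sketch: configuring fusion and threading in qsimcirq
# (field names assumed from qsimcirq.QSimOptions; verify locally).
import multiprocessing

try:
    import qsimcirq

    options = qsimcirq.QSimOptions(
        max_fused_gate_size=0,  # 0 used here in lieu of a "disable fusion" flag
        cpu_threads=multiprocessing.cpu_count(),  # all logical cores
    )
    simulator = qsimcirq.QSimSimulator(qsim_options=options)
except ImportError:
    pass  # qsimcirq not installed; shown for illustration only
```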
Ok, I'm on it.
Here are the results for CPU with gate fusion up to two-qubit gates and using all threads. Indeed, the situation is now different.
I re-ran the GPU benchmarks with gate fusion up to two-qubit gates, and now qibojit seems a bit faster.
Cool, however it would be great to understand if/how they are doing the gate fusion.
With qsim there is an option to set the maximum size of fused gates. In the last benchmarks that I posted I set that value to 2 (which is the default). I've not found a specific flag to disable gate fusion, so in the other benchmarks I simply set that value to 0, but I don't know whether that actually disables fusion or not.
Concerning how they do fusion, their approach is described here: https://arxiv.org/abs/2111.02396
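To make the idea concrete, here is a toy illustration of why fusion helps (not qsim's actual algorithm, which the paper above describes): two adjacent single-qubit gates on the same qubit can be replaced by one gate whose matrix is their product, so the simulator sweeps the state vector once instead of twice.

```python
# Toy illustration of gate fusion: applying U1 then U2 on the same qubit
# is equivalent to applying the single fused gate U2 @ U1, halving the
# number of passes over the state vector for this pair.

def matmul2(a, b):
    """Product of two 2x2 matrices (a @ b)."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]

X = [[0, 1], [1, 0]]   # Pauli-X
Z = [[1, 0], [0, -1]]  # Pauli-Z

# Applying X then Z equals applying the single fused gate Z @ X.
fused = matmul2(Z, X)
print(fused)  # [[0, 1], [-1, 0]]
```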
Ok, so these last plots are comparing like with like, good.
I double-checked and I believe that this implementation is the optimal one, so we may proceed with the review and then merge it into the `library` branch. I have only two comments left:
> `denormals_are_zeros`: if true, set flush-to-zero and denormals-are-zeros
> MXCSR control flags. This prevents rare cases of performance
> slowdown potentially at the cost of a tiny precision loss.
I'm not sure if we should use it in the benchmarks or not.
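For context on what that option trades away, a small stdlib-only demonstration of what a denormal (subnormal) value is; with FTZ/DAZ set in the MXCSR register, such values would instead be treated as exactly zero:

```python
import sys

# Subnormal (denormal) doubles fill the gap between 0 and the smallest
# normal float; hardware often handles them via slower microcode paths.
smallest_normal = sys.float_info.min   # ~2.2e-308, smallest normal double
subnormal = smallest_normal / 2        # still nonzero under IEEE 754

print(subnormal != 0.0)  # Python keeps denormals, so this is True
# With flush-to-zero / denormals-are-zero enabled, this value would be
# flushed to exactly 0.0: a tiny precision loss that avoids the slow path.
```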
I fixed some gates in Cirq; now the CI works fine. Once we fix the tests for the gates, we should review each library to ensure that everything is properly implemented.
Shall we merge this?
Yes, please go ahead and merge this and I will update `randomtests` to use the latest libraries so that we find any issues with gates.
In this PR I added `qsim` (cpu), `qsim-gpu` and `qsim-cuquantum`. For `qsim` (cpu) I set the number of threads to `multiprocessing.cpu_count()`. For all of them, I set the `max_fused_gate_size` to zero. EDIT: For `qibojit`, I disabled the compilation during import.

I also performed some benchmarks (cupy 9.6.0, CUDA toolkit 11.5) for GPU.
- `total_dry_time`: import + creation + dry run
- `total_simulation_time`: import + creation + simulation time

**qft**

![qft_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726342-ad1a9da0-66c8-42ec-b9af-ef782f1fef8e.jpg) ![qft_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726347-683db5d6-2044-4e5d-8bee-b1b8c0951ccc.jpg) ![qft_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726362-fefd9e82-a6a6-4552-9e65-70fb81b5a677.jpg) ![qft_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726364-20dcbf0a-dfff-44af-a81f-24d2bfff6ac0.jpg)

**variational**

![variational_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726601-3484edf1-8f5e-47ec-bb12-01f6cb7a5399.jpg) ![variational_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726602-552a3014-1afb-4046-a6b6-d028d912f5e5.jpg) ![variational_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726606-09ddcb9c-5d45-404d-93a4-4180efc8871d.jpg) ![variational_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726608-53731cfa-3510-46cf-85c1-d65ae42aa619.jpg)

**supremacy**

![supremacy_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726549-7731832a-3c7a-4fe2-b3dc-e23db9bb2275.jpg) ![supremacy_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726556-d754ff09-62c9-4344-85f9-a609480a64bb.jpg) ![supremacy_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726567-a00f18f7-c5ce-4bb6-a0dd-55a0da001edf.jpg) ![supremacy_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726575-bafb322f-fd9b-4c8f-8e68-a4677aa9db16.jpg)

**bv**

![bv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726309-2cf4d03f-aa1b-4422-b8bd-2c40bc9be5f4.jpg) ![bv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726321-2eb9badc-68a4-491f-918b-2be7fa9c8e67.jpg) ![bv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726326-f7416d0d-9ec1-47de-b45e-796c6ca3917a.jpg) ![bv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726330-b3165c13-2f96-4d85-ba8a-8d0e3c234989.jpg)

**qv**

![qv_scaling_dry_run_time_single](https://user-images.githubusercontent.com/48728634/141726485-0c59a33a-3c82-4592-8648-101363e75631.jpg) ![qv_scaling_simulation_times_mean_single](https://user-images.githubusercontent.com/48728634/141726493-d483a622-2713-4387-8a04-eb05554dce33.jpg) ![qv_scaling_total_dry_time_single](https://user-images.githubusercontent.com/48728634/141726503-e27d1e98-23ba-44c2-8e4a-187db6ff7a59.jpg) ![qv_scaling_total_simulation_time_single](https://user-images.githubusercontent.com/48728634/141726521-cdc5053f-9982-4589-90cd-3fd636b48a98.jpg)

Some comments:
EDIT: I will also prepare some benchmarks with CPU.
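The distinction between dry-run time and mean simulation time above can be illustrated with a stdlib-only sketch; the workload below is hypothetical, standing in for a simulator whose first call pays a one-off compilation cost:

```python
import time

def timed(fn):
    """Return (seconds, result) for a single call."""
    start = time.perf_counter()
    result = fn()
    return time.perf_counter() - start, result

# Toy stand-in for a simulation with one-off setup cost: the first call
# pays a "compilation" penalty, later calls do not (hypothetical workload,
# mirroring why dry-run time exceeds the mean simulation time).
_compiled = {}

def simulate():
    if "kernel" not in _compiled:
        time.sleep(0.05)  # pretend JIT compilation on the first call
        _compiled["kernel"] = True
    return sum(i * i for i in range(10_000))

dry_time, _ = timed(simulate)                    # dry run: includes setup
run_times = [timed(simulate)[0] for _ in range(5)]
mean_time = sum(run_times) / len(run_times)      # steady-state average
print(dry_time > mean_time)  # the dry run carries the one-off cost
```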