stavros11 closed this issue 2 years ago.
Here are some results on the DGX in double precision:
The issues to be resolved from the qibo side are the following:
Despite these issues, it seems that qibojit generally gets better simulation times than qibotf, most likely due to faster CPU-GPU communication.
In order to demonstrate the second issue with parallelization, here are some plots showing the GPU utilization as captured from nvidia-smi at short time intervals (0.05 sec). For qibotf we see that the two GPUs work simultaneously, while for qibojit most operations appear to be applied sequentially.
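The sampling loop described above can be sketched in a few lines of Python. The `nvidia-smi` query flags are standard, but the helper names and the polling wiring are my own illustration:

```python
import subprocess
import time

def parse_utilization(csv_output):
    # Output of --format=csv,noheader,nounits: one integer per GPU, one per line.
    return [int(line) for line in csv_output.splitlines() if line.strip()]

def sample_gpu_utilization(duration=10.0, interval=0.05):
    """Poll nvidia-smi every `interval` seconds and collect per-GPU utilization."""
    samples = []
    end = time.time() + duration
    while time.time() < end:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        samples.append(parse_utilization(out))
        time.sleep(interval)
    return samples
```

Plotting `samples` per GPU against time gives exactly the kind of utilization traces shown in the plots.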
I suppose the joblib configuration is the same between both backends, right? If that is the case, then maybe there is some extra cuda sync which is blocking the operation in qibojit.
> I suppose the joblib configuration is the same between both backends, right?

Yes, the multigpu circuit is defined in qibo and is the same for both backends. This is the only place where joblib is used.
> If that is the case, then maybe there is some extra cuda sync which is blocking the operation in qibojit.

If you mean the `cp.cuda.stream.get_current_stream().synchronize()` call that we do in the cupy backends, I tried removing it from everywhere (all qibo and qibojit tests still pass without the sync), but the multigpu situation remains the same. The OOM issue in the second QFT run also remains.
I explored the memory issue a bit more and found the following. When running

```python
from qibo import models

c = models.QFT(31, accelerators={"/GPU:0": 1, "/GPU:1": 1})
final_state = c().numpy()  # dry run
final_state = c().numpy()  # second run
```

the OOM does not appear. More explicitly, if we define the QFT as:
```python
import math

from qibo import gates
from qibo.models import Circuit

nqubits = 31  # same size as the QFT example above
# for multigpu, pass the same accelerators dict as in the example above
circuit = Circuit(nqubits, accelerators={"/GPU:0": 1, "/GPU:1": 1})
for i1 in range(nqubits):
    circuit.add(gates.H(i1))
    for i2 in range(i1 + 1, nqubits):
        theta = math.pi / 2 ** (i2 - i1)
        circuit.add(gates.CU1(i2, i1, theta))
for i in range(nqubits // 2):
    circuit.add(gates.SWAP(i, nqubits - i - 1))
```
then the OOM / double memory issue appears, while if we define it using the `_DistributedQFT` method from qibo the OOM does not appear. Note that the two circuits are equivalent: only some commuting gates are reordered, and I checked that the final states of both agree, even with a random initial state. I am not sure whether the issue appears with other circuits too.
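The equivalence check mentioned above can be reproduced with plain numpy for small qubit counts. The sketch below is standalone (not qibo code): it applies the gate sequence from the snippet above to a random state and compares against the textbook QFT matrix with the $e^{+2\pi i jk/N}$ convention, qubit 0 being the most significant bit:

```python
import numpy as np

def apply_h(state, q):
    # Apply a Hadamard to qubit q of a state shaped (2,) * nqubits.
    h = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    state = np.tensordot(h, state, axes=([1], [q]))
    return np.moveaxis(state, 0, q)

def apply_cu1(state, q1, q2, theta):
    # CU1 multiplies the amplitude where both qubits are 1 by exp(i * theta).
    state = state.copy()
    idx = [slice(None)] * state.ndim
    idx[q1] = 1
    idx[q2] = 1
    state[tuple(idx)] *= np.exp(1j * theta)
    return state

def qft_state(x, nqubits):
    # Apply the QFT gate sequence from the circuit above to state vector x.
    state = x.reshape((2,) * nqubits).astype(complex)
    for i1 in range(nqubits):
        state = apply_h(state, i1)
        for i2 in range(i1 + 1, nqubits):
            state = apply_cu1(state, i2, i1, np.pi / 2 ** (i2 - i1))
    for i in range(nqubits // 2):  # final SWAP layer
        state = np.swapaxes(state, i, nqubits - i - 1)
    return state.ravel()

nqubits = 3
dim = 2 ** nqubits
rng = np.random.default_rng(42)
x = rng.normal(size=dim) + 1j * rng.normal(size=dim)
x /= np.linalg.norm(x)

# Explicit DFT matrix with the positive-exponent convention of the QFT.
k, j = np.meshgrid(np.arange(dim), np.arange(dim), indexing="ij")
dft = np.exp(2j * np.pi * j * k / dim) / np.sqrt(dim)

assert np.allclose(qft_state(x, nqubits), dft @ x)
```

Any reordering of commuting gates leaves `qft_state` unchanged, so the same comparison works for the `_DistributedQFT` gate order.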
Interesting, so the state vector is not being cleaned between runs. What happens if you force the state vector delete between the dry-run and execution using the code from this repo?
> Interesting, so the state vector is not being cleaned between runs. What happens if you force the state vector delete between the dry-run and execution using the code from this repo?
All the issues I described above are the same regardless of whether I delete the result (the execution output) after each run. I would assume there is a bug: in the second circuit, which creates the problem, a reference to the state remains somewhere, so it is not properly cleaned. But this does not explain why the problem does not appear with qibotf. The problem also remains even if I delete both the result and the whole circuit after each run.
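The stray-reference hypothesis is easy to illustrate in plain Python (the `State` class and `cache` dict below are made up for the illustration): an object is only freed once every reference to it is gone, so a forgotten reference inside the circuit object would keep the state alive across runs no matter how often the user-visible name is deleted.

```python
import gc
import weakref

class State:
    """Stand-in for a large state vector."""

state = State()
probe = weakref.ref(state)     # observe liveness without keeping the object alive
cache = {"last_state": state}  # hypothetical hidden reference, e.g. in the circuit

del state                      # user-visible deletion, as tried between runs
gc.collect()
assert probe() is not None     # still alive: the hidden reference pins it

cache.clear()                  # only dropping the hidden reference frees it
gc.collect()
assert probe() is None
```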
Did you try with `cp._default_memory_pool.free_all_blocks()`?
> Did you try with `cp._default_memory_pool.free_all_blocks()`?
Yes, I also tried that one after object deletion and it does not make a difference.
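For reference, here is a minimal sketch of that cleanup using CuPy's public pool API instead of the private `_default_memory_pool` attribute, guarded so the snippet also runs on machines without CuPy installed:

```python
try:
    import cupy as cp
except ImportError:  # keep the sketch runnable on CPU-only machines
    cp = None

def free_cached_gpu_memory():
    """Return all cached device and pinned-host blocks to the driver."""
    if cp is not None:
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()

free_cached_gpu_memory()
```

Note that this only releases blocks the pool has cached; it cannot free an array that is still referenced somewhere, which is consistent with the observation that it makes no difference here.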
Here are some benchmarks with the latest version of qibojit, after merging the memory duplication fix.
Note that some other jobs are running in parallel on the machine, so there may be some noise and we cannot make a fair comparison with the tables above. I ran qibojit and qibotf sequentially, though, so at least the two can be compared with each other.
There appears to be an issue with the 4x GPU run of qibojit; I will rerun it to confirm whether it is code related or just something temporary with the machine. Other than that, qibojit times, even for the dry run, are competitive with qibotf.
Thanks @stavros11, performance is quite good, despite the unsynchronized initial step.
@scarrazza, I updated the post above with the latest results for three different circuits. I believe qibojit performance is acceptable compared to qibotf, even for the dry run. Perhaps it is possible to improve with a better multigpu approach, but it may be interesting to compare with cuquantum first, once they release.
Looks good to me, can we merge?
> Looks good to me, can we merge?
Yes, this is okay from my side.
@mlazzarin I realized that I had an old implementation of the multigpu benchmarks for qibo in this repo, so I updated it using the latest `libraries` branch to avoid doing double work. The multigpu configuration can be passed in `--library-options` using the existing benchmark scripts. Let me know if you agree. Note that you can "reuse" a single GPU by passing `accelerators=2/GPU:0` (this also works for more than two repetitions, but the number should be a power of 2).

I checked running the above for a few configurations and it seems to execute for both qibojit and qibotf. I am not sure if the results make sense, though. I suspect that parallelization is broken for qibojit when multiple GPUs are used, because with qibotf I see 100% utilization simultaneously on all devices in nvidia-smi, while qibojit seems to run sequentially. I will investigate this further. We also need to add tests here that check the final state for multi-gpu configurations.
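To make the `accelerators=2/GPU:0` syntax concrete, here is a guess at how such an option string could be parsed into the dictionary that qibo expects; the `+` separator for multiple devices and the helper name are assumptions for illustration, not the repo's actual parser:

```python
def parse_accelerators(spec):
    # "2/GPU:0" -> {"/GPU:0": 2}; "1/GPU:0+1/GPU:1" -> {"/GPU:0": 1, "/GPU:1": 1}
    accelerators = {}
    for term in spec.split("+"):
        count, device = term.split("/", 1)
        accelerators["/" + device] = int(count)
    return accelerators

print(parse_accelerators("2/GPU:0"))  # {'/GPU:0': 2}
```

The resulting dict can then be passed as the `accelerators` argument of a qibo circuit, as in the QFT example earlier in the thread.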