stavros11 closed this issue 2 years ago.
Here are some results on the DGX in double precision:
The issues to be resolved from the qibo side are the following:
Despite these issues, it seems that qibojit generally gets better simulation times than qibotf, most likely due to faster CPU-GPU communication.
In order to demonstrate the second issue with parallelization, here are some plots showing the GPU utilization as captured from nvidia-smi at short time intervals (0.05 sec). For qibotf we see that the two GPUs work simultaneously, while for qibojit most operations appear to be applied sequentially.
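The sampling loop described above can be sketched in a few lines of Python. The `nvidia-smi` query flags are standard, but the helper names and the polling wiring are my own illustration:

```python
import subprocess
import time

def parse_utilization(csv_output):
    # Output of --format=csv,noheader,nounits: one integer per GPU, one per line.
    return [int(line) for line in csv_output.splitlines() if line.strip()]

def sample_gpu_utilization(duration=10.0, interval=0.05):
    """Poll nvidia-smi every `interval` seconds and collect per-GPU utilization."""
    samples = []
    end = time.time() + duration
    while time.time() < end:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        samples.append(parse_utilization(out))
        time.sleep(interval)
    return samples
```

Plotting `samples` per GPU against time gives exactly the kind of utilization traces shown in the plots.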
I suppose the joblib configuration is the same between both backends, right? If that is the case, then maybe there is some extra cuda sync which is blocking the operation in qibojit.
> I suppose the joblib configuration is the same between both backends, right?

Yes, the multigpu circuit is defined in qibo and is the same for both backends. This is the only place where joblib is used.
> If that is the case, then maybe there is some extra cuda sync which is blocking the operation in qibojit.

If you mean the `cp.cuda.stream.get_current_stream().synchronize()` call that we do in the cupy backends, I tried removing it from everywhere (all qibo and qibojit tests still pass without the sync), but the multigpu situation remains the same. The OOM issue in the second QFT run also remains.
I explored the memory issue a bit more and found the following. When running

```python
from qibo import models

c = models.QFT(31, accelerators={"/GPU:0": 1, "/GPU:1": 1})
final_state = c().numpy()  # dry run
final_state = c().numpy()  # second run
```

the OOM does not appear. More explicitly, if we define the QFT as:
```python
import math

from qibo import gates
from qibo.models import Circuit

nqubits = 31  # same size as the QFT example above
# for multigpu, pass the same accelerators dict as in the example above
circuit = Circuit(nqubits, accelerators={"/GPU:0": 1, "/GPU:1": 1})
for i1 in range(nqubits):
    circuit.add(gates.H(i1))
    for i2 in range(i1 + 1, nqubits):
        theta = math.pi / 2 ** (i2 - i1)
        circuit.add(gates.CU1(i2, i1, theta))
for i in range(nqubits // 2):
    circuit.add(gates.SWAP(i, nqubits - i - 1))
```
then the OOM / double memory issue appears, while if we define it using the `_DistributedQFT` method from qibo the OOM does not appear. Note that the two circuits are equivalent: only some commuting gates are reordered, and I checked that the final states of both agree, even with a random initial state. I am not sure whether the issue appears with other circuits too.
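The equivalence check mentioned above can be reproduced with plain numpy for small qubit counts. The sketch below is standalone (not qibo code): it applies the gate sequence from the snippet above to a random state and compares against the textbook QFT matrix with the $e^{+2\pi i jk/N}$ convention, qubit 0 being the most significant bit:

```python
import numpy as np

def apply_h(state, q):
    # Apply a Hadamard to qubit q of a state shaped (2,) * nqubits.
    h = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    state = np.tensordot(h, state, axes=([1], [q]))
    return np.moveaxis(state, 0, q)

def apply_cu1(state, q1, q2, theta):
    # CU1 multiplies the amplitude where both qubits are 1 by exp(i * theta).
    state = state.copy()
    idx = [slice(None)] * state.ndim
    idx[q1] = 1
    idx[q2] = 1
    state[tuple(idx)] *= np.exp(1j * theta)
    return state

def qft_state(x, nqubits):
    # Apply the QFT gate sequence from the circuit above to state vector x.
    state = x.reshape((2,) * nqubits).astype(complex)
    for i1 in range(nqubits):
        state = apply_h(state, i1)
        for i2 in range(i1 + 1, nqubits):
            state = apply_cu1(state, i2, i1, np.pi / 2 ** (i2 - i1))
    for i in range(nqubits // 2):  # final SWAP layer
        state = np.swapaxes(state, i, nqubits - i - 1)
    return state.ravel()

nqubits = 3
dim = 2 ** nqubits
rng = np.random.default_rng(42)
x = rng.normal(size=dim) + 1j * rng.normal(size=dim)
x /= np.linalg.norm(x)

# Explicit DFT matrix with the positive-exponent convention of the QFT.
k, j = np.meshgrid(np.arange(dim), np.arange(dim), indexing="ij")
dft = np.exp(2j * np.pi * j * k / dim) / np.sqrt(dim)

assert np.allclose(qft_state(x, nqubits), dft @ x)
```

Any reordering of commuting gates leaves `qft_state` unchanged, so the same comparison works for the `_DistributedQFT` gate order.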
Interesting, so the state vector is not being cleaned between runs. What happens if you force the state vector delete between the dry-run and execution using the code from this repo?
> Interesting, so the state vector is not being cleaned between runs. What happens if you force the state vector delete between the dry-run and execution using the code from this repo?
All the issues I described above are the same regardless of whether I delete the result (the execution output) after each run. I would assume there is a bug: in the second circuit, which creates the problem, a reference to the state remains somewhere, so it is not properly cleaned. But this does not explain why the problem does not appear with qibotf. The problem also remains even if I delete both the result and the whole circuit after each run.
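The stray-reference hypothesis is easy to illustrate in plain Python (the `State` class and `cache` dict below are made up for the illustration): an object is only freed once every reference to it is gone, so a forgotten reference inside the circuit object would keep the state alive across runs no matter how often the user-visible name is deleted.

```python
import gc
import weakref

class State:
    """Stand-in for a large state vector."""

state = State()
probe = weakref.ref(state)     # observe liveness without keeping the object alive
cache = {"last_state": state}  # hypothetical hidden reference, e.g. in the circuit

del state                      # user-visible deletion, as tried between runs
gc.collect()
assert probe() is not None     # still alive: the hidden reference pins it

cache.clear()                  # only dropping the hidden reference frees it
gc.collect()
assert probe() is None
```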
Did you try with `cp._default_memory_pool.free_all_blocks()`?
> Did you try with `cp._default_memory_pool.free_all_blocks()`?
Yes, I also tried that one after object deletion and it does not make a difference.
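For reference, here is a minimal sketch of that cleanup using CuPy's public pool API instead of the private `_default_memory_pool` attribute, guarded so the snippet also runs on machines without CuPy installed:

```python
try:
    import cupy as cp
except ImportError:  # keep the sketch runnable on CPU-only machines
    cp = None

def free_cached_gpu_memory():
    """Return all cached device and pinned-host blocks to the driver."""
    if cp is not None:
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()

free_cached_gpu_memory()
```

Note that this only releases blocks the pool has cached; it cannot free an array that is still referenced somewhere, which is consistent with the observation that it makes no difference here.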
Here are some benchmarks with the latest version of qibojit, after merging the memory duplication fix.
Note that some other jobs are running in parallel on the machine, so there may be some noise and we cannot make a fair comparison with the tables above. I ran qibojit and qibotf sequentially, though, so at least the two can be compared with each other.
There appears to be an issue with the 4x GPU run of qibojit; I will rerun it to confirm whether it is code related or just something temporary with the machine. Other than that, qibojit times, even for the dry run, are competitive with qibotf.
Thanks @stavros11, performance is quite good, despite the unsynchronized initial step.
@scarrazza, I updated the post above with the latest results for three different circuits. I believe qibojit performance is acceptable compared to qibotf, even for the dry run. Perhaps it is possible to improve with a better multigpu approach, but it may be interesting to compare with cuquantum first, once they release.
Looks good to me, can we merge?
> Looks good to me, can we merge?
Yes, this is okay from my side.
@mlazzarin I realized that I had an old implementation of the multigpu benchmarks for qibo in this repo, so I updated it using the latest `libraries` branch to avoid doing double work. The multigpu configuration can be passed in `--library-options` using the existing benchmark scripts. Let me know if you agree. Note that you can "reuse" a single GPU by passing `accelerators=2/GPU:0` (this also works for more than two repetitions, but the number should be a power of 2).

I checked running the above for a few configurations and it seems to execute for both qibojit and qibotf. I am not sure if the results make sense, though. I suspect that parallelization is broken for qibojit when multiple GPUs are used, because with qibotf I see 100% utilization simultaneously on all devices in nvidia-smi, while qibojit seems to run sequentially. I will investigate this further. We also need to add tests here that check the final state for multi-gpu configurations.
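To make the `accelerators=2/GPU:0` syntax concrete, here is a guess at how such an option string could be parsed into the dictionary that qibo expects; the `+` separator for multiple devices and the helper name are assumptions for illustration, not the repo's actual parser:

```python
def parse_accelerators(spec):
    # "2/GPU:0" -> {"/GPU:0": 2}; "1/GPU:0+1/GPU:1" -> {"/GPU:0": 1, "/GPU:1": 1}
    accelerators = {}
    for term in spec.split("+"):
        count, device = term.split("/", 1)
        accelerators["/" + device] = int(count)
    return accelerators

print(parse_accelerators("2/GPU:0"))  # {'/GPU:0': 2}
```

The resulting dict can then be passed as the `accelerators` argument of a qibo circuit, as in the QFT example earlier in the thread.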