Some possible approaches are presented in:
I would like to have a go at this over the next few days. I am planning to start with an "experimental" implementation based on the communication protocol of qHiPSTER, mostly to get familiar with how multiprocessing works in TensorFlow. If everything goes well, we can then try other, more memory-efficient protocols.
Looking at the VegasFlow code it seems that we can use `tf.device` to execute code on the appropriate device. There are two questions related to that. Assume that we have partitioned an n-qubit state into two (n-1)-qubit states `state0` and `state1` saved on two GPUs.

Any gate that does not involve qubit 0 can be applied without communication as:
```python
with tf.device('/gpu:0'):
    state0 = gate(state0)
with tf.device('/gpu:1'):
    state1 = gate(state1)
```
Does this run in parallel on both devices?
A gate that involves qubit 0 requires communication. Is

```python
with tf.device('/gpu:0'):
    state0 = gate(state0, state1)
```

the proper way to do this, or should we first transfer `state1` to `gpu:0` manually somehow?
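If an explicit transfer turns out to be necessary, a minimal sketch (reusing the hypothetical two-argument `gate` from above) would be to copy the second piece to the first device before combining:

```python
with tf.device('/gpu:0'):
    # tf.identity under the gpu:0 scope forces a copy of state1 onto this device,
    # so the two-piece gate only touches buffers that live on gpu:0.
    state1_local = tf.identity(state1)
    state0 = gate(state0, state1_local)
```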
The qHiPSTER approach seems a reasonable starting point. At some point I was also thinking about porting basic linear algebra operations to multi-GPU, i.e. implementing `einsum` or `matmul` for larger matrices.
Concerning your points:
1. This can be parallelized with `joblib`. For example, we can allocate a pool and launch the graphs with:
```python
@tf.function
def run_job(input_state, device):
    with tf.device(device):
        state = gate(input_state)
    return state

pool = joblib.Parallel(n_jobs=2, prefer="threads")
config = zip([state0, state1], ['/gpu:0', '/gpu:1'])
result = pool(joblib.delayed(run_job)(state, device) for state, device in config)

with tf.device('/cpu:0'):
    final = gate(result[0], result[1])
```
Other techniques are possible, however they are experimental and designed for high level ML training, see https://www.tensorflow.org/guide/distributed_training.
2. Yes, but I believe in this context you may need to perform this operation on CPU, due to memory limitations.
I have played a little bit with device manipulation for large objects:

- With `tf.device('gpu')`, tf copies the object to CPU RAM.
- With `tf.device('gpu:1')`, the program tries not to populate the CPU RAM during the transfer.
- Memory can be released with `del` and by setting the object to `None`.
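As a minimal sketch of the kind of experiment described above (the tensor size is illustrative, and whether the copy is staged through CPU RAM is exactly the behaviour being probed, not something TensorFlow guarantees):

```python
import tensorflow as tf

# ~4GB complex128 tensor allocated on the first GPU.
with tf.device('/gpu:0'):
    big = tf.zeros((2 ** 28,), dtype=tf.complex128)

# Copy it to the second GPU and watch htop / nvidia-smi to see whether
# the transfer populates CPU RAM.
with tf.device('/gpu:1'):
    big_gpu1 = tf.identity(big)

# Release the original buffer on gpu:0 by dropping the Python reference.
del big
```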
I have also been playing around a bit with this and although I like the idea of avoiding manual state manipulation, I cannot find a way to implement this in practice. Even if we attempt to do a multi-GPU einsum implementation as a third einsum backend, this should split the state internally and call `tf.einsum` on the pieces, something like:
```python
def distributed_einsum(state, gate):
    einsum_str = ...  # calculate appropriate indices
    with tf.device("/GPU:0"):
        state0 = tf.einsum(einsum_str, state[0], gate)
    with tf.device("/GPU:1"):
        state1 = tf.einsum(einsum_str, state[1], gate)
```
or the same parallelized with `joblib` as above. This approach seems to work even when `state` does not fit in the memory of a single GPU. E.g. for 27 qubits, `tf.einsum(..., state, gate)` runs out of memory but `tf.einsum(..., state[0], gate)` works.
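As a rough size estimate of why this helps (assuming complex128, i.e. 16 bytes per amplitude, and ignoring any temporary buffers TensorFlow allocates on top):

```python
# Raw state-vector sizes in GiB for complex128 (16 bytes per amplitude).
nqubits = 27
full_size_gib = 2 ** nqubits * 16 / 2 ** 30   # 2.0 GiB for the full state
half_size_gib = full_size_gib / 2             # 1.0 GiB for each (n-1)-qubit piece
```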
The next question is whether the above function should return `state` or `[state0, state1]`. If we want to keep our current code it should return the full `state`, which means that we need an additional stitching step: `state = tf.concat([state0[tf.newaxis], state1[tf.newaxis]], axis=0)`. There are several issues with this concatenation, which suggests returning `[state0, state1]` instead. The test I did so far is generating a random state and applying H to all qubits. For 27 qubits this takes ~5sec without the `tf.concat` step and 20sec with `tf.concat`. Note however that this test is biased because it does not require changing the "global" qubit a lot. I plan to implement a fair comparison to check whether the difference remains that high.
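For concreteness, a minimal sketch of the split/stitch round trip, assuming the state tensor is shaped `(2,) * nqubits` so that the first axis plays the role of the "global" qubit:

```python
# Split into the two (n-1)-qubit pieces along the global qubit axis.
state0, state1 = state[0], state[1]

# ... apply gates to state0 / state1 on their respective devices ...

# Stitch the pieces back into the full state (the expensive extra step).
state = tf.concat([state0[tf.newaxis], state1[tf.newaxis]], axis=0)
```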
Indeed that is the main point we have to decide. Concerning your test, how long does the same calculation take on CPU only? I believe this extra 20s is quite acceptable when comparing to CPU.
Now in terms of strategy, I think that we can start from the assumption that the CPU system has enough memory to hold the full state. In fact, if the goal of the computation is to deliver a big state array, then there isn't much we can do, except writing to disk if the CPU RAM is too small or printing to screen by streaming from the GPU.
Here are the results of some benchmarks for 27 qubits, applying a single H to every qubit. I use 27 because it is the lowest number for which a single GPU runs out of memory (single-GPU einsum works for 26 in this case). In the double-GPU setup we can split 27 into two halves and check if the corresponding manipulations cause any memory errors. In all cases I use two runs because Tensorflow tends to be slower on the first one.
Script type | Time 1 (sec) | Time 2 (sec) |
---|---|---|
single_device CPU | 80.0839 | 78.6606 |
single_device GPU | Failed | |
double_device_full (casting on CPU) | 24.8310 | 23.2966 |
double_device_full (casting on GPU) | Failed | |
double_device_partial | 6.3848 | 4.2230 |
double_device_many_swaps | 19.9396 | 18.1040 |
What the names mean:

1. `single_device`: Simple `tf.einsum` implementation on a single device.
2. `double_device_full`: The distributed einsum version with the two GPUs parallelized using `joblib`. The state is split and concatenated back at each step, so the input/output is the full `state` vector. The concatenation part has to happen on CPU.
3. `double_device_partial`: The same distributed einsum, but now the input/output is a list of the two pieces `[state0, state1]` and there is no concatenation step. Note that in order to define `state0` and `state1` we need to identify a "global" qubit on which it is not possible to apply gates. In order to apply gates to this qubit we have to swap to another "global". For the example of Hadamards only one swap is required.
4. `double_device_many_swaps`: Same as 3, but we introduce one swap before each gate, in order to emulate practical applications where more than one swap will be required. The 1 swap/gate used here is the worst-case scenario.

I will invite you to a different repo that contains the code for all these. Another quick note is that the swap step requires some communication between different GPUs. The following simple code
with tf.device("/GPU:0"):
_state0 = tf.expand_dims(tf.gather(state0, 0, axis=new_global_qubit), axis=current_global_qubit)
_state1 = tf.expand_dims(tf.gather(state1, 0, axis=new_global_qubit), axis=current_global_qubit)
return tf.concat([_state0, _state1], axis=current_global_qubit)
works without any memory issues despite mixing states from different devices.
Note that the same table for 26 qubits is:
Script type | Time 1 (sec) | Time 2 (sec) |
---|---|---|
single_device CPU | 37.9692 | 37.3246 |
single_device GPU | 0.8817 | 0.0043 |
double_device_full (casting on CPU) | 12.6484 | 10.6681 |
double_device_full (casting on GPU) | 12.8652 | 11.2851 |
double_device_partial | 3.9784 | 2.0861 |
double_device_many_swaps | 10.2122 | 8.6147 |
Thanks @stavros11, I will have a look later today. Are the numbers you are quoting from dom? If yes, then everything looks good, because its CPU is ultra-high end.
Yes, everything is from dom. I verified that the correct devices are used each time using tf.debugging.set_log_device_placement(True)
and with very few exceptions the log looked as expected. Although I don't have exact numbers, the CPU is noticeably faster compared to AWS CPU.
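For reference, a minimal example of this check (the matmul is just a placeholder workload):

```python
import tensorflow as tf

# Must be enabled before any ops are created; each op then logs the device
# it was assigned to, e.g. ".../device:GPU:1".
tf.debugging.set_log_device_placement(True)

with tf.device('/gpu:1'):
    x = tf.random.uniform((4, 4))
    y = tf.matmul(x, x)  # the placement log should report this op on GPU:1
```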
I repeated the benchmarks without additional processes running on the GPUs and indeed the results are improved. The table for 27 qubits is:
Script type | Time 1 (sec) | Time 2 (sec) |
---|---|---|
single_device CPU | 75.8836 | 75.3998 |
single_device GPU | 0.5017 | 0.3454 |
double_device_full (CPU casting) | 17.1563 | 15.2232 |
double_device_full (GPU casting) | 17.0698 | 16.6294 |
double_device_partial | 3.6225 | 2.2162 |
double_device_many_swaps | 11.2147 | 9.9565 |
Also, since a single GPU can now fit a 27-qubit state, in principle we can run 28 qubits with the dual-GPU setup. I tried to do so and while `double_device_many_swaps` and `double_device_full` work (with 23sec and 32sec respectively), `double_device_partial` fails. I do not really understand this since the code between "many_swaps" and "partial" is almost identical. The only difference is that "many_swaps" calls the swap method more times.
Regarding whether swapping uses CPU memory, I am not 100% sure but I think it does, although the memory management is a bit weird. I wrote a small script that generates two state pieces and performs swaps between them (no other operations). Executing this for 28 qubits reserves about 1.6GB of CPU memory (according to `htop`) from the start, but there are no further changes in memory use as the script runs (for example while it is performing the swaps).
Ok, thanks, so I imagine the CPU memory is used as a sort of streaming device to accelerate the transfer between GPUs. Are the numbers you are quoting using tf.function (for the function called by joblib)?
No, there is no compilation in any of these benchmarks because it is not properly implemented in my code. I plan to check this and also track memory usage more carefully when running the benchmark scripts. These tend to use much more CPU memory than the swap script, however there are also steps that cast the full state on CPU, so I don't know what part is used for this and what for the actual swaps.
I was checking the possibility of using `tf.distribute`; after some tests I believe such an approach is still too experimental and not suited to our problem. However, there is one possibility which may work: could you please try to remove joblib and use something like this:
```python
@tf.function
def distributed_einsum(state, gate):
    einsum_str = ...  # calculate appropriate indices
    with tf.device("/GPU:0"):
        state0 = tf.einsum(einsum_str, state[0], gate)
    with tf.device("/GPU:1"):
        state1 = tf.einsum(einsum_str, state[1], gate)
```
In principle, if the resulting graph has no intersections between devices, then this should work in parallel.
I tried to use this approach, however it does not seem to compile properly. I get an "AutoGraph could not transform..." warning and execution times are similar to eager mode. If I do not use `@tf.function`, I cannot say for sure if the processes run in parallel because times are a bit worse than with `joblib` (e.g. `double_device_many_swaps` takes ~12sec instead of 10). In `nvidia-smi` both GPUs show utilization at the same time though.
Before you wrote, I implemented compiling using `tf.function` on `gpu_job` and then calling `gpu_job` with `joblib`. This approach also gives the same warnings. The graph probably has intersections between devices when we swap the global qubit.
Regarding memory, the state swap between devices is not very stable. When I use a script where I cast the full state on CPU, it automatically reserves 16GB of CPU memory and the swap works even if I do not explicitly use the CPU there. However, if I skip the full-state step and just start with the two pieces `state0`, `state1`, then it reserves only ~1.5GB of CPU memory and the swap fails. The strange thing is that the script fails in an einsum step, not when the swap is performed.
Thanks for the feedback. Tomorrow I will try to have a definitive answer about the necessity of joblib. Now, concerning the warning: on dom tf2.1 was compiled with extra flags; these warnings should disappear as soon as we move to tf2.2.
I created a virtualenv to install Qibo and when I ran `setup.py` this upgraded tensorflow to `2.2.0-rc3`, so I could check using this version. Indeed the compilation warning disappears in this version and times for 27 qubits are:
Script type | Eager 1 (sec) | Eager 2 (sec) | Compiled 1 (sec) | Compiled 2 (sec) |
---|---|---|---|---|
single | 0.558888 | 0.798126 | 0.580431 | 0.798280 |
full (CPU casting) | 46.428436 | 45.898978 | 17.519732 | 15.914517 |
full (GPU casting) | Fails | | 18.622866 | 17.553890 |
partial | 4.035464 | 3.223216 | 3.646084 | 2.012743 |
many_swaps | 13.716857 | 12.966983 | 11.384759 | 9.681156 |
This used the `distributed_einsum` you proposed above, without `joblib`. Note that the eager times are a bit worse than the `joblib` implementation on tf2.1 I reported above.

tf2.2 also seems to be more stable in terms of the memory issues. Now all scripts fail for 28 qubits in eager mode (in contrast to tf2.1 where some of the scripts could run).
Ok, that's pretty good. Yeah, I am pretty sure the tf2.1 from the arch repo has some compilation flag which generates the warning, and indeed 2.2 is much better in terms of memory management.
> Note that eager times are a bit worse than the joblib implementation on tf2.1 I reported above.
This makes sense because joblib was launching the eager execution in parallel, while now, if you remove the `tf.function`, the eager code is executed sequentially.
If I understand correctly, the `partial` approach is the winner for the time being, correct?
From my side, I have checked explicitly that `@tf.function` should be enough to parallelize the computation (if there are absolutely no graph intersections).
I have tested the following simple code:
```python
@tf.function
def run():
    with tf.device('/gpu:1'):
        c = QFT(25)
        a = c()
    with tf.device('/gpu:0'):
        d = QFT(26)
        e = d()
    return a, e
```
I set a different number of qubits in order to get a similar workload for both devices. The results (after the dry run) are:

Setup | Eager | Compiled |
---|---|---|
RTX (both QFT) | 22s | 20s |
TITANV (both QFT) | 11s | 10s |
RTX+TITANV | 16s | 6s |
The last row is clear: in eager mode the time for both devices is the average of the single-device times, while compilation parallelizes the graph evaluation.
I also did a comparison between `joblib` and the above simple approach and there is not much difference; in particular the compiled case is equivalent. The numbers are different from above because the GPU was running other processes at the same time.
Script type | Eager | Compiled | Joblib Eager | Joblib Compiled |
---|---|---|---|---|
full | Failed | 20.338768 | Failed | 18.872773 |
partial | 4.923760 | 3.730968 | 4.734492 | 3.805022 |
many_swaps | 16.747087 | 13.257831 | 12.813699 | 14.053948 |
> If I understand correctly, the partial approach is the winner for the time being, correct?
Yes, the partial script always has the shortest time, however I don't expect this to be true in actual applications, which will need more swaps. A more realistic expectation would be something between partial and many_swaps. My biggest concern though has to do with memory. Most of these approaches (excluding a compiled case) failed for 28 qubits, which we cannot run on a single GPU. In this sense we did not gain anything in terms of number of qubits from the multi-GPU setup.
Thanks for the further tests, so at least we can confirm the technology to deal with multi-GPU.
> In this sense we did not gain anything in terms of number of qubits from the multi-GPU setup.
Not sure I follow your point, if we split a state over different gpus we could increase the number of qubits, what is the connection with the 28 failure?
> Not sure I follow your point, if we split a state over different gpus we could increase the number of qubits, what is the connection with the 28 failure?
Exactly, so for the 28-qubit example we can split the state across two GPUs. However, although a single GPU fits a 27-qubit state, when we do

```python
with tf.device("/GPU:0"):
    state0 = tf.einsum(einsum_str, state0, gate)
```

where `state0` is half of a 28-qubit state, this might fail depending on the rest of the script (and I don't fully understand when/why it fails).
In summary, I believe there are several directions (by relevance):
Thank you for listing the important points. For the next few days I think it is good if we focus on number 1 and move on to 2 only when we have an acceptable implementation based on CPU RAM. I think 3 may or may not come up as we are working on 1. Note however that streaming batches to GPUs may make a `partial`-like approach impossible if the same GPU is to be used twice in the same calculation, but there is no other way to unlock the number of qubits.
My plan is to try implementing a toy version of 1 based on the benchmark scripts I already have. In its most basic form, this should be equivalent to the `full` script that I already have, but generalized to an arbitrary number of devices (not necessarily distinct). With a bit more thinking it might be possible to avoid recasting the state pieces on the CPU after every gate, which should improve performance to the `many_swaps`/`partial` levels. Once I have this working I will try to do QFT benchmarks up to 31-32 qubits (reusing one/two GPUs) and compare with CPU only.
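As a very rough sketch of what I have in mind for 1 (the helper name and the `(ndevices, 2, ..., 2)` state layout are hypothetical, not the actual implementation):

```python
import tensorflow as tf

def apply_gate_distributed(state, gate_matrix, einsum_str, devices):
    """Apply one gate to a state stored as (ndevices, 2, ..., 2) in CPU RAM."""
    pieces = tf.unstack(state, axis=0)
    new_pieces = []
    # `devices` may repeat entries, e.g. ['/GPU:0', '/GPU:1', '/GPU:0', '/GPU:1'],
    # which corresponds to reusing each GPU for more than one piece.
    for piece, device in zip(pieces, devices):
        with tf.device(device):
            new_pieces.append(tf.einsum(einsum_str, piece, gate_matrix))
    # Stitch the pieces back on CPU, where the full state fits.
    with tf.device('/CPU:0'):
        return tf.stack(new_pieces, axis=0)
```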
Let me know if you have any other suggestions.
I fully agree with your plan. Point 1 is the priority and the target goal for the time being.
Short progress update:
I extended the `full` script, which keeps the state on CPU and uses the GPUs for calculation, to allow reusing the GPUs in order to unlock the number of qubits (code in the `many_devices` branch of my other repo).
I redid the Hadamards benchmark with GPU memory limited to 2GB (to have shorter runs and not interfere with running processes). For 2GB, a single GPU can run up to 24 qubits, so two GPUs are required for 25, four for 26, etc. Indeed the multi-GPU setup seems better than using CPU only:
nqubits | CPU (sec) | multi-GPU (sec) |
---|---|---|
24 | 9.837 | 0.171 (single GPU) |
25 | 21.215 | 8.024 (two GPUs) |
26 | 45.314 | 14.327 (four GPUs) |
where four GPUs = reusing the two GPUs twice each. The numbers are for complex128, compiled and without `joblib` for the GPUs, measuring the second run.
My main issue before moving to a more practical benchmark (e.g. QFT) is that the approach fails for more qubits; e.g. if I try to do 27 qubits / 8 GPUs (reusing each GPU four times), it runs out of memory. So, even though in principle this approach should work with any `nqubits`, this is not the case in practice...
Amazing results, I imagine the performance will be even more striking when the GPUs are idle. Maybe this memory issue is somehow correlated with #62; it looks like the factor 5 is also there, right?
Following #62 here, I have to add that the CPU measurements are with `tf.einsum`, so perhaps using Cirq, which is among the best in terms of CPU performance, may be a fairer comparison.

Regarding the factor 5, I cannot say for sure if it appears here because it was measured on CPU runs. What I can do is play with the memory limiter and see when it fails. For 25 qubits (= 512MB complex128 state vector) on a single GPU it works with the limiter set to 2561MB but fails at 2560MB. From this it looks like the factor 5 appears on GPU as well.
Together with the memory limiter you could also use tf.config.experimental.set_memory_growth(gpu, False)
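For reference, a sketch of the memory-limiter setup behind these thresholds (the 2560MB value is the failing threshold from the test above; this must run before any tensors are placed on the GPU):

```python
import tensorflow as tf

# Cap the first GPU at 2560MB so out-of-memory failures appear at a
# reproducible threshold.
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2560)])

# Alternatively, tf.config.experimental.set_memory_growth(gpus[0], False)
# keeps the default pre-allocation behaviour without a hard cap.
```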
Here are some timings I did on the free GPUs without memory limiter:
nqubits | 28 | 29 | 30 |
---|---|---|---|
CPU | --- | --- | --- |
Cirq (complex64) | 38.1186 | 73.1375 | 150.2487 |
Qibo tf2.2.0rc3 | 62.6856 | 128.2284 | 263.0199 |
Multi-GPU tf2.2.0rc3 | --- | --- | --- |
eager | 127.3192 (x2) | 306.2534 (x16) | |
eager (best case) | 16.8838 (x2) | 34.5439 (x16) | |
joblib | 121.7111 (x2) | 299.3323 (x16) | |
joblib (best case) | 15.7139 (x2) | 34.4335 (x16) | |
compiled | 54.04897 (x2) | 174.7214 (x128) | |
compiled (best case) | 11.1155 (x4) | | |
Multi-GPU tf2.1.0 | --- | --- | ---
eager | 51.4750 (x2) | 131.8495 (x8) | 324.4805 (x32) |
eager (best case) | 11.4547 (x4) | 24.83086 (x8) | |
joblib | 46.8891 (x2) | 124.6499 (x8) | 309.9851 (x32) |
joblib (best case) | 11.1185 (x4) | 23.6628 (x16) | |
compiled | 37.1277 (x2) | 124.2385 (x16) | 319.8445 (x64) |
compiled (best case) | 10.58042 (x4) | 23.8628 (x32) |
@stavros11 thanks for these numbers.
As a next step, I propose the following:
Here is an example of a tf + multi-GPU setup: https://arxiv.org/abs/2002.12921