qiboteam / qibo

A framework for quantum computing
https://qibo.science
Apache License 2.0

Multi-GPU approach #19

Closed scarrazza closed 4 years ago

scarrazza commented 4 years ago

Here is an example of a tf + multi-GPU setup: https://arxiv.org/abs/2002.12921

scarrazza commented 4 years ago

Some possible approaches are presented in:

stavros11 commented 4 years ago

I would like to have a go at this over the next few days. I am planning to start with an "experimental" implementation based on the communication protocol of qHiPSTER, mostly to get familiar with how multiprocessing works in Tensorflow. If everything goes well, we can then try other more memory-efficient protocols.

Looking at the VegasFlow code, it seems that we can use tf.device to execute code on the appropriate device. There are two questions related to that (a small sketch of the partition follows the two questions). Assume that we have partitioned an n-qubit state into two (n-1)-qubit states state0 and state1 saved on two GPUs.

  1. Any gate that does not involve qubit 0 can be applied without communication as:

    with tf.device('/gpu:0'):
        state0 = gate(state0)
    with tf.device('/gpu:1'):
        state1 = gate(state1)

    Does this run in parallel on both devices?

  2. A gate that involves qubit 0 requires communication. Is

    with tf.device('/gpu:0'):
        state0 = gate(state0, state1)

    the proper way to do this, or shall we first manually transfer state1 to gpu:0 somehow?
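
For illustration, here is a minimal sketch of the partition assumed above, taking qubit 0 as the "global" qubit; the random state is only a placeholder and tf.identity is used to copy each piece to its device:

import tensorflow as tf

n = 10  # toy size; the discussion targets much larger n
shape = (2,) * n  # one axis per qubit, as in the einsum-based backend
state = tf.complex(tf.random.uniform(shape, dtype=tf.float64),
                   tf.random.uniform(shape, dtype=tf.float64))

with tf.device("/GPU:0"):
    state0 = tf.identity(state[0])  # (n-1)-qubit piece with qubit 0 = 0
with tf.device("/GPU:1"):
    state1 = tf.identity(state[1])  # (n-1)-qubit piece with qubit 0 = 1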

scarrazza commented 4 years ago

The qHiPSTER approach seems a reasonable starting point. At some point I was also thinking about porting basic linear algebra operations to multi-GPU, i.e. implementing the einsum or matmul for larger matrices.
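
As a toy illustration of this idea (just a sketch, assuming it is enough to split the rows of the first operand), a matmul distributed over two devices could look like:

import tensorflow as tf

def two_gpu_matmul(a, b):
    half = tf.shape(a)[0] // 2
    with tf.device("/GPU:0"):
        top = tf.matmul(a[:half], b)     # first block of rows on the first GPU
    with tf.device("/GPU:1"):
        bottom = tf.matmul(a[half:], b)  # second block of rows on the second GPU
    with tf.device("/CPU:0"):
        return tf.concat([top, bottom], axis=0)  # stitch the blocks back together on CPU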

Concerning your points:

  1. The code is correct, in the sense that it will be executed on different devices; however, this will not run in parallel in TensorFlow >= 2. So far the easiest approach to parallelization consists in distributing graph evaluations using a multiprocessing library such as joblib. For example, we can allocate a pool and launch the graphs with:
    
    @tf.function
    def run_job(input_state, device):
        with tf.device(device):
            state = gate(input_state)
        return state

    # use 2 devices, 1 per state
    pool = joblib.Parallel(n_jobs=2, prefer="threads")
    config = zip([state0, state1], ['/gpu:0', '/gpu:1'])
    result = pool(joblib.delayed(run_job)(state, device)
                  for state, device in config)

    with tf.device('/cpu:0'):
        final = gate(result[0], result[1])


Other techniques are possible, however they are experimental and designed for high level ML training, see https://www.tensorflow.org/guide/distributed_training.

  2. Yes, but I believe in this context you may need to perform this operation on CPU, due to memory limitations.

scarrazza commented 4 years ago

I have played a little bit with device manipulation for large objects.

stavros11 commented 4 years ago

I have also been playing around a bit with this and, although I like the idea of avoiding manual state manipulation, I cannot find a way to implement this in practice. Even if we attempt a multi-GPU einsum implementation as a third einsum backend, it would have to split the state internally and call tf.einsum on the pieces, something like:

def distributed_einsum(state, gate):
    einsum_str = ...  # calculate appropriate indices
    with tf.device("/GPU:0"):
        state0 = tf.einsum(einsum_str, state[0], gate)
    with tf.device("/GPU:1"):
        state1 = tf.einsum(einsum_str, state[1], gate)

or the same parallelized with joblib as above. This approach seems to work even when the state does not fit in the memory of a single GPU. E.g. for 27 qubits, tf.einsum(..., state, gate) runs out of memory but tf.einsum(..., state[0], gate) works.

The next question is whether the above function should return state or [state0, state1]. If we want to keep our current code it should return the full state, which means that we need an additional stitching step: state = tf.concat([state0[tf.newaxis], state1[tf.newaxis]], axis=0). There are several issues with this:

The test I did so far generates a random state and applies H to all qubits. For 27 qubits this takes ~5 sec without the tf.concat step and ~20 sec with tf.concat. Note however that this test is biased, because it does not require changing the "global" qubit a lot. I plan to implement a fairer comparison to check whether the difference remains that high.

scarrazza commented 4 years ago

Indeed, that is the main point we have to decide. Concerning your test, how long does the same calculation take on CPU only? I believe this extra 20 s is quite acceptable when compared to CPU.

Now, in terms of strategy, I think we can start from the assumption that the CPU system has enough memory to hold the full state. In fact, if the goal of the computation is to deliver a big state array, then there isn't much we can do, except writing to disk if the CPU RAM is too small, or printing to screen by streaming from the GPU.

stavros11 commented 4 years ago

Here are the results of some benchmarks for 27 qubits, applying a single H to every qubit. I use 27 because it is the lowest number for which a single GPU runs out of memory (single-GPU einsum works for 26 in this case). In the double-GPU setup we can split 27 into two halves and check whether the corresponding manipulations cause any memory errors. In all cases I report two runs, because TensorFlow tends to be slower on the first one.

Script type Time 1 (sec) Time 2 (sec)
single_device CPU 80.0839 78.6606
single_device GPU Failed
double_device_full (casting on CPU) 24.8310 23.2966
double_device_full (casting on GPU) Failed
double_device_partial 6.3848 4.2230
double_device_many_swaps 19.9396 18.1040

What the names mean:

  1. single_device: Simple tf.einsum implementation on a single device.
  2. double_device_full: The distributed einsum version with the two GPUs parallelized using joblib. The state is split and concatenated back at each step, so the input/output is the full state vector. The concatenation part has to happen on CPU.
  3. double_device_partial: The same distributed einsum, but now the input/output is a list of the two pieces [state0, state1] and there is no concatenation step. Note that in order to define state0 and state1 we need to identify a "global" qubit on which it is not possible to apply gates. In order to apply gates to this qubit we have to swap to another "global" qubit. For the example of Hadamards only one swap is required.
  4. double_device_many_swaps: Same as 3, but we introduce one swap before each gate, in order to emulate practical applications where more than one swap will be required. The 1 swap/gate used here is the worst-case scenario.

I will invite you to a different repo that contains the code for all these. Another quick note is that the swap step requires some communication between different GPUs. The following simple code

with tf.device("/GPU:0"):
    _state0 = tf.expand_dims(tf.gather(state0, 0, axis=new_global_qubit), axis=current_global_qubit)
    _state1 = tf.expand_dims(tf.gather(state1, 0, axis=new_global_qubit), axis=current_global_qubit)
    return tf.concat([_state0, _state1], axis=current_global_qubit)

works without any memory issues despite mixing states from different devices.
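
For completeness, the same idea extended to a full swap of the global qubit that returns both new pieces (only a sketch, keeping the axis conventions of the snippet above):

import tensorflow as tf

def swap_global_qubit(state0, state1, current_global_qubit, new_global_qubit):
    # New piece with the new global qubit fixed to 0: take that slice from both
    # old pieces and stack them along the (now local) old global qubit axis.
    with tf.device("/GPU:0"):
        a = tf.expand_dims(tf.gather(state0, 0, axis=new_global_qubit), axis=current_global_qubit)
        b = tf.expand_dims(tf.gather(state1, 0, axis=new_global_qubit), axis=current_global_qubit)
        new_state0 = tf.concat([a, b], axis=current_global_qubit)
    # Same for the slice where the new global qubit is 1, kept on the second GPU.
    with tf.device("/GPU:1"):
        c = tf.expand_dims(tf.gather(state0, 1, axis=new_global_qubit), axis=current_global_qubit)
        d = tf.expand_dims(tf.gather(state1, 1, axis=new_global_qubit), axis=current_global_qubit)
        new_state1 = tf.concat([c, d], axis=current_global_qubit)
    return new_state0, new_state1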

stavros11 commented 4 years ago

Note that the same table for 26 qubits is:

Script type Time 1 (sec) Time 2 (sec)
single_device CPU 37.9692 37.3246
single_device GPU 0.8817 0.0043  
double_device_full (casting on CPU) 12.6484 10.6681
double_device_full (casting on GPU) 12.8652 11.2851
double_device_partial 3.9784 2.0861
double_device_many_swaps 10.2122 8.6147

scarrazza commented 4 years ago

Thanks @stavros11, I will have a look later today. Are the numbers you are quoting from dom? If yes, then everything looks good, because its CPU is ultra high end.

stavros11 commented 4 years ago

Yes, everything is from dom. I verified that the correct devices are used each time using tf.debugging.set_log_device_placement(True), and with very few exceptions the log looked as expected. Although I don't have exact numbers, this CPU is noticeably faster compared to the AWS CPU.

stavros11 commented 4 years ago

I repeated the benchmarks without additional processes running on the GPUs and indeed the results are improved. The table for 27 qubits is:

Script type Time 1 (sec) Time 2 (sec)
single_device CPU 75.8836 75.3998
single_device GPU 0.5017 0.3454
double_device_full (CPU casting) 17.1563 15.2232
double_device_full (GPU casting) 17.0698 16.6294
double_device_partial 3.6225 2.2162
double_device_many_swaps 11.2147 9.9565

Also, since a single GPU can now fit a 27-qubit state, in principle we can run 28 qubits with the dual-GPU setup. I tried to do so and, while double_device_many_swaps and double_device_full work (in 23 sec and 32 sec respectively), double_device_partial fails. I do not really understand this, since the code between "many_swaps" and "partial" is almost identical; the only difference is that "many_swaps" calls the swap method more times.

Regarding whether swapping uses CPU memory, I am not 100% sure but I think it does, although the memory management is a bit weird. I wrote a small script that generates two state pieces and performs swaps between them (no other operations). Executing this for 28 qubits reserves about 1.6GB of CPU memory (according to htop) from the start, but there are no further changes in memory use as the script runs (for example while it is performing the swaps).

scarrazza commented 4 years ago

Ok, thanks, so I imagine the CPU memory is used as a sort of streaming device to accelerate the transfer between GPUs. Are the numbers you are quoting using tf.function (for the function called by joblib)?

stavros11 commented 4 years ago

No, there is no compilation in any of these benchmarks because it is not properly implemented in my code. I plan to check this and also track memory usage more carefully when running the benchmark scripts. These tend to use much more CPU memory than the swap script, however there are also steps that cast the full state on CPU, so I don't know what part is used for this and what for the actual swaps.

scarrazza commented 4 years ago

I was checking the possibility of using tf.distribute; after some tests I believe such an approach is still too experimental and not suited to our problem. However, there is one possibility which may work: could you please try to remove joblib and use something like this:

@tf.function
def distributed_einsum(state, gate):
    einsum_str = ...  # calculate appropriate indices
    with tf.device("/GPU:0"):
        state0 = tf.einsum(einsum_str, state[0], gate)
    with tf.device("/GPU:1"):
        state1 = tf.einsum(einsum_str, state[1], gate)

In principle, if the resulting graph has no intersections between devices, then this should work in parallel.

stavros11 commented 4 years ago

I tried this approach, however it does not seem to compile properly: I get an "AutoGraph could not transform..." warning and execution times are similar to eager mode. If I do not use @tf.function, I cannot say for sure whether the processes run in parallel, because times are a bit worse than with joblib (e.g. double_device_many_swaps takes ~12 sec instead of 10). In nvidia-smi both GPUs show utilization at the same time though.

Before you wrote, I had implemented compilation by applying tf.function to gpu_job and then calling gpu_job with joblib. This approach also gives the same warnings. The graph probably has intersections between devices when we swap the global qubit.

Regarding memory, the state swap between devices is not very stable. When I use a script where I cast the full state on CPU, it automatically reserves 16GB of CPU memory and the swap works even if I do not explicitly use the CPU there. However, if I skip the full-state step and just start with the two pieces state0, state1, then it reserves only ~1.5GB of CPU memory and the swap fails. The strange thing is that the script fails in an einsum step, not when the swap is performed.

scarrazza commented 4 years ago

Thanks for the feedback. Tomorrow I will try to have a definitive answer about the necessity of joblib. Now, concerning the warning: on dom tf2.1 was compiled with extra flags, so these warnings will disappear as soon as we move to tf2.2.

stavros11 commented 4 years ago

I created a virtualenv to install Qibo and when I ran setup.py it upgraded tensorflow to 2.2.0-rc3, so I could check using this version. Indeed the compilation warning disappears in this version, and the times for 27 qubits are:

Script type Eager 1 (sec) Eager 2 (sec) Compiled 1 (sec) Compiled 2 (sec)
single 0.558888 0.798126 0.580431 0.798280
full (CPU casting) 46.428436 45.898978 17.519732 15.914517
full (GPU casting) Fails 18.622866 17.553890
partial 4.035464 3.223216 3.646084 2.012743
many_swaps 13.716857 12.966983 11.384759 9.681156

This used the distributed_einsum you proposed above, without joblib. Note that eager times are a bit worse than the joblib implementation on tf2.1 I reported above.

tf2.2 also seems to be more stable in terms of the memory issues. Now all scripts fail for 28 qubits in eager mode (in contrast to tf2.1, where some of the scripts could run).

scarrazza commented 4 years ago

Ok, that's pretty good. Yeah, I am pretty sure the tf2.1 from the arch repo has some compilation flag which generates the warning, and indeed 2.2 is much better in terms of memory management.

Note that eager times are a bit worse than the joblib implementation on tf2.1 I reported above.

This makes sense, because joblib was launching the eager executions in parallel, while now, if you remove tf.function, the eager code is executed sequentially.

If I understand correctly, the partial approach is the winner for the time being, correct?

From my side, I have checked explicitly that @tf.function should be enough to parallelize the computation (if there are absolutely no graph intersections).

I have tested the following simple code:

@tf.function
def run():
    with tf.device('/gpu:1'):
        c = QFT(25)
        a = c()
    with tf.device('/gpu:0'):
        d = QFT(26)
        e = d()
    return a, e

I set a different number of qubits in order to get a similar workload for both devices. The results (after the dry run) are:

Setup eager compiled
RTX (both QFT) 22s 20s
TITANV (both QFT) 11s 10s
RTX+TITANV 16s 6s

The last row is the clear demonstration: in eager mode the mixed setup takes roughly the average of the two single-device times, while compilation parallelizes the graph evaluation.

stavros11 commented 4 years ago

I also did a comparison between joblib and the simple approach above and there is not much difference; in particular the compiled cases are equivalent. The numbers are different from above because the GPUs were running other processes at the same time.

Script type Eager Compiled Joblib Eager Joblib Compiled
full Failed 20.338768 Failed 18.872773
partial 4.923760 3.730968 4.734492 3.805022
many_swaps 16.747087 13.257831 12.813699 14.053948

If I understand correctly, the partial approach is the winner for the time being, correct?

Yes, the partial script always has the shortest time; however, I don't expect this to be true in actual applications, which will need more swaps. A more realistic expectation would be something between partial and many_swaps. My biggest concern though has to do with memory: most of these approaches (excluding a compiled case) failed for 28 qubits, which we cannot run on a single GPU. In this sense we did not gain anything in terms of number of qubits from the multi-GPU setup.

scarrazza commented 4 years ago

Thanks for the further tests, so at least we can confirm the technology for dealing with multi-GPU.

In this sense we did not gain anything in terms of number of qubits from the multi-GPU setup.

Not sure I follow your point: if we split a state over different GPUs we could increase the number of qubits, so what is the connection with the 28-qubit failure?

stavros11 commented 4 years ago

Not sure I follow your point: if we split a state over different GPUs we could increase the number of qubits, so what is the connection with the 28-qubit failure?

Exactly, so for the 28-qubit example we can split the state across two GPUs. However, although a single GPU fits a 27-qubit state, when we do

with tf.device("/GPU:0"):
    state0 = tf.einsum(einsum_str, state0, gate)

where state0 is half of a 28-qubit state, this might fail depending on the rest of the script (and I don't fully understand when/why it fails).

scarrazza commented 4 years ago

In summary, I believe there are several directions (by relevance):

  1. Simulate the maximum number of qubits that can be loaded in the CPU memory, by creating chunks of data which are streamed to the GPU devices (see the sketch after this list). Hypothesis: the multi-GPU approach will be faster than CPU.
  2. Simulate an "infinite" number of qubits by storing the state to disk using e.g. hdf5. Hypothesis: a cost effective solution for people without clusters or powerful hardware.
  3. Study the circuits/algorithms which may benefit from a pure partial/many_swaps approach. Hypothesis: identify algorithms/circuits which may benefit from this split.
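
As a rough sketch of what direction 1 could look like (just an assumption about the design, with a hypothetical apply_gate standing in for the einsum call): keep the state in CPU RAM as a list of chunks and stream each chunk to a GPU, reusing devices when there are more chunks than GPUs.

import tensorflow as tf

def apply_gate_streamed(state_chunks, gate, devices, apply_gate):
    """Apply `gate` to every chunk of a CPU-resident state, one GPU at a time."""
    new_chunks = []
    for i, chunk in enumerate(state_chunks):
        device = devices[i % len(devices)]  # reuse GPUs when there are more chunks than devices
        with tf.device(device):
            new_chunks.append(apply_gate(chunk, gate))
    with tf.device("/CPU:0"):
        # bring the results back to CPU RAM before the next gate
        return [tf.identity(chunk) for chunk in new_chunks]
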
stavros11 commented 4 years ago

Thank you for listing the important points. For the next few days I think it is good if we focus on number 1 and move on to 2 only when we have an acceptable implementation based on CPU RAM. I think 3 may or may not come up as we are working on 1. Note however that streaming batches to GPUs may make a partial-like approach impossible if the same GPU is to be used twice in the same calculation, but there is no other way to unlock the number of qubits.

My plan is to try implementing a toy version of 1 based on the benchmark scripts I already have. In its most basic form, this should be equivalent to the full script I already have, but generalized to an arbitrary number of devices (not necessarily distinct). With a bit more thinking it might be possible to avoid recasting the state pieces on the CPU after every gate, which should improve performance to the many_swaps/partial levels. Once I have this working I will try to do QFT benchmarks up to 31-32 qubits (reusing one/two GPUs) and compare with CPU only.

Let me know if you have any other suggestions.

scarrazza commented 4 years ago

I fully agree with your plan. Point 1 is the priority and the target goal for the time being.

stavros11 commented 4 years ago

Short progress update: I extended the full script, which keeps the state on CPU and uses the GPUs for calculation, to allow reusing each GPU, in order to unlock the number of qubits (code in the many_devices branch of my other repo).

I redid the Hadamards benchmark with GPU memory limited to 2GB (to have shorter runs and not interfere with running processes). With 2GB, a single GPU can run up to 24 qubits, so two GPUs are required for 25, four for 26, etc. Indeed the multi-GPU setup seems better than using the CPU only:

nqubits CPU (sec) multi-GPU (sec)
24 9.837 0.171 (single GPU)
25 21.215 8.024 (two GPUs)
26 45.314 14.327 (four GPUs)

where four GPUs = reusing the two GPUs twice each. The numbers are for complex128, compiled, without joblib on the GPUs, measuring the second run.

My main issue before moving to a more practical benchmark (e.g. QFT) is that the approach fails for more qubits: e.g. if I try to do 27 qubits / 8 GPUs (reusing each GPU four times), it runs out of memory. So, even though in principle this approach should work with any nqubits, this is not the case in practice...

scarrazza commented 4 years ago

Amazing results, I imagine the performance will be even more striking when the GPUs are idle. Maybe this memory issue is correlated somehow with #62; it looks like the factor 5 is also there, right?

stavros11 commented 4 years ago

Following #62 here, I have to add that the CPU measurements are with tf.einsum, so perhaps using Cirq, which is among the best in terms of CPU performance, may be a fairer comparison.

Regarding the factor 5, I cannot say for sure if it appears here, because it was measured on CPU runs. What I can do is play with the memory limiter and see when it fails. For 25 qubits (= 512MB complex128 state vector) on a single GPU it works with the limiter set to 2561MB but fails at 2560MB. From this it looks like the factor 5 appears on GPU as well.
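
For reference, the limiter here is just TensorFlow's virtual device configuration; a minimal sketch of the assumed setup, capping each GPU before any allocation happens:

import tensorflow as tf

memory_limit = 2560  # MB, the value being varied in the test above
for gpu in tf.config.list_physical_devices("GPU"):
    # create a single logical device with a fixed memory cap on each physical GPU
    tf.config.experimental.set_virtual_device_configuration(
        gpu,
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=memory_limit)],
    )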

scarrazza commented 4 years ago

Together with the memory limiter you could also use tf.config.experimental.set_memory_growth(gpu, False)

stavros11 commented 4 years ago

Here are some timings I did on the free GPUs without memory limiter:

nqubits 28 29 30
CPU --- --- ---
Cirq (complex64) 38.1186 73.1375 150.2487
Qibo tf2.2.0rc3 62.6856 128.2284 263.0199
Multi-GPU tf2.2.0rc3 --- --- ---
eager 127.3192 (x2) 306.2534 (x16)
eager (best case) 16.8838 (x2) 34.5439 (x16)
joblib 121.7111 (x2) 299.3323 (x16)
joblib (best case) 15.7139 (x2) 34.4335 (x16)
compiled 54.04897 (x2) 174.7214 (x128)
compiled (best case) 11.1155 (x4)
Multi-GPU tf.2.1.0 --- --- ---
eager 51.4750 (x2) 131.8495 (x8) 324.4805 (x32)
eager (best case) 11.4547 (x4) 24.83086 (x8)
joblib 46.8891 (x2) 124.6499 (x8) 309.9851 (x32)
joblib (best case) 11.1185 (x4) 23.6628 (x16)
compiled 37.1277 (x2) 124.2385 (x16) 319.8445 (x64)
compiled (best case) 10.58042 (x4) 23.8628 (x32)

scarrazza commented 4 years ago

@stavros11 thanks for these numbers.

As next step, I propose the following:

  1. Recompute on DOM the CPU numbers using just 8 cores (taskset -c 0-7 python ...).