Closed: NoureldinYosri closed this pull request 1 year ago.
@95-martin-orion This PR is just a workaround for https://github.com/quantumlib/Cirq/issues/6031 until numpy starts to support more than 32 dimensions (https://github.com/numpy/numpy/issues/5744).
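For context, the limit being worked around is easy to demonstrate: an n-qubit state vector is naturally shaped (2,) * n, but the numpy versions discussed here cap ndarray dimensions at 32. A minimal illustration (length-1 axes, so nothing of note is allocated):

import numpy as np

np.zeros((1,) * 32)  # 32 axes: allowed
try:
    np.zeros((1,) * 33)  # 33 axes: raises ValueError on numpy builds with the 32-dim cap
except ValueError as err:
    print(err)  # e.g. "maximum supported dimension for an ndarray is 32, found 33"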
@95-martin-orion
Some docstring requests, otherwise this LGTM. A new qsimcirq version is necessary to make this generally available - would you like me to cut a new release?
yes, please :smile:
@95-martin-orion @sergeisakov PTAL, I fixed all CIs as explained in the description. Could you re-add the kokoro:run label?
Thank you for the myriad fixes, @NoureldinYosri !
Logs for the Kokoro error can be found here. I unfortunately don't have much context on this, though I do know that the Kokoro tests are not affected by the bazeltest.yml file.
@95-martin-orion From the logs:
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
It tries to download an old version of the TF runtime that no longer exists: https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz
The versions that are still hosted on storage.googleapis.com/mirror.tensorflow.org are listed at http://mirror.tensorflow.org/. Where does it decide to go for that specific version of the runtime?
Looking deeper in the logs, it looks like it bypasses that error and then gets a cuda11 environment, but then decides to look for cuda12:
ERROR: An error occurred during the fetch of repository 'ubuntu20.04-gcc9_manylinux2014-cuda11.2-cudnn8.1-tensorrt7.2_config_cuda':
...
No library found under: /usr/local/cuda-12.2/targets/x86_64-linux/lib/libcupti.so.12.2
This looks to be the real problem.
[...] Where does it decide to go for that specific version of the runtime?
The files for this are stored in Google-internal repositories - I'll email you the links.
@NoureldinYosri thank you once again for this feature. May I know the timeline for the next release for qsim?
@rht qsim releases are on an "as-needed" basis, which I think this qualifies for. I've opened #631 to cut the release.
@rht A new release has been cut and should be visible on pypi in the next 10-20 minutes.
I see, thank you, just in time to do huge statevector for Halloween!
@NoureldinYosri there was a delay in using this feature in our production instances. We were waiting for the cuQuantum Appliance to have qsimcirq>=0.17.x (https://github.com/NVIDIA/cuQuantum/issues/98), but it hasn't happened.
But I was able to test this PR by patching it directly onto qsimcirq 0.15.0 on cuQuantum Appliance 23.10. I am running a 2xA100 instance with the following code:
import time
from memory_profiler import memory_usage
import cirq
import qsimcirq

def f():
    num_qubits = 33
    qc_cirq = cirq.Circuit()
    qubits = cirq.LineQubit.range(num_qubits)
    for i in range(num_qubits):
        qc_cirq.append(cirq.H(qubits[i]))
    sim = qsimcirq.QSimSimulator()
    tic = time.time()
    # sim = cirq.Simulator()
    print("?", sim.simulate_into_1d_array)
    sim.simulate_into_1d_array(qc_cirq)
    print("Elapsed", time.time() - tic)

# print("Max memory", max(memory_usage(f)))
f()
but still got this OOM error:
? <bound method QSimSimulator.simulate_into_1d_array of <qsimcirq.qsim_simulator.QSimSimulator object at 0x7f9e9ac62770>>
CUDA error: out of memory vector_mgpu.h 116
Here is the benchmark result for 32 qubits (haven't measured GPU memory usage from nvidia-smi yet):
Elapsed 14.033143758773804
Max memory 34182.4296875
Here is the manual patch I applied:
535c535
<     def simulate_sweep_iter(
---
>     def _simulate_impl(
541,570c541
<     ) -> Iterator[cirq.StateVectorTrialResult]:
<         """Simulates the supplied Circuit.
<
<         This method returns a result which allows access to the entire
<         wave function. In contrast to simulate, this allows for sweeping
<         over different parameter values.
<
<         Avoid using this method with `use_gpu=True` in the simulator options;
<         when used with GPU this method must copy state from device to host memory
<         multiple times, which can be very slow. This issue is not present in
<         `simulate_expectation_values_sweep`.
<
<         Args:
<             program: The circuit to simulate.
<             params: Parameters to run with the program.
<             qubit_order: Determines the canonical ordering of the qubits. This is
<                 often used in specifying the initial state, i.e. the ordering of the
<                 computational basis states.
<             initial_state: The initial state for the simulation. This can either
<                 be an integer representing a pure state (e.g. 11010) or a numpy
<                 array containing the full state vector. If none is provided, this
<                 is assumed to be the all-zeros state.
<
<         Returns:
<             List of SimulationTrialResults for this run, one for each
<             possible parameter resolver.
<
<         Raises:
<             TypeError: if an invalid initial_state is provided.
<         """
---
>     ) -> Iterator[Tuple[cirq.ParamResolver, np.ndarray, Sequence[int]]]:
625a597,649
>             yield prs, qsim_state.view(np.complex64), cirq_order
>
>     def simulate_into_1d_array(
>         self,
>         program: cirq.AbstractCircuit,
>         param_resolver: cirq.ParamResolverOrSimilarType = None,
>         qubit_order: cirq.QubitOrderOrList = cirq.ops.QubitOrder.DEFAULT,
>         initial_state: Any = None,
>     ) -> Tuple[cirq.ParamResolver, np.ndarray, Sequence[int]]:
>         """Same as simulate() but returns raw simulation result without wrapping it.
>         The returned result is not wrapped in a StateVectorTrialResult but can be used
>         to create a StateVectorTrialResult.
>         Returns:
>             Tuple of (param resolver, final state, qubit order)
>         """
>         params = cirq.study.ParamResolver(param_resolver)
>         return next(self._simulate_impl(program, params, qubit_order, initial_state))
>
>     def simulate_sweep_iter(
>         self,
>         program: cirq.Circuit,
>         params: cirq.Sweepable,
>         qubit_order: cirq.QubitOrderOrList = cirq.QubitOrder.DEFAULT,
>         initial_state: Optional[Union[int, np.ndarray]] = None,
>     ) -> Iterator[cirq.StateVectorTrialResult]:
>         """Simulates the supplied Circuit.
>         This method returns a result which allows access to the entire
>         wave function. In contrast to simulate, this allows for sweeping
>         over different parameter values.
>         Avoid using this method with `use_gpu=True` in the simulator options;
>         when used with GPU this method must copy state from device to host memory
>         multiple times, which can be very slow. This issue is not present in
>         `simulate_expectation_values_sweep`.
>         Args:
>             program: The circuit to simulate.
>             params: Parameters to run with the program.
>             qubit_order: Determines the canonical ordering of the qubits. This is
>                 often used in specifying the initial state, i.e. the ordering of the
>                 computational basis states.
>             initial_state: The initial state for the simulation. This can either
>                 be an integer representing a pure state (e.g. 11010) or a numpy
>                 array containing the full state vector. If none is provided, this
>                 is assumed to be the all-zeros state.
>         Returns:
>             Iterator over SimulationTrialResults for this run, one for each
>             possible parameter resolver.
>         Raises:
>             TypeError: if an invalid initial_state is provided.
>         """
>
>         for prs, state_vector, cirq_order in self._simulate_impl(
>             program, params, qubit_order, initial_state
>         ):
627c651
<                 initial_state=qsim_state.view(np.complex64), qubits=cirq_order
---
>                 initial_state=state_vector, qubits=cirq_order
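For reference, a minimal sketch of how the new method from this patch is called (the small qubit count is just for illustration; the point is that the returned state vector is flat, so the 32-dimension cap never applies):

import cirq
import qsimcirq

qubits = cirq.LineQubit.range(3)
circuit = cirq.Circuit(cirq.H.on_each(*qubits))
sim = qsimcirq.QSimSimulator()
# Per the docstring above: returns (param resolver, final state, qubit order)
# without wrapping the result in a StateVectorTrialResult.
prs, state_vector, order = sim.simulate_into_1d_array(circuit)
print(state_vector.shape)  # (8,): always 1D, regardless of qubit count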
Something is still consuming much more GPU memory than in the past. I used to be able to do 33 qubits on a 2xA100 instance.
$ nvidia-smi
Thu Feb  8 00:07:04 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    63W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0    61W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
For 32 qubits we have a state vector of $2^{32}$ complex entries, each of which is 2 float32 numbers, or 8 bytes, so we should expect usage of at least $2^{35}$ bytes, or 32 GB. The value in https://github.com/quantumlib/qsim/pull/623#issuecomment-1933141446 is 34182.4296875 MB, or about 34.1 GB, so we are only using about 2.1 GB more memory than the minimum necessary, which I suppose is consumed by numpy overhead, other variables, and auxiliary variables that will eventually be cleaned up by the garbage collector.
Are you sure you could do 33 qubits on this machine? The same calculation gives $2^{36}$ bytes, or 64 GB, for 33 qubits, and per https://www.aime.info/en/shop/product/aime-gpu-cloud-v242xa100/?pid=V28-2XA100-D1 a 2xA100 instance has only 40 GB of RAM per GPU.
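For concreteness, the same arithmetic in code (the helper name is just for illustration):

def min_state_vector_gib(n_qubits: int) -> float:
    # complex64 entry = two float32s = 8 bytes; 2**n entries.
    return (2 ** n_qubits) * 8 / 2 ** 30

print(min_state_vector_gib(32))  # 32.0 GiB
print(min_state_vector_gib(33))  # 64.0 GiB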
Are you sure you could do 33 qubits on this machine?
Yes, we are able to do so on qsimcirq==0.12.1 via the cuQuantum Appliance, which has a multi-GPU backend. Hence, 2x40 GB is more than enough for the 64 GB requirement of 33 qubits.
I am in the process of measuring the max GPU memory consumption by polling nvidia-smi in the background while the simulation is running, but this will take a while since I have terminated the instance and will have to wait until there is an open slot for the 2xA100 instance.
Update: all is good! I am able to run 33 qubits on the 2xA100 instance. I confirm this PR works.
The bug in the code in https://github.com/quantumlib/qsim/pull/623#issuecomment-1933141446 was that I forgot to specify:

options = qsimcirq.QSimOptions(gpu_mode=2)
sim = qsimcirq.QSimSimulator(options)
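For anyone reproducing this, a sketch of the earlier benchmark with that option applied (assuming, per the discussion above, that gpu_mode=2 selects the cuQuantum Appliance's multi-GPU backend):

import time
import cirq
import qsimcirq

num_qubits = 33
qubits = cirq.LineQubit.range(num_qubits)
circuit = cirq.Circuit(cirq.H.on_each(*qubits))
options = qsimcirq.QSimOptions(gpu_mode=2)  # multi-GPU; omitting this caused the OOM above
sim = qsimcirq.QSimSimulator(options)
tic = time.time()
sim.simulate_into_1d_array(circuit)
print("Elapsed", time.time() - tic)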
My measurements (I'm not sure why the GPU memory is that low, but anyway, it works):
CPU only:
num_qubits 32
Elapsed 114.16660451889038
Peak GPU memory usage: 3 MiB
Max CPU memory 33086.91015625
GPU:
num_qubits 31
Elapsed 14.6939697265625
Peak GPU memory usage: 425 MiB
Max CPU memory 16830.81640625
num_qubits 32
Elapsed 28.458886861801147
Peak GPU memory usage: 425 MiB
Max CPU memory 33174.0078125
num_qubits 33
Elapsed 17.026336431503296
Peak GPU memory usage: 853 MiB
Max CPU memory 67345.63671875
The GPU memory is measured by reading the output of nvidia-smi --query-gpu=memory.used --format=csv.
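A minimal sketch of that polling approach, assuming the csv,noheader,nounits output format of nvidia-smi and the 0.01 s interval mentioned below; the helper and its name are illustrative:

import subprocess
import threading
import time

peak_mib = 0

def poll_gpu_memory(stop: threading.Event, interval_s: float = 0.01) -> None:
    """Sample memory.used for every GPU and track the maximum value seen."""
    global peak_mib
    while not stop.is_set():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            text=True,
        )
        peak_mib = max(peak_mib, *(int(v) for v in out.split()))
        time.sleep(interval_s)

stop = threading.Event()
thread = threading.Thread(target=poll_gpu_memory, args=(stop,))
thread.start()
# ... run the simulation here ...
stop.set()
thread.join()
print("Peak GPU memory usage:", peak_mib, "MiB")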
(I'm not sure why the GPU memory is that low, but anyway, it works)
My guess is that the time spent on the GPU is somewhat shorter than the interval at which nvidia-smi measures the VRAM (0.01 s).
This is to avoid the numpy limit on the number of dimensions (https://github.com/quantumlib/Cirq/issues/6031). The 1D representation should only be used when the number of qubits is greater than the numpy limit on the number of dimensions, currently set to 32 (https://github.com/numpy/numpy/issues/5744).
Fixes https://github.com/quantumlib/Cirq/issues/6031