Closed: NoureldinYosri closed this pull request 1 year ago.
@95-martin-orion This PR is just a workaround for https://github.com/quantumlib/Cirq/issues/6031 until numpy starts to support more than 32 dimensions (https://github.com/numpy/numpy/issues/5744).
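For context, the limit being worked around is easy to demonstrate: an n-qubit state vector is naturally shaped (2,) * n, but the numpy versions discussed here cap ndarray dimensions at 32. A minimal illustration (length-1 axes, so nothing of note is allocated):

import numpy as np

np.zeros((1,) * 32)  # 32 axes: allowed
try:
    np.zeros((1,) * 33)  # 33 axes: raises ValueError on numpy builds with the 32-dim cap
except ValueError as err:
    print(err)  # e.g. "maximum supported dimension for an ndarray is 32, found 33"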
@95-martin-orion
Some docstring requests, otherwise this LGTM. A new qsimcirq version is necessary to make this generally available - would you like me to cut a new release?
yes, please :smile:
@95-martin-orion @sergeisakov PTAL, I fixed all CIs as explained in the description. Could you re-add the kokoro:run label?
Thank you for the myriad fixes, @NoureldinYosri !
Logs for the Kokoro error can be found here. I unfortunately don't have much context on this, though I do know that the Kokoro tests are not affected by the bazeltest.yml file.
@95-martin-orion From the logs:
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
It tries to download an old version of the TF runtime that no longer exists: https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz
The versions that are still hosted on storage.googleapis.com/mirror.tensorflow.org are listed at http://mirror.tensorflow.org/. Where does it decide to go for that specific version of the runtime?
Looking deeper in the logs, it looks like it bypasses that error and then gets a cuda11 environment, but then decides to look for cuda12:
ERROR: An error occurred during the fetch of repository 'ubuntu20.04-gcc9_manylinux2014-cuda11.2-cudnn8.1-tensorrt7.2_config_cuda':
...
No library found under: /usr/local/cuda-12.2/targets/x86_64-linux/lib/libcupti.so.12.2
This looks to be the real problem.
[...] Where does it decide to go for that specific version of the runtime?
The files for this are stored in Google-internal repositories - I'll email you the links.
@NoureldinYosri thank you once again for this feature. May I know the timeline for the next release for qsim?
@rht qsim releases are on an "as-needed" basis, which I think this qualifies for. I've opened #631 to cut the release.
@rht A new release has been cut and should be visible on pypi in the next 10-20 minutes.
I see, thank you, just in time to do huge statevector for Halloween!
@NoureldinYosri there was a delay in using this feature in our production instances. We were waiting for the cuQuantum Appliance to have qsimcirq>=0.17.x (https://github.com/NVIDIA/cuQuantum/issues/98), but it hasn't happened.
But I was able to test this PR by patching it directly onto qsimcirq 0.15.0 on cuQuantum Appliance 23.10. I am running a 2xA100 instance with the following code:
import time
from memory_profiler import memory_usage
import cirq
import qsimcirq

def f():
    num_qubits = 33
    qc_cirq = cirq.Circuit()
    qubits = cirq.LineQubit.range(num_qubits)
    for i in range(num_qubits):
        qc_cirq.append(cirq.H(qubits[i]))
    sim = qsimcirq.QSimSimulator()
    tic = time.time()
    # sim = cirq.Simulator()
    print("?", sim.simulate_into_1d_array)
    sim.simulate_into_1d_array(qc_cirq)
    print("Elapsed", time.time() - tic)

# print("Max memory", max(memory_usage(f)))
f()
but still got this OOM error:
? <bound method QSimSimulator.simulate_into_1d_array of <qsimcirq.qsim_simulator.QSimSimulator object at 0x7f9e9ac62770>>
CUDA error: out of memory vector_mgpu.h 116
Here is the benchmark result for 32 qubits (haven't measured GPU memory usage from nvidia-smi yet):
Elapsed 14.033143758773804
Max memory 34182.4296875
Here is the manual patch I applied:
535c535
<     def simulate_sweep_iter(
---
>     def _simulate_impl(
541,570c541
<     ) -> Iterator[cirq.StateVectorTrialResult]:
<         """Simulates the supplied Circuit.
<
<         This method returns a result which allows access to the entire
<         wave function. In contrast to simulate, this allows for sweeping
<         over different parameter values.
<
<         Avoid using this method with `use_gpu=True` in the simulator options;
<         when used with GPU this method must copy state from device to host memory
<         multiple times, which can be very slow. This issue is not present in
<         `simulate_expectation_values_sweep`.
<
<         Args:
<             program: The circuit to simulate.
<             params: Parameters to run with the program.
<             qubit_order: Determines the canonical ordering of the qubits. This is
<                 often used in specifying the initial state, i.e. the ordering of the
<                 computational basis states.
<             initial_state: The initial state for the simulation. This can either
<                 be an integer representing a pure state (e.g. 11010) or a numpy
<                 array containing the full state vector. If none is provided, this
<                 is assumed to be the all-zeros state.
<
<         Returns:
<             List of SimulationTrialResults for this run, one for each
<             possible parameter resolver.
<
<         Raises:
<             TypeError: if an invalid initial_state is provided.
<         """
---
>     ) -> Iterator[Tuple[cirq.ParamResolver, np.ndarray, Sequence[int]]]:
625a597,649
>             yield prs, qsim_state.view(np.complex64), cirq_order
>
>     def simulate_into_1d_array(
>         self,
>         program: cirq.AbstractCircuit,
>         param_resolver: cirq.ParamResolverOrSimilarType = None,
>         qubit_order: cirq.QubitOrderOrList = cirq.ops.QubitOrder.DEFAULT,
>         initial_state: Any = None,
>     ) -> Tuple[cirq.ParamResolver, np.ndarray, Sequence[int]]:
>         """Same as simulate() but returns raw simulation result without wrapping it.
>         The returned result is not wrapped in a StateVectorTrialResult but can be used
>         to create a StateVectorTrialResult.
>         Returns:
>             Tuple of (param resolver, final state, qubit order)
>         """
>         params = cirq.study.ParamResolver(param_resolver)
>         return next(self._simulate_impl(program, params, qubit_order, initial_state))
>
>     def simulate_sweep_iter(
>         self,
>         program: cirq.Circuit,
>         params: cirq.Sweepable,
>         qubit_order: cirq.QubitOrderOrList = cirq.QubitOrder.DEFAULT,
>         initial_state: Optional[Union[int, np.ndarray]] = None,
>     ) -> Iterator[cirq.StateVectorTrialResult]:
>         """Simulates the supplied Circuit.
>         This method returns a result which allows access to the entire
>         wave function. In contrast to simulate, this allows for sweeping
>         over different parameter values.
>         Avoid using this method with `use_gpu=True` in the simulator options;
>         when used with GPU this method must copy state from device to host memory
>         multiple times, which can be very slow. This issue is not present in
>         `simulate_expectation_values_sweep`.
>         Args:
>             program: The circuit to simulate.
>             params: Parameters to run with the program.
>             qubit_order: Determines the canonical ordering of the qubits. This is
>                 often used in specifying the initial state, i.e. the ordering of the
>                 computational basis states.
>             initial_state: The initial state for the simulation. This can either
>                 be an integer representing a pure state (e.g. 11010) or a numpy
>                 array containing the full state vector. If none is provided, this
>                 is assumed to be the all-zeros state.
>         Returns:
>             Iterator over SimulationTrialResults for this run, one for each
>             possible parameter resolver.
>         Raises:
>             TypeError: if an invalid initial_state is provided.
>         """
>
>         for prs, state_vector, cirq_order in self._simulate_impl(
>             program, params, qubit_order, initial_state
>         ):
627c651
<                 initial_state=qsim_state.view(np.complex64), qubits=cirq_order
---
>                 initial_state=state_vector, qubits=cirq_order
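For reference, a minimal sketch of how the new method from this patch is called (the small qubit count is just for illustration; the point is that the returned state vector is flat, so the 32-dimension cap never applies):

import cirq
import qsimcirq

qubits = cirq.LineQubit.range(3)
circuit = cirq.Circuit(cirq.H.on_each(*qubits))
sim = qsimcirq.QSimSimulator()
# Per the docstring above: returns (param resolver, final state, qubit order)
# without wrapping the result in a StateVectorTrialResult.
prs, state_vector, order = sim.simulate_into_1d_array(circuit)
print(state_vector.shape)  # (8,): always 1D, regardless of qubit count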
Something is still consuming much more GPU memory than in the past. I used to be able to do 33 qubits on a 2xA100 instance.
$ nvidia-smi
Thu Feb  8 00:07:04 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    63W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0    61W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
For 32 qubits we have a state vector of $2^{32}$ complex entries, each of which is 2 float32 numbers, or 8 bytes, so we should expect usage of at least $2^{35}$ bytes, or 32 GB. The value in https://github.com/quantumlib/qsim/pull/623#issuecomment-1933141446 is 34182.4296875 MB, or about 34.1 GB, so we are only using about 2.1 GB more memory than the minimum necessary, which I suppose is consumed by numpy overhead, other variables, and auxiliary variables that will eventually be cleaned up by the garbage collector.
Are you sure you could do 33 qubits on this machine? The same calculation gives $2^{36}$ bytes, or 64 GB, for 33 qubits, and per https://www.aime.info/en/shop/product/aime-gpu-cloud-v242xa100/?pid=V28-2XA100-D1 a 2xA100 instance has only 40 GB of RAM per GPU.
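For concreteness, the same arithmetic in code (the helper name is just for illustration):

def min_state_vector_gib(n_qubits: int) -> float:
    # complex64 entry = two float32s = 8 bytes; 2**n entries.
    return (2 ** n_qubits) * 8 / 2 ** 30

print(min_state_vector_gib(32))  # 32.0 GiB
print(min_state_vector_gib(33))  # 64.0 GiB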
Are you sure you could do 33 qubits on this machine?
Yes, we are able to do so on qsimcirq==0.12.1 via the cuQuantum Appliance, which has a multi-GPU backend. Hence, 2x40 GB is more than enough for the 64 GB requirement of 33 qubits.
I am in the process of measuring the max GPU memory consumption by polling nvidia-smi in the background while the simulation is running, but this will take a while since I have terminated the instance and will have to wait until there is an open slot for the 2xA100 instance.
Update: all is good! I am able to run 33 qubits on the 2xA100 instance. I confirm this PR works.
The bug in the code in https://github.com/quantumlib/qsim/pull/623#issuecomment-1933141446 was that I forgot to specify:

options = qsimcirq.QSimOptions(gpu_mode=2)
sim = qsimcirq.QSimSimulator(options)
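For anyone reproducing this, a sketch of the earlier benchmark with that option applied (assuming, per the discussion above, that gpu_mode=2 selects the cuQuantum Appliance's multi-GPU backend):

import time
import cirq
import qsimcirq

num_qubits = 33
qubits = cirq.LineQubit.range(num_qubits)
circuit = cirq.Circuit(cirq.H.on_each(*qubits))
options = qsimcirq.QSimOptions(gpu_mode=2)  # multi-GPU; omitting this caused the OOM above
sim = qsimcirq.QSimSimulator(options)
tic = time.time()
sim.simulate_into_1d_array(circuit)
print("Elapsed", time.time() - tic)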
My measurements (I'm not sure why the GPU memory is that low, but anyway, it works):
CPU only:
num_qubits 32
Elapsed 114.16660451889038
Peak GPU memory usage: 3 MiB
Max CPU memory 33086.91015625
GPU:
num_qubits 31
Elapsed 14.6939697265625
Peak GPU memory usage: 425 MiB
Max CPU memory 16830.81640625
num_qubits 32
Elapsed 28.458886861801147
Peak GPU memory usage: 425 MiB
Max CPU memory 33174.0078125
num_qubits 33
Elapsed 17.026336431503296
Peak GPU memory usage: 853 MiB
Max CPU memory 67345.63671875
The GPU memory is measured by reading the output of nvidia-smi --query-gpu=memory.used --format=csv.
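A minimal sketch of that polling approach, assuming the csv,noheader,nounits output format of nvidia-smi and the 0.01 s interval mentioned below; the helper and its name are illustrative:

import subprocess
import threading
import time

peak_mib = 0

def poll_gpu_memory(stop: threading.Event, interval_s: float = 0.01) -> None:
    """Sample memory.used for every GPU and track the maximum value seen."""
    global peak_mib
    while not stop.is_set():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            text=True,
        )
        peak_mib = max(peak_mib, *(int(v) for v in out.split()))
        time.sleep(interval_s)

stop = threading.Event()
thread = threading.Thread(target=poll_gpu_memory, args=(stop,))
thread.start()
# ... run the simulation here ...
stop.set()
thread.join()
print("Peak GPU memory usage:", peak_mib, "MiB")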
(I'm not sure why the GPU memory is that low, but anyway, it works)
My guess is that the time spent on the GPU is somewhat shorter than the interval at which nvidia-smi measures the VRAM (0.01 s).
This is to avoid the numpy limit on the number of dimensions (https://github.com/quantumlib/Cirq/issues/6031). The 1D representation should only be used when the number of qubits is greater than the numpy limit on the number of dimensions, currently set to 32 (https://github.com/numpy/numpy/issues/5744).
Fixes https://github.com/quantumlib/Cirq/issues/6031