Open BrunoLiegiBastonLiegi opened 1 week ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.94%. Comparing base (68bdb98) to head (d16f5e3).
@BrunoLiegiBastonLiegi you don't have to use ascii characters for `einsum`:

> `einsum` also provides an alternative way to provide the subscripts and operands as `einsum(op0, sublist0, op1, sublist1, ..., [sublistout])`. If the output shape is not provided in this format `einsum` will be calculated in implicit mode, otherwise it will be performed explicitly. The examples below have corresponding `einsum` calls with the two parameter methods.

https://numpy.org/doc/stable/reference/generated/numpy.einsum.html#numpy-einsum

This way you can use integers instead of characters, which is much better if you are generating the `einsum` input (and you're not limited by the number of characters).

(This is not a suggestion to ban characters for `einsum` in every situation: they are handy when there are only a few and you're writing them manually, since that's much more readable. But the case in which you spell out a small operation by hand and the one in which you generate a complex one programmatically have to be handled in different ways, and luckily NumPy offers both.)
One way we could move forward is to return generators for `n >= 8` and let the user do their own memory management by accessing the elements on the fly.
> One way we could move forward is to return generators for `n >= 8` and let the user do their own memory management by accessing the elements on the fly.
mmh yeah, that's a possibility, but only in certain cases. For the sake of parallelization, having everything represented by a single tensor contraction is surely better, but you are limited by memory. Maybe what we could do is define the generator of a single element of the basis for each backend:

```python
from functools import cache

@cache
def _pauli_basis_element(self, i, nqubits):
    # do things according to backend
    ...
```

and then you can build the complete basis as a generator, as you suggested:

```python
def pauli_basis(...):
    return (backend._pauli_basis_element(i, nqubits) for i in range(4**nqubits))
```

but GPU-wise this will be less efficient; it will work, anyway, as long as you don't need the complete basis at the same time.
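A minimal NumPy sketch of this idea (the helper names here are hypothetical, and a real implementation would dispatch to the backend instead of plain NumPy):

```python
from functools import cache
import numpy as np

# Single-qubit Pauli matrices: I, X, Y, Z
PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

@cache
def pauli_basis_element(i, nqubits):
    # Decode index i as a base-4 Pauli string and Kronecker it together
    element = np.array([[1.0 + 0j]])
    for q in range(nqubits):
        digit = (i // 4 ** (nqubits - 1 - q)) % 4
        element = np.kron(element, PAULIS[digit])
    return element

def pauli_basis(nqubits):
    # Lazily yields the 4**nqubits basis elements one at a time,
    # so the full basis never has to live in memory at once
    return (pauli_basis_element(i, nqubits) for i in range(4**nqubits))
```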
The Hilbert space is too big; unfortunately there's no way around that. I wouldn't spend much time on this, since there are bigger priorities in terms of optimization.
In the end I was not able to make the `numba`-parallelized implementation work; it was working with `parallel=False`, but there was no speedup in that case. Thus I just rolled back to using `einsum` in every case. We are still getting a nice scaling with GPUs. I will get rid of the ascii characters now.
Ok, I ended up also adding a generalization to `qibo.quantum_info.superoperator_transformation.vectorization`, which now accepts batches of state vectors or density matrices as well, thus allowing the whole basis to be vectorized in one shot rather than looping. In the process, I found a weird behaviour of `tensorflow`. Namely, if you have a complex tensor, the elements with zero real part but non-zero imaginary part are not recovered by `nonzero`:
```python
import numpy as np
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

a = tf.Variable([0 + 1j, 1 + 0j, 1 + 1j])
tnp.nonzero(a)
# this finds the last two only
# [<tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>]
np.nonzero(a)
# numpy gives the correct result
# (array([0, 1, 2]),)
```
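The batched generalization of vectorization can be sketched roughly like this (a simplified, hypothetical version using plain NumPy and row-ordering; the actual qibo function also handles other orderings and backends):

```python
import numpy as np

def vectorization(operators):
    # Accepts a single (d, d) density matrix or a batch of shape (batch, d, d)
    # and flattens the last two dimensions into one of length d**2,
    # so a whole basis can be vectorized without an explicit Python loop.
    operators = np.asarray(operators)
    return operators.reshape(*operators.shape[:-2], -1)

rho = np.eye(2) / 2
batch = np.stack([np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])])

vectorization(rho).shape    # (4,)
vectorization(batch).shape  # (2, 4)
```

Because `reshape` only touches the trailing axes, the same call works unchanged on a single operator or on an arbitrarily large batch.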
This improves the construction of the pauli basis by moving everything to tensor notation and removing loops. Namely, `basis_full` is now constructed via contraction through `einsum`. This should also scale well with GPU backends; weirdly, for standard numpy on CPU there is no speedup. These are the results:

The GPU always takes ~1s to set up. With 8 qubits, `old` gets killed together with the shell session, whereas `new`, both CPU and GPU, raises an out of memory error. To run 8 qubits you apparently need ~64GB of memory.

~~To perform the einsum I use all the 48 ascii characters available, which means that we are limited to 48/3 = 16 qubits, unless other characters can be used in einsum. In any case, the memory requirements for 16 qubits are probably going to be very taxing.~~ I am using integers as the indices for the `einsum` now, thus this is not limited anymore by the number of ascii characters. It may be possible to obtain a speedup with numba as well, but I still have to investigate.

EDIT 1: unfortunately `einsum` is not supported by `numba`; however, if you `jit` the old implementation:

~~you are able to get a nice speedup, ~5s for 7 qubits; however, this happens from the second call, since the first time you are still dealing with the compilation~~ unfortunately this implementation was yielding wrong results, thus I had to roll back to the `einsum` approach. Further investigation is needed to understand whether it's possible to parallelize and improve this with `numba`.

EDIT 2: At some point I realized that a possible bottleneck, or rather inefficiency, was due to `qibo.quantum_info.superoperator_transformation.vectorization`, which could be run on a single input only, either a state vector or a density matrix, thus forcing loops over each element of the basis (which can grow large very quickly). The impact on the runtime is still marginal for the CPU, as for 7 qubits its contribution was around ~1-2s out of 10s, but for GPUs this starts becoming relevant. Furthermore, `vectorization` appears to be widely used in the `quantum_info` module, which convinced me to generalize it to accept batches of state vectors or density matrices, therefore lifting the need for explicit loops and instead leveraging tensor primitives directly. This was applied to the `pauli_basis` for now, but it has to be propagated throughout the whole module.

Checklist: