qiboteam / qibo

A framework for quantum computing
https://qibo.science
Apache License 2.0

Optimizations for the `qibo.quantum_info.basis.pauli_basis` and `vectorization` function #1459

Open BrunoLiegiBastonLiegi opened 1 week ago

BrunoLiegiBastonLiegi commented 1 week ago

This improves the construction of the Pauli basis by moving everything to tensor notation and removing loops. Namely, `basis_full` is now constructed via a contraction through `einsum`. This should also scale well with GPU backends; oddly, for the standard NumPy CPU backend there is no speedup. These are the results:

| nqubits | old (numpy) | new (numpy) | new (cupy) |
|---|---|---|---|
| 4 | ~ 1s | ~ 1s | ~ 2.7s |
| 7 | ~ 9.5s | ~ 9.1s | ~ 3.5s |
| 8 | killed | out of memory | out of memory |

The GPU always takes ~1s to set up. With 8 qubits, the old implementation gets killed together with the shell session, whereas the new one, on both CPU and GPU, raises an out-of-memory error. To run 8 qubits you apparently need ~64GB of memory.

~~To perform the einsum I use all the 48 ASCII characters available, which means that we are limited to 48/3 = 16 qubits, unless other characters can be used in einsum. In any case, the memory requirements for 16 qubits are probably going to be very taxing.~~ I am now using integers as the einsum indices, so this is no longer limited by the number of ASCII characters. It may be possible to obtain a speedup with numba as well, but I still have to investigate.
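For illustration, a minimal sketch of the integer-index construction (hypothetical function name, not the actual code in this PR) could look like this:

```python
import numpy as np

# single-qubit Pauli matrices I, X, Y, Z stacked into shape (4, 2, 2)
paulis = np.array([
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[0, -1j], [1j, 0]],
    [[1, 0], [0, -1]],
])

def pauli_basis_einsum(nqubits):
    # assign each qubit an integer (element, row, column) index triple
    operands = []
    for q in range(nqubits):
        operands += [paulis, [3 * q, 3 * q + 1, 3 * q + 2]]
    output = (
        [3 * q for q in range(nqubits)]        # element indices
        + [3 * q + 1 for q in range(nqubits)]  # row indices
        + [3 * q + 2 for q in range(nqubits)]  # column indices
    )
    # the contraction is a loop-free outer product over all qubits
    tensor = np.einsum(*operands, output)
    dim = 2**nqubits
    # merging the per-qubit axes reproduces the Kronecker products
    return tensor.reshape(4**nqubits, dim, dim)

# index 5 = 1 * 4 + 1 corresponds to X on both qubits
assert np.allclose(pauli_basis_einsum(2)[5], np.kron(paulis[1], paulis[1]))
```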

EDIT 1: unfortunately `einsum` is not supported by numba; however, if you jit the old implementation:

            # assumes: from itertools import product
            #          from numba import njit, prange
            basis_single = list(product(basis_single, repeat=nqubits))

            @njit(fastmath=True, parallel=True, cache=True)
            def _build_basis(basis_single):
                # preallocate the output rows (numba-friendly list)
                rows = [basis_single[0][0] for _ in range(len(basis_single))]
                for i in prange(len(basis_single)):
                    row = basis_single[i][0]
                    for j in range(len(basis_single[i][1:])):
                        row = np.kron(row, basis_single[i][j])
                    rows[i] = row
                return rows

            basis_full = backend.cast(_build_basis(basis_single))

you are able to get a nice speedup, ~5s for 7 qubits; however, this only happens from the second call onwards, since the first call still pays for the compilation. Unfortunately, this implementation was yielding wrong results, thus I had to roll back to the einsum approach. Further investigation is needed to understand whether it's possible to parallelize and improve this with numba.
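For reference, in the snippet above `j` starts at 0, so `basis_single[i][0]` is kron'ed twice while the last factor is skipped; that is presumably the source of the wrong results. The intended inner loop would likely be:

```python
for j in range(1, len(basis_single[i])):
    row = np.kron(row, basis_single[i][j])
```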

EDIT 2: At some point I realized that a possible bottleneck, or rather inefficiency, was due to `qibo.quantum_info.superoperator_transformation.vectorization`, which could only be run on a single input, either a state vector or a density matrix, thus forcing loops over each element of the basis (which can grow large very quickly). The impact on the runtime is still marginal for the CPU, as for 7 qubits its contribution was around 1-2s out of 10s, but for GPUs this starts becoming relevant. Furthermore, `vectorization` appears to be widely used in the `quantum_info` module, which convinced me to generalize it to accept batches of state vectors or density matrices, lifting the need for explicit loops and instead leveraging tensor primitives directly. This has been applied to `pauli_basis` for now, but it has to be propagated throughout the whole module.
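As a rough illustration of the batched idea (hypothetical signature, NumPy only; not necessarily the actual qibo implementation), row-order vectorization of a batch reduces to a single reshape:

```python
import numpy as np

def vectorization(operators, order="row"):
    # accepts a single (d, d) matrix or a (batch, d, d) stack of matrices
    operators = np.asarray(operators)
    single = operators.ndim == 2
    if single:
        operators = operators[None, ...]
    batch, d, _ = operators.shape
    if order == "row":
        vectors = operators.reshape(batch, d * d)
    else:  # column-major ordering
        vectors = operators.transpose(0, 2, 1).reshape(batch, d * d)
    return vectors[0] if single else vectors
```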


codecov[bot] commented 1 week ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.94%. Comparing base (68bdb98) to head (d16f5e3).

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           master    #1459    +/-   ##
========================================
  Coverage   99.94%   99.94%
========================================
  Files          81       81
  Lines       11740    11748      +8
========================================
+ Hits        11733    11741      +8
  Misses          7        7
```

| [Flag](https://app.codecov.io/gh/qiboteam/qibo/pull/1459/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=qiboteam) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/qiboteam/qibo/pull/1459/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=qiboteam) | `99.94% <100.00%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=qiboteam#carryforward-flags-in-the-pull-request-comment) to find out more.


alecandido commented 1 week ago

@BrunoLiegiBastonLiegi you don't have to use ASCII characters for einsum:

> einsum also provides an alternative way to provide the subscripts and operands as `einsum(op0, sublist0, op1, sublist1, ..., [sublistout])`. If the output shape is not provided in this format einsum will be calculated in implicit mode, otherwise it will be performed explicitly. The examples below have corresponding einsum calls with the two parameter methods.

https://numpy.org/doc/stable/reference/generated/numpy.einsum.html#numpy-einsum

In this way, you can use integers instead of characters, which is much better if you are generating the einsum input programmatically (and you're not limited by the number of available characters).
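For instance, a two-qubit Kronecker product can be written with integer sublists (purely illustrative):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

# integer sublists: axes 0, 1 belong to X and 2, 3 to Z; the output keeps
# all four axes, which are then merged into the Kronecker product
kron = np.einsum(X, [0, 1], Z, [2, 3], [0, 2, 1, 3]).reshape(4, 4)
assert np.allclose(kron, np.kron(X, Z))
```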

(This is not a suggestion to ban characters from einsum in every situation: they are handy when there are only a few and you're writing them manually, since that's much more readable. But the case in which you spell out a small operation by hand and the one in which you generate a complex one programmatically have to be handled in different ways, and luckily NumPy offers both.)

renatomello commented 5 days ago

One way we could move forward is to return generators for n >= 8 and let the user do their own memory management by accessing the elements on the fly.

BrunoLiegiBastonLiegi commented 5 days ago

> One way we could move forward is to return generators for n >= 8 and let the user do their own memory management by accessing the elements on the fly.

mmh yeah, that's a possibility, but only in certain cases. For the sake of parallelization, having everything represented by a single tensor contraction is surely better, but you are limited by memory. Maybe what we could do is to define the generator of a single element of the basis for each backend:

```python
from functools import cache

@cache
def _pauli_basis_element(self, i, nqubits):
    # build the i-th basis element according to the backend
    ...
```

and then you can build the complete basis as a generator, as you suggested:

```python
def pauli_basis(nqubits, backend):
    # the Pauli basis has 4**nqubits elements
    return (backend._pauli_basis_element(i, nqubits) for i in range(4**nqubits))
```

but GPU-wise this will be less efficient, and in any case it only works as long as you don't need the complete basis at the same time.
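As a toy, backend-free version of the same idea (illustrative only), a plain generator keeps at most one 2**nqubits x 2**nqubits matrix alive at a time:

```python
from itertools import product

import numpy as np

paulis = [
    np.eye(2),
    np.array([[0, 1], [1, 0]]),
    np.array([[0, -1j], [1j, 0]]),
    np.array([[1, 0], [0, -1]]),
]

def pauli_basis_lazy(nqubits):
    # yields the 4**nqubits basis elements one by one
    for combo in product(paulis, repeat=nqubits):
        element = combo[0]
        for factor in combo[1:]:
            element = np.kron(element, factor)
        yield element

# consume lazily: never holds the full (4**n, 2**n, 2**n) tensor in memory
norms = [np.trace(p @ p.conj().T).real for p in pauli_basis_lazy(3)]
```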

renatomello commented 4 days ago

> > One way we could move forward is to return generators for n >= 8 and let the user do their own memory management by accessing the elements on the fly.
>
> mmh yeah, that's a possibility, but only in certain cases. For the sake of parallelization, having everything represented by a single tensor contraction is surely better, but you are limited by memory. Maybe what we could do is to define the generator of a single element of the basis for each backend:
>
> ```python
> from functools import cache
>
> @cache
> def _pauli_basis_element(self, i, nqubits):
>     # build the i-th basis element according to the backend
>     ...
> ```
>
> and then you can build the complete basis as a generator, as you suggested:
>
> ```python
> def pauli_basis(nqubits, backend):
>     # the Pauli basis has 4**nqubits elements
>     return (backend._pauli_basis_element(i, nqubits) for i in range(4**nqubits))
> ```
>
> but GPU-wise this will be less efficient, and in any case it only works as long as you don't need the complete basis at the same time.

The Hilbert space is too big; unfortunately, there's no way around that. I wouldn't spend much time on this, since there are bigger priorities in terms of optimization.

BrunoLiegiBastonLiegi commented 4 days ago

In the end I was not able to make the parallelized numba implementation work; it ran with `parallel=False`, but there was no speedup in that case. Thus I just rolled back to using einsum in every case. We are still getting nice scaling on GPUs. I will get rid of the ASCII characters now.

BrunoLiegiBastonLiegi commented 3 days ago

Ok, I ended up also adding a generalization of `qibo.quantum_info.superoperator_transformation.vectorization`, which now accepts batches of state vectors or density matrices as well, thus allowing the whole basis to be vectorized in one shot rather than in a loop. In the process, I found a weird behaviour in TensorFlow. Namely, if you have a complex tensor, the elements with zero real part but nonzero imaginary part are not found by `nonzero`:

```python
import numpy as np
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

a = tf.Variable([0 + 1j, 1 + 0j, 1 + 1j])
tnp.nonzero(a)
# this finds the last two only
# [<tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>]
np.nonzero(a)
# numpy gives the correct result
# (array([0, 1, 2]),)
```
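A possible workaround (assuming the issue is limited to how the complex values are tested) is to take the modulus first:

```python
# test the modulus instead of the raw complex values
tnp.nonzero(tf.abs(a))
# expected: [<tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>]
```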