Simulation performance improvements

While investigating why the Qulacs backend does not seem to use multiple threads in my code, I noticed that very little time is spent in the actual simulation.

For testing, I used the Quantum Fourier Transform. By running either just a QFT, or a QFT and then an inverse QFT, we can make the resulting wavefunction dense or sparse, which is important for testing the performance.

Code

```python import tequila as tq from math import pi import time SIZE = 20 def qft(size: int, inverse: bool = False) -> tq.QCircuit(): U = tq.QCircuit() for i in range(size): U += tq.gates.H(target=i) for j in range(size - i - 1): U += tq.gates.Phase(target=i + j + 1, angle=(-1 if inverse else 1) * pi / 2 ** (j + 2)) U += tq.gates.CRz(target=i, control=i + j + 1, angle=(-1 if inverse else 1) * pi / 2 ** (j + 1)) return U U = qft(SIZE) V = qft(SIZE) + qft(SIZE, inverse=True) start = time.time() wfn = tq.simulate(U, backend="qulacs") print(f"QFT time: {time.time() - start} s") start = time.time() wfn = tq.simulate(V, backend="qulacs") print(f"QFT + inverse QFT time: {time.time() - start} s") ```

Running this code on the devel branch takes ~160 seconds for the normal QFT and ~20 seconds for the combined QFT and inverse QFT on my machine. Note that the second one is much faster even though the simulated circuit is twice as large, this is because the result is a sparse wavefunction. When measuring the time of self.circuit.update_quantum_state(state) in do_simulate in simulator_qulacs.py (which I believe is the actual simulation), it's only ~0.17s / ~0.28s (QFT / QFT + inverse), so there is a lot of room for speedups. All the times here are simple measurements of single runs, and not proper benchmarks, but the differences are so large that it shouldn't matter if the numbers are inaccurate.

I made four commits which speed this up:

For a dense wavefunction, a lot of time is spent in the apply_keymap function in simulate. This doesn't seem necessary when all qubits are active, since then the mapping has no effect. So I added a check that skips the keymap entirely in that case, reducing the runtime to ~24s / ~20s.
The numpy.isclose call in from_array in qubit_wavefunction.py is expensive, likely because it checks for edge cases (infinity, nan) and because it can handle arrays. Assuming we simply want to skip values close to zero, this can be done much faster using a simple comparison. Also, we only need to create the index bitstring when we actually want to set the value, so moving it inside the if clause saves a lot of time for sparse wavefunctions. These two changes reduce the runtime to ~6s / ~0.7s.
Still in the from_array function, if the numbering of the array doesn't match the numbering of the wavefunction, a keymap is applied to adjust it. This should not be necessary because the initialize_bitstring function should already return the correct type of bitstring. However currently the returned type depends on the numbering_in parameter, changing it to the numbering_out parameter allows removing the keymap and reduces the time to ~3.7s / ~0.6s.
Reversing the bits in a bitstring is currently done by converting the integer to a binary string, reversing the characters, and converting them back to an integer. This is inefficient and can be replaced with a function that uses bit operations. An interpretation of this code is that each line flips a bit in the position of each bit, so when all position bits are flipped, the position p is moved to 31 - p. This assumes the integer has exactly 32 bit, so if it has less, this adds zeros which we can get rid of with a bitshift. Another small optimization is to calculate the logarithm to get the bitstring length, instead of converting to a binary string and counting (the integer bit_length() method can't be used because some code passes Numpy ints). Together, this reduces the runtime to ~2.5s / ~0.6s.

In total, the speedup is ~60x / ~30x for the script above, but is likely much less for other code. I have not tested if these speedups transfer to other backends. Further improvements are definitely possible, but I think I have covered a good amount of the low hanging fruit.

tequilahub / tequila

Simulation performance improvements #359