Add support to `cirq.unitary(val)` to compute a reduced unitary when `val`s decomposition can allocate new qubits

Is your feature request related to a use case or problem? Please describe. As part of providing tools for allocating / deallocating qubits in https://github.com/quantumlib/Cirq/issues/6040, we want to support gates which can allocate ancilla qubits (and clean them up) as part of their _decompose_ method (eg: related https://github.com/quantumlib/Cirq/issues/6081).

Right now, cirq.unitary(val) would fail when called on an operation / gate which allocates new qubits. We should fix this.

As part of https://github.com/quantumlib/Cirq/issues/6040, we always assume that any dirty / clean ancilla qubits allocated as part of the _decompose_ method will be correctly cleaned up within the _decompose_ method so that no new dirty qubits become part of the system outside of the OP_TREE yielded by _decompose_. Using this assumption, we can extend the cirq.unitary(val) implementation such that if val does not define a _unitary_ protocol, we can use it's decomposition to compute a reduced unitary that acts only on val.qubits, assuming the op-tree yielded by cirq.decompose(val) is unitary.

Describe the solution you'd like

A simple implementation showing such an extension to cirq.unitary is given as follows:

def compute_unitary(op: cirq.Operation):
    """Computes the reduced unitary, when the decomposition of op can allocate new ancillas."""
    qubit_order = cirq.QubitOrder.explicit(op.qubits, fallback=cirq.QubitOrder.DEFAULT)
    circuit = cirq.Circuit(cirq.decompose(op))
    nq, nall = len(op.qubits), len(circuit.all_qubits())
    assert nall <= 13, "Too many qubits to compute the reduced unitary for."
    u = circuit.unitary(qubit_order=qubit_order).reshape((2,) * 2 * nall)
    new_order = [*range(nq, nall), *range(nall + nq, 2 * nall), *range(nq), *range(nall, nall + nq)]
    u = np.transpose(u, axes=new_order)
    u = u[(0, 0) * (nall - nq)]
    return u.reshape(2**nq, 2**nq)

class GateWithDecompose(cirq.Gate):
    def _num_qubits_(self):
        return 1
    def _decompose_(self, q):
        anc = cirq.NamedQubit("anc")
        yield cirq.CX(*q, anc)
        yield (cirq.Z)(anc)
        yield cirq.CX(*q, anc)

op = GateWithDecompose().on(*cirq.LineQubit.range(1))
print(compute_unitary(op))

and the output, as one would expect, is

[[ 1.+0.j  0.+0.j]
 [ 0.+0.j -1.+0.j]]

Note that this would work when GateWithDecompose allocates both clean and dirty qubits because for a) for clean qubits; we are interested only in the 000... 0 subspace (i.e. where all input/output ancillas are zeros) and for dirty qubits picking any combination of suspace is valid (because the input dirty ancillas can be in any state and the decomposition promises to leave them back in the same state).

This sample implementation works assuming that the decomposition is correct and performs no validation to verify that the decomposition indeed cleans up the newly allocated clean / dirty qubits. We can also consider extending the implementation above to perform a correctness check and raise an error if decomposition is not valid -- this is relatively straightforward for borrowed dirty qubits but I think it'd be tricky for clean qubits.

What is the urgency from your perspective for this issue? Is it blocking important work?

P0 - this should land no later than a week

cc @daxfohl @NoureldinYosri

checking that the final $u$ array is indeed a unitary is necessary and sufficient for clean qubits (starting from $\ket{0 \cdots 0}$ subspace we always return to it with probability 1).

for borrowable qubits it's necessary but not sufficient since we also need to check that the unitary after transpose is a tensor product $I_{m} \otimes u$.

A few comments

Terminology

I recommend against the term "reduced unitary" since it's inconsistent with the use of the adjective "reduced" for quantum states. Reduced state of a ket is typically not a ket, but a density matrix. By analogy, a reduced unitary would not be a unitary, but a channel. However, in this case this would be misleading since we're interested in precisely the case where the resulting operation actually is a unitary. Narrow and wide might be better alternatives.

Decomposition

The way we currently describe decompositions using _decompose_ does not provide all the information about a decomposition that's needed for effective and fast implementation of this feature. For example, _decompose_ fails to provide information about whether or not it restores the auxiliary qubits to the initial state. Now, we could check this in _unitary_ but this makes this a recurring cost and is contrary to cirq philosophy wherein we define special-purpose asserts, such as assert_consistent_channel, which verify conditions in tests so they can be relied upon at runtime. Ideally, we'd define something like assert_restores_borrowed_qubits etc, so we could make assumptions safely elsewhere.

It would be neat if the tests could determine which decompositions are meant to restore the auxiliary qubits and which aren't so that the decision whether to invoke assert_restores_borrowed_qubits could be made by tests automatically. Perhaps via something like def restores_borrowed_qubits(self) -> bool? For example, CNOT, Z2, CNOT does not restore the second qubit to its original state, but CNOT, Z2, CNOT, Z2 does. Hence, we can borrow any qubit when using the latter, but we can only borrow qubits known to be in a computational basis state when using the former.

Validation

@NoureldinYosri I think your conditions are roughly correct, but what do you mean by "after transpose"?

Consider a circuit $C_{RA}$ acting on an $n$-qubit register $R$ and a $k$-qubit auxiliary register $A$. There are two cases: arbitrary state of the auxiliary qubits and a fixed state of the auxiliary qubits.

In the first case, $C_{RA}$ is guaranteed to effect unitary $UR$ on $R$ if and only if $$C{RA} = U_R\otimes V_A$$ where $VA$ is some unitary that $C{RA}$ effects on the auxiliary register $A$. If we want to ensure that $C_{RA}$ returns the auxiliary qubits $A$ to the original state without knowing that state, then we also have to check that $V_A$ is the identity.

In the second case, $C_{RA}$ may be an entangling unitary. It effects a unitary $U_R$ on $R$ if and only if, for some fixed states $|\psi_A\rangle$ and $|\psi_A'\rangle$, it sends every product state $|x_R\rangle|\psi_A\rangle$ to a product state $|x_R'\rangle|\psi_A'\rangle$. This happens if and only if $U_R=\langle\psiA|C{RA}|\psi'_A\rangle$ is unitary.

I think it is important that we check that $U_R$ is a unitary (using cirq.is_unitary) and if auxiliary qubits aren't in a fixed known state and we expect them to be preserved then also that $V_A$ is the identity. If the cost is prohibitive then we should do this in tests rather than at runtime.

Performance

The cost of multiplying two $m\times m$ matrices is $O(m^3)$. Therefore, the proposed implementation is quite expensive: $O(8^{n+k})$. In particular, it would dwarf the cost of calling cirq.is_unitary on $U_R$ to perform one of the checks above.

However, we can compute $U_R$ faster without calling Circuit._unitary_ (which may multiply $2^{n+k}\times 2^{n+k}$ unitaries of all the moments in the decomposition).

Notice that if $C_{RA}=U_R\otimes V_A$ then $U_A\equiv\langle\psiA|C{RA}|\psi_A\rangle$ up to irrelevant global phase for any pure state $|\psi_A\rangle$ of the auxiliary register. This leads to the following optimization: suppose _decompose_ returns only a single gate $G_{RA}$ that acts on both $R$ and $A$. Then we can set $|\psi_A\rangle:=|0_A\rangle$ and instead of multiplying matrices, we can compute $\langle 0A|G{RA}|0A\rangle$ (which is a $2^n\times 2^n$ matrix), by simply selecting the appropriate rows and columns from the matrix of $G{RA}$. Even if _decompose_ returns more operations, we can still perform a variant of this optimization. Suppose $G$ is the first operation returned by _decompose_. Then we can compute $G_{RA}|0A\rangle$ (which is a $2^{n+k}\times 2^n$ matrix) without matrix multiplication by selecting the appropriate columns of $G{RA}$. Similarly, if $H_{RA}$ is the last operation, we can compute $\langle 0A|H{RA}$ (which is a $2^n\times 2^{n+k}$ matrix) without matrix multiplication by selecting the appropriate rows. In fact, the optimization can be applied to even more operations. Namely, for every auxiliary qubit $q$ in $A$, we can apply this optimization to every operation which is first or last to act on $q$ and $R$.

I don't know how long typical relevant decompositions are, but if every auxiliary qubit is touched by one or two gates then the above idea enables us to completely avoid multiplying $2^{n+k}\times 2^{n+k}$ matrices.

(This optimization can also be described as a tensor network contraction.)

@viathor at the moment we only have 2 types of auxillary qubits clean and borrowable. if a _decompose_ uses a clean qubit it has to restore it to zero state and if it uses a borrowable state then it has to restore it to its original -unknown- state.

In particular we don't support allocating a clean qubit but leaving it in a dirty state or allocating a borrowable qubit and then leaving it in a state different from its initial state.

so we have to enforce assert_restores_borrowed_qubits and assert_restores_clean_qubits in tests. a gate that doesn't restore its ancillas to their initial state risks leaving them in an entagled state with gate qubits, so we can't make any assumptions about them or safely reuse them. Maybe we need a DiscardableQubit concept to cover qubits that will only be used by the given operation and discarded afterwards, never to be reused?

By transpose I mean reordering the matrix/qubits to the correct order.

Since the original state of a borrowable qubit is not known we need only to care about the $V_A = I$ case. If the decomposition really has $V_A \neq I$ for some reason then it's the same as allocating clean qubits preparing them in the desired basis and then clean them up as part of the decomposition so that what the unitary protocol would see is an indentity.

From what I have seen so far most auxiliary qubits are used in at least 3 operations. a preparation operation, a work operation and a clean up operation. I don't think we can get away with these optimizations though they will certainly help.

@tanujkhattar how many qubits do we need to support? The sample implemention you provide asserts that the total number of qubits is less than 13. Is this the intended limit on qubits for this case?

@viathor the $\mathcal{O}(8^{n+k})$ complexity is the current runtime of cirq.unitary except that decompose didn't allocate new qubits but are given all qubits beforehand. i think what we need is to use some heuristic so that we don't always end up in the worst case scenario. One heuristic could be that most ancillas interact with only a couple of other qubits so that $C_{RA}$ is potentially sparse.

@NoureldinYosri Your heuristic suggests that_decompose_ often fails to make all qubits interact, so that the composite system $RA$ splits into subsystems $E$ and $F$ such that $C_{RA}=U_E\otimes U_F$. If this is indeed the case, then we want to exploit this tensor product structure, not sparsity.

We could either modify _decompose_ API so that it provides the information about the partitioning $RA=EF$ or we could implement a utility that finds $E$ and $F$ and then call cirq.unitary on $E$ and $F$ separately. To write such a utility we treat qubits as nodes and two-qubit gates as edges in a graph and look for connected components, e.g. using Union-Find. In practice, we should support arbitrary number of connected components $E_1,\dots,E_m$, not just two $E$ and $F$.

@viathor not quite. what I mean is that the ancillas are functions of the input qubits. take for example the And gate that uses only 4 T gates (rather than 8) at the cost of one ancilla from https://algassert.com/post/1903. the ancilla will hold and value of the two input qubits so although the decomposition will have 3 qubits (so that there are 8 possible state) only 4 states will be reachable. That construction can be extended to compute the and of $n$ qubits using $4n$ T gates (rather than $8n$) using $n-1$ ancillas. in that case the decomposition will have $2n-1$ qubits so that there will be $2^{2n-1} = \frac{1}{2} 4^n$ leading to a unitary of size $\approx 16^n$, however the space of states will still have size only $2^n$ . This is what I mean by the unitary being sparse. because a lot of rows (e.g. the impossible states) will have only a few non zero values. As a graph each of the qubits will be connected to at most 3 other qubits, however there will be exactly one connected component. Generally it won't be possible to factor the unitary as a tensor product.

there are four questions attached to this issue

Can cirq.unitary support val that can allocate new qubits?
Can it support the situtation where some of the operations in the decomposition are not unitary but the overall effect on the work qubits (not the ancillas) is unitary (e.g. erasing entanglement using measurement)?
Is it efficient?
Do we have consistency and correctness checks?

the simple implementation provided in the issue description fullfills only the first question. however since this issue is blocking the release of the cirq-ft I think it makes sense to start with the basic support to unblock cirq-ft and then iteratively improve the implementation to answer the other questions.

@NoureldinYosri I'd like to explicitly call out an assumption that we make as part of #6112

When the apply_unitaries protocol attempts the _strat_apply_unitary_from_decompose(val, args); we always assume that any newly allocated ancilla is a "new" uninitialized qubit that starts in the zero state. This assumption is not always true; for example if you have a [g1(a, b), g2(a)] where g2 requests a new dirty ancilla as part of it's decompose and the qubit manager decides to give it qubit b; then in _strat_apply_unitary_from_decompose the all_qubits=[a, b] and ancilla=[b]. Technically, the args that are passed to apply_unitary() method would already contain space corresponding to qubit b but we'd still end up allocating new space for b because in this method; the global qubit map is not present and therefore there's no way for _strat_apply_unitary_from_decompose to know that apply_unitaries method had already allocated space for qubit a.

Now, this works fine for the apply_unitary protocol ONLY because we assume that any gate that allocates a new clean / dirty ancilla as part of _decompose_ will leave it in the correct state. As discussed earlier, we can add validations in follow-up PRs to assert that it's indeed the case but I think this pattern is important to watch out for when extending support at other places like simulators.

https://github.com/quantumlib/Cirq/blob/1ed879d85683c5db7d26ab6956609cde77be39a5/cirq-core/cirq/protocols/apply_unitary_protocol.py#L485

cc @senecameeks I think this will become more important for your simulators PR since simply doing a args.add_qubits(ancillas) and args.remove_qubits(ancillas) is not entirely correct given that ancillas can contain qubits that are already tracked by SimulationState. The assumption that ancillas will always only contain new qubits is not correct and the current implementation in #6108 seems to make that assumption.

quantumlib / Cirq