ncrubin commented 2 years ago

I put up a branch of the start of my subprepare to demonstrate where we need to improve the existing primitives.

SUBPREPARE involves a lot of different primitives where logic is provided by the user. A question about how to allocate the ancilla needed for each step needs to be worked out.

For example, the oracle has 4 main components: 1) Uniform_L, alt_keep_qrom, LessThanEqual, and coherent swap. Generically, the input and output registers size is 2 * logL + 2 mu + 1. But each primitive uses some number of ancilla qubits. Uniform uses 1 ancilla, QROM should use len(select_register) qubits (for the AND), and the LessthanEqual takes some number of ancilla depending on implementation. A naive estimate on maximum number of qubits to take is just the sum of all those ancilla plus the original registers. This will likely blow up the total number of qubits.

For testing this circuit the current settings use mu == 4 select register size =3 and thus the total input/output register is 15 qubits. But counting ancilla I won't be able to even simulate the output of this small example.

tanujkhattar commented 2 years ago

I don't see the sub prepare implementation in https://github.com/ncrubin/cirq-qubitization/tree/subprep; can you point me to the file?

ncrubin commented 2 years ago

My bad, forgot to commit the change

https://github.com/ncrubin/cirq-qubitization/blob/subprep/cirq_qubitization/subprepare.py

tanujkhattar commented 1 year ago

https://github.com/ncrubin/cirq-qubitization/pull/24 adds an implementation of subprepare that doesn't need a qubit manager. I've updated the title to reflec that we'd eventially need a qubit manager which can allocate / deallocate clean and dirty qubits.

For now, I'm curious to see how far we can get without the need of a qubit manager.

NoureldinYosri commented 1 year ago

@tanujkhattar @ncrubin how do you think the usage of the qubit manager will be?

there is I think 2 ways

a pythonic way
```
qm = cirq.QubitManager()
```

with qm.Qubits(3) as ancilla: # this returns 3 new qubits and recovers them at the end of the context

some code

with qm.Qubits(4) as ancilla2: # 4 clean qubits in a context within context

...

2. a C/C++ like way

qm = cirq.QubitManager() ancilla = qm.New(3) .... qm.Return(ancilla)


3. the two ways together similar to C#, but this opens a road for memory leaks because python is different from C#

An alternative approach is instead of a qubit manager a qubit estimator/optimizer protocol is created (e.g. `protocols.QubitOptimizer(circuit)`) which does the neccassry work to optimize the number of qubits used (e.g. returns a new circuit that minimizes the number of qubits), minimizing the number of qubits will result in a circuit of [maximum depth](https://github.com/microsoft/qsharp-runtime/blob/main/src/Simulation/Simulators/QCTraceSimulator/Docs/Width%20and%20Depth%20in%20the%20Tracer.docx) so the protocol can also take a input `minimum_depth` which  tells the optimizer to create a circuit of at least this `minimum_depth` (though the optimizer should ignore it if it's larger than the maximum possible depth of the input circuit)

the protocol approach will probably be useful in implementing #32

tanujkhattar commented 1 year ago

Way-2 looks better to me. The nested contexts would get very difficult to manage as the number of times we want to allocate qubits increase.

Also, I would consider just having method calls: qallocate, qfree, to allocate and free qubits instead of using a qubit manager class, unless we believe users can implement different allocation strategies which would be helpful.

cc @mpharrigan What do you think?

NoureldinYosri commented 1 year ago

@tanujkhattar simple qallocate, qfree methods mean that we effectively will be using global variables to track what qubit is used by whom

tanujkhattar commented 1 year ago

We can have an internal class that does the tracking and qallocate / qfree methods would dispatch the function calls to the singleton object of that class. The API would be closer to malloc / free in c/c++.

tanujkhattar commented 1 year ago

An alternative approach is instead of a qubit manager a qubit estimator/optimizer protocol is created (e.g. protocols.QubitOptimizer(circuit)) which does the neccassry work to optimize the number of qubits used (e.g. returns a new circuit that minimizes the number of qubits), minimizing the number of qubits will result in a circuit of maximum depth so the protocol can also take a input minimum_depth which tells the optimizer to create a circuit of at least this minimum_depth (though the optimizer should ignore it if it's larger than the maximum possible depth of the input circuit)

Regarding this approach, there are two problems: 1) Users would still need to construct the circuit before giving it to an optimizer. As the number of qubits required grows, circuit construction is going to be non-trivial as users would need to keep track of allocated qubits themselves. 2) From the Q# document, it seems to me that the optimizer would basically have two modes: One where the qubit manager has EnableQubitReuse=True, which optimizers for width but not depth. And other where EnableQubitReuse=False, which optimizes for depth and not width. Note that we don't expect to dynamically change operations / decompositions of operations in the Circuit to optimize one or the other (eg: use more ancillas to reduce depth and use less / no ancillas to optimize for width).

My hunch is that this approach, of just toggling the EnableQubitReuse=False flag on qubit manager, would probably not be very efficient for optimizing for depth because most of the times, the ancillas qubits are the only ones which are "used" and "freed" after an operation. And most operations operate on both system qubits + ancillas. The operations on system qubits would still induce the dependencies that lead to similar depths. Toggline the flag would to disallow qubit reuse would mostly just end up using new ancillas for each operation, which doesn't sound like a very good strategy to me.

We can consider making QubitManager a class that supports both the options, and then benchmarking on real circuits to see the results. But my intuition is that most of the times we'd want to use EnableQubitReuse=True by default.

mpharrigan commented 1 year ago

Point of order: qalloc/qfree is a C paradigm whereas context manager is identical to C++'s RAII (namely, smart pointers)

An important use case is static analysis that qubits are being deallocated correctly. Context managers is how Q# does it and is pretty straightforward and should be seriously considered. It's annoying because it doesn't really solve the problem:

class ExampleGate:
  def decompose(control, target):
    ancilla = qalloc()
    yield cnot(control, ancilla)
    yield cnot(ancilla, target)
    # note: make sure you do the adjoint of this later to clean up ancilla

would need to be rewritten as a wrapper gate that does ExampleGate, some parameteric inner body, and ExampleGate.inverse() within an allocation block and then ExampleGate will once again need to take an ancilla register as input.

I have a sketchy idea in my brain where allocations become nodes in the compute graph (like operations are now). The static analyzer can have rules:

unitary operations have equal numbers of input and output wires
you can declare a register/wire as a "control" wire and the static analyzer (quantum type checker) can auto-test that they're not modified
allocation operations have output wires only and must be paired with a deallocation operation somewhere else in the compute graph

https://docs.google.com/document/d/1OBH2YiScpJzoSaExpu3LenfzU8r4a79Bvs0gx7tYgwg/

A lot of details to be worked out, however

mpharrigan commented 1 year ago

How will qubit allocation and de-allocation be displayed in circuit diagrams? The ability to diagram our gates has been very helpful; but is very strongly coupled to Cirq's lack of qubit allocation and deallocation

mpharrigan commented 1 year ago

Consider the AND gate. Right now, we hope that the target will be in the 0 state. It would be nicer for the AND gate to take its inputs (i.e. the control wires) and allocate the target itself and then return it. decompose doesn't have any ability to return these allocated qubits.

mpharrigan commented 1 year ago

60 is an initial attempt at this, but I'd like to see a bit of a description of how it will work in context. Some questions:

how will decompose methods return the allocated qubits for use in further gates?
how will a resource counter track the circuit-width-over-time
how could the structure conceivably be integrated into checks that qubits are cleaned up correctly?

NoureldinYosri commented 1 year ago

how about this https://arxiv.org/abs/2210.03680 ?, the section clause is easy to implement, the parallel_sections clause will be tricky to implement but doable, and I think the parallel_for will be very similar to parallel_sections.

I can think of two ways to implement this, one with contexts and another by creating a CircuitBuilder, but for the context implementation we will need another clause that indicates a program.

from contextlib import contextmanager
import cirq

free_ancillas = []
used_ancillas = []
number_ancillas = []

@contextmanager
def program(name: str=''):
    global free_ancillas, used_ancillas, number_ancillas
    try:
        free_ancillas.append([])
        used_ancillas.append([])
        number_ancillas.append(0)
        yield 
    finally:
        free_ancillas.pop()
        used_ancillas.pop()

@contextmanager
def section():
    global used_ancillas
    try:
        used_ancillas.append([])
        yield
    finally:
        free_ancillas[-1].extend(used_ancillas.pop())  # free qubits

def qalloc():
    if len(free_ancillas[-1]) == 0:
        idx = number_ancillas[-1]
        number_ancillas[-1] += 1
        free_ancillas[-1].append(cirq.NamedQubit(f'ancilla{idx}'))
    q = free_ancillas[-1].pop()
    used_ancillas[-1].append(q)
    return q

ops = []
with program(name='sample'):
    with section():
        a, b, c = cirq.LineQid.range(3, dimension=2)  # algorithm qubits can be allocated in the normal way
        ops.append(cirq.H(a))
        ops.append(cirq.H(b))

    with section():
        ancilla = qalloc()  # create ancilla, could be inside a decompose or constructor of some object.
        ops.append(cirq.CNOT(a, ancilla))
        ops.append(cirq.CNOT(ancilla, c))

    with section():
        ancilla = qalloc()  # same ancilla will be reused
        ops.append(cirq.CNOT(b, ancilla))
        ops.append(cirq.CNOT(ancilla, c))

c = cirq.Circuit(ops)
print(c)

#  In _test.py
with program(name='test'):
    """create expected object in unit tests"""

the printted circuit is

0 (d=2): ────H───@───────────────
                 │
1 (d=2): ────H───┼───────@───────
                 │       │
2 (d=2): ────────┼───X───┼───X───
                 │   │   │   │
ancilla0: ───────X───@───X───@───

Pros:

automatic management of ancillas
easy to use
compatible with cirq

Cons:

we don't do a check on the qubits being deallocated to make sure they are clean or unentangled
- we can do it like Q# and add the check as part of the simulator.
- we can also add a boolean that allows the user to specify that they don't want this/these qubit(s) to be reused.
The context implementation uses global variables. if we don't want global variables we will need to create a builder class which we will need to pass to every function that allocats qubits.
- A middle ground between these two ideas exist with the program context creating a builder which will be the only global object as follows:

@contextmanager
def program(name: str=''):
    global builder
    old_builder = builder
    try:
        builder = Builder()       
        yield builder
    finally:
        del builder  # free memory used.
        builder = old_builder

def qalloc():
    global builder
    free_ancillas = builder.free_ancillas
    if len(free_ancillas[-1]) == 0:
        idx = number_ancillas[-1]
        number_ancillas[-1] += 1
        free_ancillas[-1].append(cirq.NamedQubit(f'ancilla{idx}'))
    q = free_ancillas[-1].pop()
    used_ancillas[-1].append(q)
    return q

@tanujkhattar @mpharrigan

mpharrigan commented 1 year ago

I guess I should actually read that qparallel paper instead of just reading the abstract and sending it off into the chat.

I don't understand a-priori what the context managers buy you here

mpharrigan commented 1 year ago

ok I get it now. I guess I wasn't considering how pathological a naive qubit manager could be 😂

in any event: it seems weird to have to add new "syntax". I would have assumed each Gate defines a parallel_section scope and each sub-operation is an implicit section

mpharrigan commented 1 year ago

In their example:

class CRx(Gate):
  def decompose(a, b):
    c = alloc()
    yield from [swap, rz, swap]
    dealloc(c)

class MultiCRx(Gate):
  def decompose(bunch_o_qubits):
    for a, b, in zip(bunch[::2], bunch[1::2]):
      yield CRx().on(a,b)

circuit = cirq.Circuit(MultiCRx())

how would the qubit-counter protocol work? Would it unroll everything into a circuit and then look at the number of qubits? no, right? It would go moment-by-moment and query the MultiCRx for its max-width. Which (in my naive world) would be the sum of all its sub-operations max-width.

I don't know... I don't have a good mental model for how the proposed qubit allocator would work with the qubit counter.

NoureldinYosri commented 1 year ago

having sections is different but related to having moments, the section concept signals to the qubit manager (whatever that maybe) that it can't reuse ancillas within a given section, at the end when the circuit is constructed the sections will disappear (the circuit will be a normal cirq circuit) but their effect on ancilla allocation will remain.

In my opinion, there are two things at play here

regardless of how one builds the qubit manager they will always need a way to signal to it
1. how many ancillas to allocate and when to deallocate them, preferably in a user friendly way.
2. when it can or can't reuse ancillas
3. whether the ancillas at the end will be clean or dirty
Cirq is built with objects at its heart, while it is not strictly object oriented per se, almost every concept (apart from the protocols) is represented by an object, on the other hand Q# and by the looks of it (though I can't definitely say) are imperative (i.e. C-like) languages, borrowing ideas written for Q# may lead to a paradigm as well as a style shift.

Having said that I think using contexts is the simplest way to do this signalling + it's pythonic

As for your example the issue is that decompose expects the qubits to be already allocated. I don't see an easy way to reuse qubits after allocation unless we make the simulator (or a seprate new component) works in a similar way to Q# tracer which can either optimize for width or depth.

optimizing for width is easy, we just need to allocate the minimum number of ancillas we need $\geq$ max(#ancillas required by any operation) , with equality if all operations leave their ancillas clean.

optimizing for depth is done by Q# using a greedy approximation algorithm as explained here, however I think this problem can be solved exactly since this problem boiles down to computing the minimum edge disjoint path cover of a certain directed graph, this can be solved in $\mathcal{O}(V^{2.5})$ where $V$ is the number of ancillas in the original circuit.

I was thinking along those lines several months ago, but after a discussion with @tanujkhattar changed my direction since we decided to try to build something that gives users more control/flexibility.

mpharrigan commented 1 year ago

As for your example the issue is that decompose expects the qubits to be already allocated.

No, I use alloc() and dealloc() in CRx definition. This example is taken directly from the qparallel paper.

I agree that you need to communicate reusability blocks to the qubit manager. Q# seems to default to aggresive reuse. I'm suggesting exploring: no re-use by default within a given decompose() definition. Then instead of having a bespoke type of "parallel section" we can use an existing mechanism for grouping sub operations: namely, defining a gate+decompose().

I also think we need to have a very clear understanding of how we'll do resource counting without fleshing out the fully-decomposed circuit. If we have thousands of qubits (or more!) it would be rough to have to create a 1000+-qubit high-depth cirq.Circuit() to get accurate qubit counts.

NoureldinYosri commented 1 year ago

forget this

~~Having said that I think using contexts is the simplest way to do this signalling + it's pythonic~~

how about we represent the section idea with a class which will be exactly the same as a moment except that it can be nested (section within section), this way the _decompose_ of a section will have the necessary information for qubit reuse.

Without agreeing on a way to do qubit allocation/reuse we can't really know how to count them

mpharrigan commented 1 year ago

Without agreeing on a way to do qubit allocation/reuse we can't really know how to count them

This is true! It partially goes the other way as well: without knowing how we want to count them, we can't design a qubit allocator. That's why I'd like to have a sketch of how both will work together.

class which will be exactly the same as a moment except that it can be nested

We already have nestable groups in Cirq and I'm wary of adding more. CircuitOperation and Gate.decompose() already allow nesting

tanujkhattar commented 1 year ago

One of the main problems why simply having a "parallel_sections" concept (or any equivalent concept using cirq abstractions) is not enough is because during actual circuit construction, we'd often need to switch between a "parallel section" and a "sequential section" with the following properties: 1) A parallel section corresponds to a single cirq moment. Therefore, no qubits should be shared across the top-level operations within a parallel section. 2) A sequential section represents a sequence of cirq moments. In this case, a qubit manager should try to maximize reuse of qubits across operations within the same sequential section.

Note that the parallel and sequential sections can be nested within each other, thus defining an implicit tree. Every time you enter the decompose of an operation, you begin a new "sequential section". Every time you want to yield a cirq.Moment() of operations, you begin a new "parallel section".

The nested sections would define a tree and we need to keep some careful book keeping to ensure the constraints of parallel / sequential sections.

One good example that demonstrates this complexity is as follows, written using the context manager implementation given by @NoureldinYosri above.

def cccnot(c0, c1, c2, t) -> cirq.OP_TREE:
    # Perform a CCCNOT using a single ancilla and 3 CCNOTs
    anc = qalloc()
    yield cirq.CCNOT(c0, c1, anc)
    yield cirq.CCNOT(anc, c2, t)
    yield cirq.CCNOT(c0, c1, anc)

def ccswap(c0, c1, t0, t1) -> cirq.OP_TREE:
    # This should be a sequential section since the ancilla qubit can be reused among the 3 CCCNOTs.
    yield from cccnot(c0, c1, t0, t1)
    yield from cccnot(c0, c1, t1, t0)
    yield from cccnot(c0, c1, t0, t1)

c = cirq.Circuit()
with program(name='sample'):
    with section():
        # This should be a parallel section since the two operation act on disjoint qubits.
        q = cirq.LineQubit.range(8)
        c.append([cccnot(*q[:3], q[3])])
        c.append([cccnot(*q[4:7], q[7])])
# Circuit is as expected if we use CCCNOTs - the two allocated qubits are different.
print(c)

c = cirq.Circuit()
with program(name='sample'):
    with section():
        # This should be a parallel section since the two operation act on disjoint qubits.
        q = cirq.LineQubit.range(8)
        c.append([ccswap(*q[:3], q[3])])
        c.append([ccswap(*q[4:7], q[7])])
# We now end up allocating 6 qubits instead of 2 because of the naive "never reuse" strategy enforced recursively on subcircuits.
print(c)

The output is as follows:

             ┌──┐   ┌──┐   ┌──┐
0: ───────────@─────────────@─────
              │             │
1: ───────────@─────────────@─────
              │             │
2: ───────────┼──────@──────┼─────
              │      │      │
3: ───────────┼──────X──────┼─────
              │      │      │
4: ───────────┼@─────┼──────┼@────
              ││     │      ││
5: ───────────┼@─────┼──────┼@────
              ││     │      ││
6: ───────────┼┼─────┼@─────┼┼────
              ││     ││     ││
7: ───────────┼┼─────┼X─────┼┼────
              ││     ││     ││
ancilla0: ────X┼─────@┼─────X┼────
               │      │      │
ancilla1: ─────X──────@──────X────
             └──┘   └──┘   └──┘
             ┌──┐   ┌──┐   ┌──┐   ┌──┐   ┌──┐   ┌──┐   ┌──┐   ┌──┐   ┌──┐
0: ───────────@─────────────@──────@─────────────@──────@─────────────@─────
              │             │      │             │      │             │
1: ───────────@─────────────@──────@─────────────@──────@─────────────@─────
              │             │      │             │      │             │
2: ───────────┼──────@──────┼──────┼──────X──────┼──────┼──────@──────┼─────
              │      │      │      │      │      │      │      │      │
3: ───────────┼──────X──────┼──────┼──────@──────┼──────┼──────X──────┼─────
              │      │      │      │      │      │      │      │      │
4: ───────────┼@─────┼──────┼@─────┼@─────┼──────┼@─────┼@─────┼──────┼@────
              ││     │      ││     ││     │      ││     ││     │      ││
5: ───────────┼@─────┼──────┼@─────┼@─────┼──────┼@─────┼@─────┼──────┼@────
              ││     │      ││     ││     │      ││     ││     │      ││
6: ───────────┼┼─────┼@─────┼┼─────┼┼─────┼X─────┼┼─────┼┼─────┼@─────┼┼────
              ││     ││     ││     ││     ││     ││     ││     ││     ││
7: ───────────┼┼─────┼X─────┼┼─────┼┼─────┼@─────┼┼─────┼┼─────┼X─────┼┼────
              ││     ││     ││     ││     ││     ││     ││     ││     ││
ancilla0: ────X┼─────@┼─────X┼─────┼┼─────┼┼─────┼┼─────┼┼─────┼┼─────┼┼────
               │      │      │     ││     ││     ││     ││     ││     ││
ancilla1: ─────┼──────┼──────┼─────X┼─────@┼─────X┼─────┼┼─────┼┼─────┼┼────
               │      │      │      │      │      │     ││     ││     ││
ancilla2: ─────┼──────┼──────┼──────┼──────┼──────┼─────X┼─────@┼─────X┼────
               │      │      │      │      │      │      │      │      │
ancilla3: ─────X──────@──────X──────┼──────┼──────┼──────┼──────┼──────┼────
                                    │      │      │      │      │      │
ancilla4: ──────────────────────────X──────@──────X──────┼──────┼──────┼────
                                                         │      │      │
ancilla5: ───────────────────────────────────────────────X──────@──────X────
             └──┘   └──┘   └──┘   └──┘   └──┘   └──┘   └──┘   └──┘   └──┘

tanujkhattar commented 1 year ago

The more I think about this, the more convinced I become that having a qubit manager for constructing Cirq circuits is not the right thing to do. My reasons are as follows: 1) In Cirq, the way to add a dependency between operations in the underlying quantum compute graph is by applying the gates on qubits and inserting the operations in a circuit. A dependency is added between two operations if they share the same qubit. 2) We can categorize the dependencies between operations as explicit or implicit - an explicit dependency is one specified by the user when constructing the compute graph. These dependencies are informed only by the qubit registers which represent input / output to the algorithm. Any additional temporary workspace that needs to be allocated as part of executing the algorithm is ignored when thinking about the explicit dependencies. 3) Implicit dependencies, on the other hand, are dependencies that are introduced when routing a logical compute graph on a specific set of qubits, thus making concrete choices about which qubits should be allocated for the temporary workspace. 4) Making decisions about the qubit allocations for implicit dependencies becomes much easier if we have the entire logical compute graph to inspect. By adding a qubit manager to cirq and using it at the time of circuit construction, we are instead forced to make this decision while computing the logical compute graph. This makes it much harder to provide a flexible framework which can deal with the different tradeoffs.

My proposal going forward would be to: 1) Use Bloq's to represent the logical compute graph and add a compilation layer where we can write a tool to inspect the entire logical compute graph and use a qubit manager to make qubit allocations -- this would be much simpler to write IMO. 2) If we want to support this in Cirq directly, we can do a two step process where: a) Add support for a naive qalloc / qfree which always allocates a new qubit and has no reuse -- this ensures that no new implicit dependencies are added as part of logical circuit construction. b) Add a post processing step which inspects the circuit and does a smarter qubit allocation / deallocation.

@mpharrigan @NoureldinYosri What do you think?

mpharrigan commented 1 year ago

This is basically done

qualtran Bloq uses left- and right- registers to logically imply allocation and de-allocation
cirq interop uses cirq-ft's new allocation strategies when compiling to quantum assembly language

quantumlib / Qualtran

Add a qubit manager which can allocate / deallocate clean and dirty qubits. #14

some code

...

60 is an initial attempt at this, but I'd like to see a bit of a description of how it will work in context. Some questions: