quantumlib / Qualtran

Qᴜᴀʟᴛʀᴀɴ is a Python library for expressing and analyzing Fault Tolerant Quantum algorithms.
https://qualtran.readthedocs.io/en/latest/
Apache License 2.0
132 stars 35 forks source link

Computing qubit costs for QROM is very slow #1052

Open tanujkhattar opened 1 month ago

tanujkhattar commented 1 month ago

I need the following code to run in 3-4 seconds; not 10+ minutes on a Google colab instance.

%%time
from qualtran.resource_counting import get_cost_value, QubitCount

data = [*range(2**12)]
log_block_sizes = [*range(1, 9)]

qroam_clean = SelectSwapQROM.build_from_data(data, use_dirty_ancilla=False)
qroam_dirty = SelectSwapQROM.build_from_data(data, use_dirty_ancilla=True)

costs_cache = {}
qroam_t_costs_clean, qroam_t_costs_dirty = [], []
qroam_qubit_costs_clean, qroam_qubit_costs_dirty = [], []
for b in log_block_sizes:
  print(f'{b=}')
  qroam_clean_curr = qroam_clean.with_log_block_sizes(log_block_sizes=[b])
  qroam_dirty_curr = qroam_dirty.with_log_block_sizes(log_block_sizes=[b])
  qroam_t_costs_clean.append(qroam_clean_curr.t_complexity().t)
  qroam_t_costs_dirty.append(qroam_dirty_curr.t_complexity().t)
  qroam_qubit_costs_clean.append(get_cost_value(qroam_clean_curr, QubitCount(), costs_cache=costs_cache))
  qroam_qubit_costs_dirty.append(get_cost_value(qroam_dirty_curr, QubitCount(), costs_cache=costs_cache))

qrom_t_cost = QROM.build_from_data(data).t_complexity().t
qrom_qubit_cost = get_cost_value(QROM.build_from_data(data),  QubitCount(), costs_cache=costs_cache)

Hers is the timing from my current run -

image

I want to generate a plot like this

image
tanujkhattar commented 1 month ago

On current main

image

By making a series of trivial fixes to the cirq interop (like replacing CNOT().on(a, b) in the cirq-style decomposition with cirq.CNOT(a, b) (the former is not needed), adding cache to total_qubits and `extract_bloq_from_cirq1), we get

image

The cirq interop is still contributing about 50% to the total time, so if we rewrite the bloq natively without any cirq interop, we should be able to get another 2x speedup.

However, it is still going to be fairly slow (a few minutes at least for 2**12 example) for any reasonable system size; and this is not due to recursive implementation of QROM, but rather due to the fact that the size of the composite graph (product of gates x width) becomes large quickly.

Ideally, for common cost metrics like QubitCount, we should try to avoid having to decompose the entire bloq at all (xref https://github.com/quantumlib/Qualtran/issues/957)

mpharrigan commented 1 month ago

Ideally, we'd be able to do one level of decomposition of each bloq with relevant parameters in a small amount of time. So I think it's worthwhile seeing exactly what's the limiting factor of being able to decompose a N=4000 qrom.

One double-win (better pedagogy/abstraction) would be to group the CNOTs into a ctrled-XOR or ctrled-SET operation that can operate directly on a single bitsize=n_target register

Otherwise, I can try profiling a decomposition with roughly 10,000 subbloqs. Last time I did this, there was a huge contribution from tracking whether each soquet has been used exactly once. That's something we can futz around with easily.