WIP: add skip transpile to t1

Summary

Please describe what this PR changes as concisely as possible. Link to the issue(s) that this addresses, if any.

Details and comments

Some details that should be in this section include:

Why this change was necessary
What alternative solutions were considered and why the current solution was chosen
What tests and documentation have been added/updated
What do users and developers need to know about this change

Note that this entire PR description field will be used as the commit message upon merge, so please keep it updated along with the PR. Secondary discussions, such as intermediate testing and bug statuses that do not affect the final PR, should be in the PR comments.

PR checklist (delete when all criteria are met)

[ ] I have read the contributing guide CONTRIBUTING.md.
[ ] I have added the tests to cover my changes.
[ ] I have updated the documentation accordingly.
[ ] I have added a release note file using reno if this change needs to be documented in the release notes.

I did some profiling of this PR, #1455, and main using the following script:

import importlib.metadata
import os
import time
from subprocess import run

from qiskit_ibm_runtime import QiskitRuntimeService

from qiskit_experiments.library import T1
from qiskit_experiments.framework import ParallelExperiment

service = QiskitRuntimeService(channel="ibm_quantum")
backend_real = service.backend("ibm_sherbrooke")

start = time.perf_counter()
t1_exp_tmp_list = []
coh_wait_list = [0,10,20]
for i in list(range(10)):
    t1_exp_tmp = T1([i], coh_wait_list, backend=backend_real)
    t1_exp_tmp.set_transpile_options(optimization_level=0)
    t1_exp_tmp_list.append(t1_exp_tmp)

t1_exp_tst = ParallelExperiment(t1_exp_tmp_list, backend=backend_real, flatten_results=True)
t1_exp_tst._transpiled_circuits()
total_time = time.perf_counter() - start

qe_proc = run(["git", "rev-parse", "--abbrev-ref", "HEAD"], check=True, text=True, capture_output=True)
print(f"Qiskit: {importlib.metadata.version('qiskit')}; QE: {qe_proc.stdout.strip()}; Parallel: {os.environ['QISKIT_PARALLEL']}")
print(f"Total time: {total_time:0.1f} seconds")

I ran this on the current stable Qiskit (1.0.2) and the current main (this commit in particular but most importantly a branch containing https://github.com/Qiskit/qiskit/pull/12410). For the main tests, I used pip install . (non-editable) so that the Rust compilation was not in debug mode. Here are the results:

Qiskit: 1.0.2; QE: main; Parallel: TRUE                    
Total time: 37.2 seconds                       

Qiskit: 1.0.2; QE: main; Parallel: FALSE          
Total time: 38.7 seconds

Qiskit: 1.0.2; QE: sampler_skiptrans; Parallel: TRUE
Total time: 3.2 seconds      

Qiskit: 1.0.2; QE: sampler_skiptrans; Parallel: FALSE
Total time: 3.7 seconds 

Qiskit: 1.0.2; QE: fast-transpile; Parallel: TRUE   
Total time: 4.3 seconds      

Qiskit: 1.0.2; QE: fast-transpile; Parallel: FALSE   
Total time: 4.5 seconds 

Qiskit: 1.2.0; QE: main; Parallel: TRUE             
Total time: 35.7 seconds     

Qiskit: 1.2.0; QE: main; Parallel: FALSE             
Total time: 4.9 seconds                      

Qiskit: 1.2.0; QE: sampler_skiptrans; Parallel: TRUE                                                                   
Total time: 4.6 seconds

Qiskit: 1.2.0; QE: sampler_skiptrans; Parallel: FALSE
Total time: 4.5 seconds

Qiskit: 1.2.0; QE: fast-transpile; Parallel: TRUE
Total time: 4.5 seconds

Qiskit: 1.2.0; QE: fast-transpile; Parallel: FALSE
Total time: 4.7 seconds

Note these are statistics of one, so there are probably decent fractional error bars on the numbers (10%?). I tried re-running the Qiskit: 1.2.0; QE: main; Parallel: FALSE case a few times and got 3.6, 3.9, 4.5, and 4.7 seconds.

From these numbers we see that QE main with Qiskit 1.0.2 is significantly slower than the other QE branches, whether or not tranpiler parallelism is used. Additionally, QE main is still slow with Qiskit 1.2.0 when parallelism is used. These results are all consistent with the motivation for https://github.com/Qiskit/qiskit/pull/12410 -- for a large device with pulse calibrations, serializing the target (done even when parallelism is disabled on Qiskit 1.0.2) costs significantly more than the benefit of parallelizing the transpilation. Skipping the serialization as is done in the QE main with Qiskit 1.2.0 case cuts down the processing time significantly, making it indistinguishable from the two branches that bypass transpilation.

Though for these small circuits we don't make much use of the transpiler's cutting edge features, it is still nice to let it handle layout and basis translation. While it is nice to format the experiments to generate circuits using the typical IBM basis gates, it is also nice to have experiments that could in principle run on other providers that used rx or h instead of sx if we don't have to give that up. On Windows and mac OS, the default is not to parallelize transpile, so they should already be fast with Qiskit 1.2.0. On Linux, parellelism is enabled by default. So I think the minimal change to avoid this serialization overhead would be to set num_processes=1 in the default transpile options if the Qiskit version is 1.0 or higher (the option was added in Qiskit 1.0). We would need to decide whether to make this the default in BaseExperiment or in the individual experiments we decide are small. If doing it in BaseExperiment, we should consider setting the default back to None for QV (the only experiment where transpilation is non-trivial, I think). What do you think @dcmckayibm @nkanazawa1989 ?

Thanks Will. I'm glad to see the transpile overhead itself is almost negligible (at least in the timescale of whole experiment run). It would be nice to keep full transpile stages if it's performance is comparable, but we can also prepare custom staged pass manager including only translation and layout (only expand passes), because it's reasonable to assume the most of QE instances define ISA circuits. But according to your test the speed gain from doing this would be less than 1 sec, which doesn't really make sensible performance difference in our case.

In that sense just updating the default transpile option with num_processes = 1 is the most reasonable approach for now (we can check qiskit major version or define some global flag at import). I will review and merge in 0.7 if you can write PR.

Thanks @wshanks , another option would be to do a check in the class constructor (sort of like what I did in the _transpiled_circuits, but I think it should be moved to the constructor based on an oliver suggestion) for the basis gates being on the backend and then invoke transpilation if they are not. This leaves open the option of using using the circuits for more general backends but leaves the speed improvements. On top of this I definitely agree with the num_process change.

@nkanazawa1989 I will make a PR for num_processes since I think we want that any way. I agree about a staged pass manager that just does layout and translation being appropriate. When I started working on Qiskit Experiments, I didn't understand the transpiler very well (not that I am an expert now; I don't have a good understanding of why the calibration pass manager needs four passes to do layout for example). I wonder if experiment classes should be defining a default pass manager instead of a default set of transpile options. On the other hand, then customization gets difficult if the user needs to choose between the experiment's default and passing a new pass manager to override it. Allowing the flexibility to adjust parts of the pass manager starts to look like the transpile() function since that creates a pass manager with options for adjusting parts of it....

@dcmckayibm I am confused about the class constructor point. What is the flow of information that you are expecting there? One technical point is that the backend is not a required argument of the constructor, but there is a set_backend hook that gets called when the backend is set if it was not in the constructor, so this is a minor point. But whether the basis gates are there or not, you still need to handle layout. So would you still have a circuits that could build the circuits with the target layout and then have a custom _transpiled_circuits that would do a transpile if the basis gates did not match? Why not just check the basis gates in the custom _transpiled_circuits?

I tried editing the test code so that the subexperiments had 100 circuits instead of 3. In that case this PR and #1455 both still measured a time around 4.5 seconds (which I think is mostly the experiment construction time since it did not change compared to the 3 circuits case) while main increased to around 7.8 seconds. So transpiling is taking around 30 ms per circuit. It is a little bit interesting that #1455 is still only negligibly slower than this PR for 100 circuits even though it rebuilds every circuit with new indices rather than building with the final indices to start out. Given that, I lean towards #1455 over this PR because it is nice to leave circuits() untouched and only change the behavior of _transpiled_circuits() (and in a systematic way like in #1455 that can be applied to any experiment).

Ideally, transpiling small circuits would have low enough overhead that we could just use the transpiler but maybe we are not there yet (Qiskit is still pushing for performance improvements, how much overhead per circuit can we accept?). For generate_preset_pass_manager(0, backend=backend, initial_layout=[0, 1]), I see for pm.to_flow_controller().passes:

[<qiskit.transpiler.passes.utils.contains_instruction.ContainsInstruction at 0x7f824b2b46e0>,                                                                                                                                                                                                                                 
 <qiskit.passmanager.flow_controllers.ConditionalController at 0x7f8248fa7410>,                                                                                
 <qiskit.transpiler.passes.synthesis.unitary_synthesis.UnitarySynthesis at 0x7f824aee71a0>,                                                                    
 <qiskit.transpiler.passes.synthesis.high_level_synthesis.HighLevelSynthesis at 0x7f8248fa7440>,                                                               
 <qiskit.transpiler.passes.basis.basis_translator.BasisTranslator at 0x7f8248f5c050>,                                                                          
 <qiskit.transpiler.passes.layout.set_layout.SetLayout at 0x7f824a5460c0>,                                                                                     
 <qiskit.passmanager.flow_controllers.ConditionalController at 0x7f824a411a30>,                                                                                
 <qiskit.transpiler.passes.layout.full_ancilla_allocation.FullAncillaAllocation at 0x7f824a4622d0>,                                                            
 <qiskit.transpiler.passes.layout.enlarge_with_ancilla.EnlargeWithAncilla at 0x7f824aeb8770>,                                                                  
 <qiskit.transpiler.passes.layout.apply_layout.ApplyLayout at 0x7f824b27a210>,                                                                                 
 <qiskit.transpiler.passes.utils.check_map.CheckMap at 0x7f824ad041a0>,                                                                                        
 <qiskit.passmanager.flow_controllers.ConditionalController at 0x7f8248fa5bb0>,                                                                                
 <qiskit.transpiler.passes.utils.filter_op_nodes.FilterOpNodes at 0x7f8248fa5df0>,                                                                             
 <qiskit.transpiler.passes.synthesis.unitary_synthesis.UnitarySynthesis at 0x7f824a894140>,                                                                    
 <qiskit.transpiler.passes.synthesis.high_level_synthesis.HighLevelSynthesis at 0x7f824b2b4ec0>,                                                               
 <qiskit.transpiler.passes.basis.basis_translator.BasisTranslator at 0x7f8248fa67b0>,                                                                          
 <qiskit.transpiler.passes.utils.check_gate_direction.CheckGateDirection at 0x7f8248fa6870>,                                                                   
 <qiskit.passmanager.flow_controllers.ConditionalController at 0x7f8248fa7290>,                                                                                
 <qiskit.transpiler.passes.synthesis.unitary_synthesis.UnitarySynthesis at 0x7f824a894140>,                                                                    
 <qiskit.transpiler.passes.synthesis.high_level_synthesis.HighLevelSynthesis at 0x7f824b2b4ec0>,                                                               
 <qiskit.transpiler.passes.basis.basis_translator.BasisTranslator at 0x7f8248fa67b0>,                                                                          
 <qiskit.transpiler.passes.calibration.pulse_gate.PulseGates at 0x7f8248fa72c0>,    
 <qiskit.transpiler.passes.utils.contains_instruction.ContainsInstruction at 0x7f8248fa7260>,                                                                  
 <qiskit.passmanager.flow_controllers.ConditionalController at 0x7f8248fa7350>]

Besides layout, those should all be just checks that find nothing to do. Would a more limited pass manager be any faster?

I guess negligible overhead of rewriting the circuit is thanks to the singleton gates (especially in T1 circuits without parameterized gate).

@dcmckayibm In the original design circuits() is a method that returns virtual circuits and its widths must match with the physical qubits length. This guarantees the parallel experiment works with any experiments. If we allow physical circuits in the circuits() it allows experiment to include instructions on a qubit not in the physical qubits list, which makes parallel experiment implementation much more complicated. So I want to remain current convention unchanged.

This also makes me think virtual index -> physical index mapping is a common pattern for all experiments, and we can also implement this logic in the BaseExperiment.run and remove the builtin transpile call from the _transpiled_circuits() method. Only custom experiment like QV that requires SWAP routing and translation must explicitly fill the method (or _transpiled_circuits() just implements translation stage and the method is called only when the backend doesn't support all instructions there, as @dcmckayibm suggests). If I write the circuit extender in #1455 as a small Rust code this will become much more faster. However I guess this would be difference of 10s ms, and I'm not sure this is still worth implementing. Perhaps, if I run massive parallel experiment with 100s of qubits, you could gain speedup of few sec, then it could benefits some daily calibration routine.

I did some further testing of variations on transpiling like those in this PR and in #1455. The code I used for the testing is in this gist. I will discuss the script below but first here are the outputs I got from running it with Qiskit 1.1 (and my qiskit-experiments repo was checked out to the branch from #1455 though the test script doesn't use anything relevant to a specific branch, just BackendTiming):

t1_circuits_warmup: 3.7 seconds per call
t1_circuits: 0.0098 seconds per call
t1_circuits_with_translation_pulse_gates_checks: 0.011 seconds per call
warm_up_fast_transpile_no_pm: 0.076 seconds per call
fast_transpile_no_pm: 0.074 seconds per call
build_pass_manager: 4.3e-05 seconds per call
fast_transpile_t1_serial_with_basis_and_pulse_gate_passes: 0.23 seconds per call
fast_transpile_t1_serial_with_custom_basis_and_pulsegate_checks: 0.079 seconds per call
standard_transpile_serial: 0.34 seconds per call
limited_pm_serial: 0.32 seconds per call

The warm up lines are just to make sure one-time processing like deserializing the target does not penalize the first function more than the following ones.

For all the performance tests, 100 T1 circuits were used.

t1_circuits: this test is like the method in this PR that generates the circuit with the physical qubit index right away without remapping.
t1_circuits_with_translation_pulse_gates_checks: This is this PR for remapping the indices but then does a check for if it needs to do basis translation or pulse gate calibration before building and running those passes and for this test it should not need to run the passes.
fast_transpile_no_pm: this test is like #1455 except without the test that all the instructions are in the backend's basis gates.
build_pass_manager: Just a test of how long it takes to create a PassManager which is not much. This was a quick check for pass manager overhead. If there is any extra overhead, it is delayed to the run() call.
fast_transpile_t1_serial_with_basis_and_pulse_gate_passes: This is like #1455 for remapping the indices but then builds a pass manager to do basis translation and the pulse gate pass and runs it on the circuits serially.
fast_transpile_t1_serial_with_custom_basis_and_pulsegate_checks: This is like #1455 for remapping the indices but then does a check for if it needs to do basis translation or pulse gate calibration before building and running those passes and for this test it should not need to run the passes.
standard_transpile_serial: This one just calls transpile with num_processes=1 and optimization_level=0 like qiskit-experiments has been doing.
limited_pm_serial: This one builds a pass manager that does layout, basis translation, and pulse gate calibration with num_processes=1.

There are a couple other tests commented out that let transpile or pm.run use multiprocessing. Those both took 14 seconds, corresponding to around 140 ms per circuit. Since the times were similar for the two cases, I think this is the overhead for serializing the circuit and target between processes.

So to try to summarize:

Just building the circuits takes around 10 ms (0.1 ms per circuit).
Fast transpile (recreating the circuits with physical indices) takes around 75 ms (0.75 ms per circuit)
Doing the checks for basis translation and pulse gate calibration adds only 1 ms (0.01 ms per circuit)
Doing the basis translation and pulse gate passes on a set of circuits that don't need any changes adds 125 ms (1.25 ms per circuit).
The limited transpile (which skips some passes about synthesis) is only negligibly faster than the standard transpile (310 vs 330 ms).
Standard transpile is about 4 times slower than the fast transpile (75 ms vs 330 ms).

Here are the options I see:

Most conservative: having understood the serialization overhead of Qiskit now, we could leave things as they are and live with the 3.3 ms per circuit overhead. If you consider a 100 circuit experiment in parallel on a 100 qubit device, that is 33 seconds, which is starting to get big enough to be annoying (that's the time to create the individual circuits; there is still some time to stitch them together).
Requiring limited changes: we could modify BaseExperiment._transpiled_circuits to use the fast transpile method for applying the qubit layout and then to check if translation or pulse gate passes are needed and only apply them if so. We could also update any experiments that don't match the usual basis set of IBM backends. For the 100 circuit, 100 qubit example, the time is reduced to 8.2 seconds, which is still a bit long but getting better.
Requiring wider changes: we could start updating experiments to support a mode like in this PR -- some way to generate circuits already mapped to the physical indices. Combined with the checks for the other passes, this could reduce the time to 0.11 ms per circuit based on the numbers above. For the 100 circuit, 100 qubit example, that gets the time down to around 1 second. When the pre-mapped circuit option is not available, qiskit-experiments could fall back to fast transpile or standard transpile.

A couple caveats:

Anything other than standard transpile should be reviewed to make sure that we are not missing any edge cases that would slow things down to account for.
Main Qiskit is working to reimplement the core functionality in Rust to improve performance. That could improve things overall and make some of the special handling here unnecessary. Perhaps there could be a fast way to remap the indices in Rust for example. Or perhaps standard transpile could get fast enough in the the case of only needing to map the layout and check the basis gates that doing special remapping and checking gives no benefit. Those possibilities might reduce the incentive to making wider changes, but we might still want to make a limited set of changes in the near term.

What do you think?

Thanks for careful profiling. It seems to me 8sec vs 1sec in a practical setting is enough gain to consider the wider change. I'd emphasize that transpile overhead also comes from the data conversion into DAGCircuit IR, and the speed of the fast transpile method depends on the circuit depth because it needs to loop over all operation data on the wire (e.g. with something like RB this would become much slower). Regarding Rustification, currently Qiskit doesn't implement Rust version of QuantumCircuit (it has CircuirData struct though) and I wouldn't expect much performance gain from it. This is because eventually Rust just calls Python function with PyObject.

On the other hand, the change in this PR breaks polymorphism of experiment. This allows experiment to directly create physical circuit, and what the base class run method receives will become union of virtual and physical circuit. In some edge case, it must accept a list of mixture of them. Unfortunately we don't have any attribute or type information to distinguish them. Current implementation is simple because we can naively assume experiment circuits are all virtual. Probably we need to insert another layer (base class) for experiments creating physical circuit and the other class for virtual circuits, and the base run method will be aware of only physical (or ISA) circuits. However this adds another hierarchy to the already complicated QE class tree.

If we come up with the elegant implementation I'm fine with taking the approach of the last bullet.

Near-term we want to implement some quick version of SamplerV2 support that looks a lot like the current backend.run support. Another forward looking direction that will help with performance though is that once we support SamplerV2 we can start to make use of its circuit parameterization support. In that case, we only have to transpile one circuit per qubit for an experiment like T1 instead of 100 circuits like in my examples above. In that case, even the full transpile for 100 qubits takes only 330 ms.

Not every experiment can benefit from parameterization. RB can be parameterized to realize different seeds with the same circuit by varying rz angles, but I don't think Clifford length can be parameterized.

As I was making my post with the profiling numbers, I was re-running things and making changes. I Just realized that the numbers I had pasted in were from a run with a bug that skipped the pass manager run for the fast_transpile_t1_serial_with_basis_and_pulse_gate_passes case, so it looked like it took the same time as the fast_transpile_t1_serial_with_custom_basis_and_pulsegate_checks case. I updated the times now. The times in the linked gist were correct when I posted the comment. I just didn't copy over the updated times to the comment here. The text of the comment was based on the correct times (e.g. "Doing the basis translation and pulse gate passes on a set of circuits that don't need any changes adds 125 ms").

@dcmckayibm @nkanazawa1989 Based on feedback outside of this issue, I think we have consensus for the following approach:

Add a class variable (full_transpile) that BaseExperiments._transpiled_circuits() can check. This could also be a transpile option that gets popped out before passing options to transpile() but would that cause any issues with subclasses currently overriding _trasnpiled_circuits() and not expecting it?
If the full transpile option is False, BaseExperiments._transpiled_circuits() will do #1455 fast transpile layout of the circuit to physical qubits. Then it will do a check for instructions matching the basis gates and custom calibrations in the Target. If either of those checks shows a need for doing something, a pass manager with those passes is run (or maybe a preset pass manager with layout skipped?). If the full transpile option is True, then transpile() is called like with the current code.

Does that sound good?

This PR breaks polymorphism of experiment.

@nkanazawa1989 I think your concern here could be addressed by modifying experiments to have a new mapped_circuits(self, physical_qubits) -> list[QuantumCircuit] method that always returns circuits mapped to a specific layout. Then circuits() could call this with physical_qubits=tuple(range(len(self.physical_qubits))) and _transpiled_circuits() could call it with physical_qubits=self.physical_qubits and the signature of circuits() could remain consistent. It does incur the extra complexity of experiments needing to implement mapped_circuits() instead of circuits() (circuits() is still needed as return self.mapped_circuits(physical_qubits=tuple(range(len(self.physical_qubits)))) though maybe we make a shortcut for that). This might be worth pursuing, though maybe with parameterized SamplerV2 circuits and improvements to Qiskit performance it won't be needed. I don't think it would be that hard to implement. My issue is more with not liking the complexity of asking experiment writers to write in a form that allows mapping if we don't need to.

Thanks for all the profiling @wshanks and I think the solution is a good one.

Thanks @wshanks for proposal. I also think the mapped_circuits -> Union[virtual, physical] is good idea. I'm not sure what should be the argument of this method (we can take QuantumRegister and bit location as well). If you write this PR probably we can include it in 0.7. I wonder what should happen if an experiment doesn't define mapped_circuits? Does _transpiled_circuits need to perform attribute check and raise deprecation warning?

Here is a first pass at what I suggested above: https://github.com/Qiskit-Extensions/qiskit-experiments/pull/1459

I am still thinking about edge cases and studying how these changes work. In particular, since I made it fall back to full transpile seamlessly, I would like to check which tests are using the fast transpile and which are using the full and make sure that there are no surprises.

qiskit-community / qiskit-experiments