Fixes a bug in the greedy kernel execution after the DP: before this PR, we could execute a multi-qubit gate that uses a qubit as a non-insular qubit before executing an earlier multi-qubit gate that uses the same qubit as an insular qubit.
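The ordering constraint behind this fix can be sketched as a small check. This is an illustrative sketch only, not the code in this PR: the `Gate` type, `must_keep_order`, and `order_is_valid` are hypothetical names, and it assumes that two gates sharing a qubit may be reordered only when that qubit is insular in both of them.

```python
from typing import List, NamedTuple, Set


class Gate(NamedTuple):
    """Hypothetical gate record, used for illustration only."""
    name: str
    qubits: Set[int]   # all qubits the gate acts on
    insular: Set[int]  # subset of `qubits` that are insular for this gate


def must_keep_order(earlier: Gate, later: Gate) -> bool:
    """Assumption: a shared qubit forces program order unless it is
    insular in both gates."""
    shared = earlier.qubits & later.qubits
    return any(q not in earlier.insular or q not in later.insular
               for q in shared)


def order_is_valid(program: List[Gate], execution: List[int]) -> bool:
    """Check that `execution` (a permutation of indices into `program`)
    never runs a later gate before an earlier gate it depends on."""
    position = {idx: pos for pos, idx in enumerate(execution)}
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            if must_keep_order(program[i], program[j]) and position[i] > position[j]:
                return False
    return True


program = [
    Gate("g0", {0, 1}, {0, 1}),  # qubit 0 is insular here
    Gate("g1", {0, 2}, set()),   # qubit 0 is non-insular here
    Gate("g2", {1, 3}, {1}),     # shares qubit 1 with g0, insular in both
]

print(order_is_valid(program, [0, 1, 2]))  # program order
print(order_is_valid(program, [1, 0, 2]))  # g1 before g0: the buggy pattern
print(order_is_valid(program, [2, 0, 1]))  # g2 may move ahead of g0
```

The second ordering is the pattern described above: `g1` touches qubit 0 as a non-insular qubit but runs before `g0`, which uses qubit 0 as an insular qubit.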
When closing shared-memory kernels, we now add as many touching_qubits as possible, instead of sticking to the existing touching_qubits. However, we still do not remove existing absorbing kernels, although it may be better to remove some existing absorbing kernels' active_qubits in exchange for more touching_qubits in the new absorbing kernel.
TODOs:
We still have empty kernels in realamprandom.
We may want to update the cost for each gate based on the single-qubit gates attached to it.
Do we want to try removing some existing absorbing kernels' active_qubits in exchange for more touching_qubits in the new absorbing kernel?
There is now a mismatch in touching_kernels between the DP and the greedy execution of the gates after the DP. This may cause more gates to be put into an earlier kernel when they could be put into a later kernel instead. This can be bad if the earlier kernel is a shared-memory kernel and the later kernel is a fusion kernel.
Benchmark:
realamprandom, 28 total qubits, 28 local qubits
Before: 42 kernels (6 fusion (0 non-empty), 36 shared-memory (30 non-empty)), cost = 1318.8, running time = 52s
After: 22 kernels (1 fusion (0 non-empty), 21 shared-memory (20 non-empty)), cost = 1125.8, running time = 52s
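
For reference, the relative change implied by the benchmark numbers above can be computed as follows. This is just arithmetic on the reported figures; the variable names are illustrative.

```python
# Figures copied from the realamprandom benchmark above.
kernels_before, kernels_after = 42, 22
cost_before, cost_after = 1318.8, 1125.8

# Relative reductions, in percent.
kernel_reduction_pct = 100.0 * (kernels_before - kernels_after) / kernels_before
cost_reduction_pct = 100.0 * (cost_before - cost_after) / cost_before

print(f"kernels: {kernel_reduction_pct:.1f}% fewer")  # kernels: 47.6% fewer
print(f"cost: {cost_reduction_pct:.1f}% lower")       # cost: 14.6% lower
```

The running time is unchanged at 52s in both runs.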