[simulation] Fix a bug which may have caused multi-qubit gates to be attached

#100 postpones attaching single-qubit gates to multi-qubit gates for as long as possible, but it misses one case: if a multi-qubit gate has exactly one local qubit in the current stage and at least one of its global qubits will become local in the next stage, then the gate must be attached in this stage. This PR implements this case.
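The condition described above can be sketched as a small predicate. This is only an illustration, not the code in this PR: the function name `must_attach_now`, its parameters, and the simplification that every qubit outside `local_now` counts as global are all assumptions made for the example.

```python
from typing import Iterable, Set


def must_attach_now(
    gate_qubits: Iterable[int],
    local_now: Set[int],
    local_next: Set[int],
) -> bool:
    """Hypothetical check for the case handled by this PR.

    gate_qubits: qubit indices the gate acts on.
    local_now:   qubits that are local in the current stage.
    local_next:  qubits that are local in the next stage.
    Assumption: any qubit not in local_now is treated as global.
    """
    qubits = list(gate_qubits)
    if len(qubits) < 2:
        # Only multi-qubit gates are covered by this case.
        return False

    local = [q for q in qubits if q in local_now]
    non_local = [q for q in qubits if q not in local_now]

    # The gate must have exactly one local qubit in the current stage.
    if len(local) != 1:
        return False

    # At least one currently global qubit becomes local in the next stage.
    return any(q in local_next for q in non_local)
```

For example, with `local_now = {0, 1, 2}` and `local_next = {1, 2, 5}`, a two-qubit gate on qubits 0 and 5 satisfies the condition, while the same gate with `local_next = {0, 1, 2}` does not.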

This PR also updates test_simulation to use the new cost function and adds some debug assertions.

Benchmark: qft circuit, 33 total qubits, 28 local qubits, with the new cost function:
Before (wrong): 14 fusion, 3 shared-memory, total cost = 256 + 25.4 = 281.4, running time = 12s
After: 14 fusion, 3 shared-memory, total cost = 268.5 + 25.4 = 293.9, running time = 13s
If we remove all SWAP gates: 7 fusion (2 empty), 4 shared-memory, total cost = 81 + 218.1 - 12.6 = 286.5, running time = 11s
