Improve the slow complication of MPSCircuit

royess commented 6 months ago

Issue Description

In my tests, MPSCircuits take much longer to get jitted than the same circuits. For example, within 2 minutes for a default-type circuit of 8 qubits, but at least more than 40 minutes for the same circuit in MPS. (I didn't wait for the complication to finish.)

Considering that we typically want to use an MPS simulator for a relatively large circuit size, this makes using MPS circuits particularly difficult.

Proposed Solution

Additional References

If it helps, I can provide a script with a slow MPS complication. This seems to be a general problem for slightly large circuit sizes, though.

royess commented 6 months ago

@refraction-ray I would appreciate your comments or suggestions.

refraction-ray commented 6 months ago

Indeed, jitting a large MPS circuit would require longer times than a plain Circuit, the jitting time varies significantly for different backend (jax vs. tf) or different hardware (cpu vs. gpu). What is the depth of you test circuit, typically MPSCircuit of 8 qubits wouldn't require that longer jitting time. Besides, you may also try the unjit version to see whether the running time is acceptable?

royess commented 6 months ago

Thanks for the quick reply! Currently, I am using jax+cpu.

What is the depth of you test circuit

In total, 188 one- or two-qubit gates. The depth should be around 24, then.

By the way, I also tested a circuit of 320 gates via the snippet you provided in https://github.com/tencent-quantum-lab/tensorcircuit/issues/204#issuecomment-2072173800 (but changing the circuit to contain 8 qubits and increase the depth) in the same environment. That takes a jit time of about 120s.

Differences:

My circuit (the slow one) does not obey the 1D locality, i.e., gates are very non-local.
There are function calls to append components to build my circuit.

Besides, you may also try the unjit version to see whether the running time is acceptable?

I tried. But the training seems much slower than plain circuit simulator. It's not quite acceptable for my needs.

Muzhou-Ma commented 6 months ago

Hi, I'm also testing JIT compilation with MPSCircuit with @royess. I use a circuit of the same depth but with geometric locality, and the compile time is faster (since we are not able to finish jit compiling for the non-local circuit, I can't say how much faster). However, I don't understand why locality will cause such a difference. What is the logic of in Tensorcircuit when doing contraction?

Muzhou-Ma commented 6 months ago

However, the compiling time is still prolonged when scaling up the qubit number. It takes about 15 minutes for four qubits and more than 2 hours for 16 qubits (the compilation process is still unfinished). This runs counter to the purpose of using MPSCircuit, which is to optimize the computation resource (both time and memory) when scaling up the qubit number. Is there any way to make this better?

refraction-ray commented 6 months ago

The locality is very important for MPSCircuit. If a non-local two-qubit gate is applied, it will be firstly transformed into a series of local swap gates + local two-qubit gate, and all these gates will be applied to the MPS sequentially. The reason is that only local two-qubit tensor can be safely applied and truncated to merge into the MPS for TEBD like algorithms.

refraction-ray commented 6 months ago

for unjit version, if AD is not required in your workflow, maybe numpy backend is the fastest

refraction-ray commented 6 months ago

Another possible workaround is when your circuit has some time periodicity, then a scan wrapper can greatly reduce the jitting time, see an example for Circuit: https://github.com/tencent-quantum-lab/tensorcircuit/blob/master/examples/hea_scan_jit_acc.py. I believe the example can be transfered to MPSCircuit, with in and out the stacked MPS tensors.

royess commented 6 months ago

Thanks for your advice! But I think we need AD and do not have time periodicity in our circuit.

royess commented 6 months ago

Do you think it is doable to speed up the complication for MPSCircuit? We will be happy to help if you have ideas on how to work on that. (We need this feature in our research. And it seems to have no other workaround.)

Naively, I will suppose an MPSCircuit is not much more complicated than a normal one in its structure. And should it come with a reasonable jit time?

refraction-ray commented 6 months ago

Do you think it is doable to speed up the complication for MPSCircuit?

Firstly, from physics perspective, a TEBD like algorithm applied on non-periodic and very structured circuit often leads to a very large approximation error unless there are some types of theory guarantee, eg. one can show that the intermediate state in the circuit is always area law entangled.

From engineering perspective, accelerating jit time is much harder than accelerating running time, as the former is nearly fixed by the ML framework that we have very less control.

One possible way is to support MPS with grouping qubits as one tensor instead of one qubit for one tensor. In the former case, much fewer QR or SVD is required and the approximation error is more controllable. eg. d qubits as one tensor leg, the mps tensor has dimension ($\chi$, 2^d, $\chi$). Then only two-qubit gates across different qubit groups requires truncation, gates within one group is directly merged to the MPS tensor by matrix multuplication.

refraction-ray commented 6 months ago

And what is the target circuit metric (qubits number, error, circuit depth, gate number etc.) in your case? Also, have you tried tf backend? the jitting time is much shorter

Muzhou-Ma commented 6 months ago

And what is the target circuit metric (qubits number, error, circuit depth, gate number etc.) in your case?

The target circuit contains about 80 qubits, and about 60 layers of all-to-all connected non-local 2-qubit gates, making it roughly 3000 non-local 2-qubit gates. Do you think it is possible to jit such circuits (with or without MPS)?

Also, have you tried tf backend? the jitting time is much shorter

Yes, we have tried tf backend. Unfortunately, tf backend does not have hittable version of QR decomposition, so jitting MPS is impossible.

refraction-ray commented 6 months ago

The target circuit contains about 80 qubits, and about 60 layers of all-to-all connected non-local 2-qubit gates, making it roughly 3000 non-local 2-qubit gates.

To me, the scale for the simulation is very challenging, by translating to local 2-qubit gates, I guess roughly 30000 two-qubit gates are required for 80 qubits system. Even if MPSCircuit can simulate this, the accuracy would be bad in general. Actually, the simulation scale is even beyond the quantum supremacy experiments, I don't see this as an easy task to run by calling API with one GPU.

Unfortunately, tf backend does not have hittable version of QR decomposition

What do you mean by this, can you run the circuit with jitted tf backend? I don't think vmap is relevant for your use case since one circuit is challenging enough to simulate after all? There is no need to "stack" multiple circuits together to simulate

Muzhou-Ma commented 6 months ago

To me, the scale for the simulation is very challenging, by translating to local 2-qubit gates, I guess roughly 30000 two-qubit gates are required for 80 qubits system. Even if MPSCircuit can simulate this, the accuracy would be bad in general. Actually, the simulation scale is even beyond the quantum supremacy experiments, I don't see this as an easy task to run by calling API with one GPU.

I see, thanks a lot.

There is no need to "stack" multiple circuits together to simulate

Yes, I understand this. However, for our task, we need to run many batches. As you have said, if simulating one circuit is challenging enough, then there's no reason to consider stacking them together.

Muzhou-Ma commented 6 months ago

@refraction-ray Thank you for the discussion. It seems that it is no longer a technical problem. The task we face has a fundamental hardness, which is unsolvable with current classical simulation techniques and reasonable computation resources.

tencent-quantum-lab / tensorcircuit