royess opened this issue 6 months ago
@refraction-ray I would appreciate your comments or suggestions.
Indeed, jitting a large MPSCircuit requires a longer time than a plain Circuit, and the jitting time varies significantly across backends (jax vs. tf) and hardware (cpu vs. gpu). What is the depth of your test circuit? Typically an MPSCircuit of 8 qubits wouldn't require that long a jitting time. Besides, you may also try the unjitted version to see whether the running time is acceptable.
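For instance, a rough timing check along these lines (the gate pattern is only illustrative, and I'm assuming `expectation_ps` is available on MPSCircuit in your version; otherwise the general `expectation` API can be used):

```python
import time
import tensorcircuit as tc

tc.set_backend("jax")  # also try "tensorflow" to compare jitting times
n, depth = 8, 24

def energy(params):
    c = tc.MPSCircuit(n)
    for layer in range(depth):
        for i in range(n):
            c.rx(i, theta=params[layer, i])
        for i in range(n - 1):
            c.cnot(i, i + 1)
    return tc.backend.real(c.expectation_ps(z=[0]))

params = tc.backend.ones([depth, n])

t0 = time.time()
energy(params)                      # eager (unjitted) run
print("unjitted run:", time.time() - t0, "s")

energy_jit = tc.backend.jit(energy)
t0 = time.time()
energy_jit(params)                  # first call pays the compilation cost
print("jit + first call:", time.time() - t0, "s")
```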
Thanks for the quick reply! Currently, I am using jax+cpu.
What is the depth of your test circuit?
In total, 188 one- or two-qubit gates. The depth should be around 24, then.
By the way, I also tested a circuit of 320 gates via the snippet you provided in https://github.com/tencent-quantum-lab/tensorcircuit/issues/204#issuecomment-2072173800 (but changing the circuit to contain 8 qubits and increasing the depth) in the same environment. That takes a jit time of about 120 s.
Differences:
Besides, you may also try the unjitted version to see whether the running time is acceptable.
I tried, but the training seems much slower than with the plain circuit simulator. It's not quite acceptable for my needs.
Hi, I'm also testing JIT compilation with MPSCircuit together with @royess. I use a circuit of the same depth but with geometric locality, and the compile time is faster (since we were not able to finish jit compilation for the non-local circuit, I can't say how much faster). However, I don't understand why locality causes such a difference. What is the logic in TensorCircuit when doing the contraction?
However, the compile time is still prolonged when scaling up the qubit number. It takes about 15 minutes for four qubits and more than 2 hours for 16 qubits (the compilation is still unfinished). This runs counter to the purpose of using MPSCircuit, which is to save computational resources (both time and memory) when scaling up the qubit number. Is there any way to make this better?
Locality is very important for MPSCircuit. If a non-local two-qubit gate is applied, it is first transformed into a series of local swap gates plus a local two-qubit gate, and all these gates are applied to the MPS sequentially. The reason is that only a local two-qubit tensor can be safely applied, truncated, and merged into the MPS for TEBD-like algorithms.
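To illustrate the point, a plain-Python sketch of such a swap decomposition (illustrative only; the actual TensorCircuit internals may differ, e.g. in whether the qubit is swapped back):

```python
def decompose_nonlocal(i, j):
    """List the nearest-neighbour operations replacing one non-local
    two-qubit gate on qubits (i, j) with i < j: swap qubit i rightwards
    until it sits next to j, apply the gate locally, then swap back."""
    ops = []
    for k in range(i, j - 1):
        ops.append(("swap", k, k + 1))   # move qubit i next to qubit j
    ops.append(("gate", j - 1, j))       # now a local two-qubit gate
    for k in reversed(range(i, j - 1)):
        ops.append(("swap", k, k + 1))   # restore the original ordering
    return ops

print(len(decompose_nonlocal(0, 5)))     # 9 local operations for distance 5
```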
For the unjitted version, if AD is not required in your workflow, the numpy backend is probably the fastest.
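E.g., a minimal sketch of an eager run on the numpy backend (hypothetical two-qubit circuit just to show the switch):

```python
import tensorcircuit as tc

# no jit and no autodiff: the numpy backend runs everything eagerly,
# so there is no compilation cost at all
tc.set_backend("numpy")

c = tc.MPSCircuit(8)
c.h(0)
c.cnot(0, 1)
print(c.expectation_ps(z=[1]))
```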
Another possible workaround: when your circuit has some time periodicity, a scan wrapper can greatly reduce the jitting time; see an example for Circuit: https://github.com/tencent-quantum-lab/tensorcircuit/blob/master/examples/hea_scan_jit_acc.py. I believe the example can be transferred to MPSCircuit, with the stacked MPS tensors as the scan input and output.
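For reference, a minimal sketch of the scan idea with a plain Circuit on the jax backend (the layer structure and parameter shape are assumptions for illustration; carrying the stacked MPS tensors instead of the statevector would follow the same pattern):

```python
import jax
import tensorcircuit as tc

tc.set_backend("jax")
n, depth = 8, 24

def one_layer(state, layer_params):
    # rebuild a circuit from the current statevector and apply one layer
    c = tc.Circuit(n, inputs=state)
    for i in range(n):
        c.rx(i, theta=layer_params[i])
    for i in range(n - 1):
        c.cnot(i, i + 1)
    return c.state(), None

def energy(params):  # params has shape (depth, n)
    init = tc.Circuit(n).state()
    state, _ = jax.lax.scan(one_layer, init, params)
    c = tc.Circuit(n, inputs=state)
    return tc.backend.real(c.expectation_ps(z=[0]))

energy_jit = tc.backend.jit(energy)
params = tc.backend.ones([depth, n])
print(energy_jit(params))  # only one layer body is traced and compiled
```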
Thanks for your advice! But I think we need AD and do not have time periodicity in our circuit.
Do you think it is doable to speed up the compilation for MPSCircuit? We would be happy to help if you have ideas on how to work on that. (We need this feature in our research, and there seems to be no other workaround.)
Naively, I would suppose that an MPSCircuit is not much more complicated than a normal one in its structure, so shouldn't it come with a reasonable jit time?
Do you think it is doable to speed up the compilation for MPSCircuit?
Firstly, from a physics perspective, a TEBD-like algorithm applied to a non-periodic and very structured circuit often leads to a very large approximation error unless there is some type of theoretical guarantee, e.g., one can show that the intermediate states in the circuit are always area-law entangled.
From an engineering perspective, accelerating jit time is much harder than accelerating running time, as the former is largely fixed by the ML framework, over which we have very little control.
One possible way is to support an MPS with several qubits grouped into one tensor instead of one tensor per qubit. In that case, many fewer QR or SVD decompositions are required and the approximation error is more controllable. E.g., with d qubits as one tensor leg, the MPS tensor has dimension ($\chi$, $2^d$, $\chi$). Then only two-qubit gates across different qubit groups require truncation; gates within one group are directly merged into the MPS tensor by matrix multiplication.
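A rough numpy sketch of the grouped-tensor idea (shapes only, not TensorCircuit code; the cross-group gate contraction itself is omitted and only the recombination and truncation step is shown):

```python
import numpy as np

chi, d = 16, 3            # bond dimension and qubits per group
dim = 2 ** d              # physical dimension of one grouped tensor

# one grouped MPS tensor: (left bond, 2^d physical leg, right bond)
A = np.random.randn(chi, dim, chi)

# a gate acting entirely inside one group is a (2^d x 2^d) matrix and
# merges into the tensor by matrix multiplication, no truncation needed
U_intra = np.linalg.qr(np.random.randn(dim, dim))[0]  # random unitary
A_new = np.einsum("pq,aqb->apb", U_intra, A)

# a gate across neighbouring groups acts on the combined physical legs of
# two tensors; after the contraction an SVD and truncation are required
B = np.random.randn(chi, dim, chi)
theta = np.einsum("apb,bqc->apqc", A, B).reshape(chi * dim, dim * chi)
u, s, vh = np.linalg.svd(theta, full_matrices=False)
keep = min(chi, len(s))   # truncate back to the bond dimension
A_trunc = u[:, :keep].reshape(chi, dim, keep)
B_trunc = (np.diag(s[:keep]) @ vh[:keep]).reshape(keep, dim, chi)
```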
And what is the target circuit metric (qubit number, error, circuit depth, gate number, etc.) in your case? Also, have you tried the tf backend? The jitting time is much shorter.
And what is the target circuit metric (qubit number, error, circuit depth, gate number, etc.) in your case?
The target circuit contains about 80 qubits and about 60 layers of all-to-all connected non-local 2-qubit gates, making it roughly 3000 non-local 2-qubit gates in total. Do you think it is possible to jit such circuits (with or without MPS)?
Also, have you tried the tf backend? The jitting time is much shorter.
Yes, we have tried the tf backend. Unfortunately, the tf backend does not have a jittable version of QR decomposition, so jitting the MPS is impossible.
The target circuit contains about 80 qubits and about 60 layers of all-to-all connected non-local 2-qubit gates, making it roughly 3000 non-local 2-qubit gates in total.
To me, the scale of this simulation is very challenging: by translating to local 2-qubit gates, I guess roughly 30000 two-qubit gates are required for an 80-qubit system. Even if MPSCircuit could simulate this, the accuracy would be bad in general. Actually, the simulation scale is even beyond the quantum supremacy experiments; I don't see this as an easy task to run by calling an API with one GPU.
Unfortunately, the tf backend does not have a jittable version of QR decomposition
What do you mean by this? Can you run the circuit with the jitted tf backend? I don't think vmap is relevant for your use case, since one circuit is challenging enough to simulate after all; there is no need to "stack" multiple circuits together to simulate.
To me, the scale of this simulation is very challenging: by translating to local 2-qubit gates, I guess roughly 30000 two-qubit gates are required for an 80-qubit system. Even if MPSCircuit could simulate this, the accuracy would be bad in general. Actually, the simulation scale is even beyond the quantum supremacy experiments; I don't see this as an easy task to run by calling an API with one GPU.
I see, thanks a lot.
There is no need to "stack" multiple circuits together to simulate
Yes, I understand this. However, for our task, we need to run many batches. As you have said, if simulating one circuit is challenging enough, then there's no reason to consider stacking them together.
@refraction-ray Thank you for the discussion. It seems this is no longer a technical problem: the task we face has a fundamental hardness that is unsolvable with current classical simulation techniques and reasonable computational resources.
Issue Description
In my tests, MPSCircuits take much longer to jit than the same circuits as plain Circuits. For example, jitting finishes within 2 minutes for a default-type Circuit of 8 qubits, but takes more than 40 minutes for the same circuit as an MPSCircuit. (I didn't wait for the compilation to finish.)
Considering that we typically want to use an MPS simulator for relatively large circuit sizes, this makes using MPS circuits particularly difficult.
Proposed Solution
Additional References
If it helps, I can provide a script with a slow MPS compilation. This seems to be a general problem for moderately large circuit sizes, though.
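For reference, the script would look roughly like this (illustrative gate layout, not my exact circuit; I'm assuming `expectation_ps` works on MPSCircuit here):

```python
import time
import tensorcircuit as tc

tc.set_backend("jax")
n, depth = 8, 24

def build(circuit_cls):
    # same gate sequence for both circuit classes, jitted expectation value
    def f(params):
        c = circuit_cls(n)
        for layer in range(depth):
            for i in range(n):
                c.rx(i, theta=params[layer, i])
            for i in range(n - 1):
                c.cnot(i, i + 1)
        return tc.backend.real(c.expectation_ps(z=[0]))
    return tc.backend.jit(f)

params = tc.backend.ones([depth, n])
for cls in [tc.Circuit, tc.MPSCircuit]:
    f = build(cls)
    t0 = time.time()
    f(params)  # first call = compilation + one run
    print(cls.__name__, "first call:", time.time() - t0, "s")
```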