Closed MichaelBroughton closed 3 years ago
This will definitely be a great feature especially in the context of variational circuit simulations where gradients can be obtained more quickly compared to finite difference or parameter shift (which is also somehow finite difference) when the number of variational params is large.
I also want to mention that backward AD in circuit case (reversible) can be implemented in a more memory efficient way, i.e. the memory required can be reduced to constant (width of circuit) compared to general adjoint methods where the memory consumption is proportional to the depth of network/graph/circuit. One can recompute the intermediate value in backward pass since the computation is reversible without keep them in forward pass. This can in principle enable AD of very deep circuit.
I also want to mention that backward AD in circuit case (reversible) can be implemented in a more memory efficient way, i.e. the memory required can be reduced to constant (width of circuit) compared to general adjoint methods where the memory consumption is proportional to the depth of network/graph/circuit. One can recompute the intermediate value in backward pass since the computation is reversible without keep them in forward pass. This can in principle enable AD of very deep circuit.
This sounds really cool! Would you mind elaborating on this method you are describing ? Suppose I have this circuit
(0, 0): ───H^alpha───@───
│
(0, 1): ───X^beta────X───
and I want to compute the gradient of <psi | X(0,0) * X(0, 1) | psi>
w.r.t alpha
and beta
where lets say alpha = 0.123
and beta = 0.456
. Would you mind walking through this example to help me understand what you have in mind and how it works to save memory and make things more efficient here ( I didn't exactly follow what you mentioned above ) ?
Hi, @MichaelBroughton , I have just written some short notes on this but still in an abstract fashion. I don't know whether it is clear enough or make sense to you. If you still have any questions, we can further discuss on that and maybe go through the detailed example you presented above.
You can also refer to Section 3 "Reversible Computing and Automatic Differentiation" of the Yao.jl paper . This is the key technique that allows Yao.jl to train a, say, 10000 layer variational circuit on a laptop.
Thanks @refraction-ray and @wangleiphy . I Did some more reading and spoke with @jaeyoo about this. Think we are almost ready to go ahead and try to implement this. Just wanted to confirm my understanding on something:
Suppose I have a cirq.PauliSum
with k
different cirq.PauliString
s (terms) in it, Who's overall matrix I cannot store in memory.
It is my understanding that if I have a circuit with N
parameters and I add one more parameter so I have N + 1
parameters, the cost of differentiation for that circuit will go up by roughly ~O(k)
more inner product calculations. This means that adjoint is at it's best when N >> k
and it is at it's worst when N << k
(This is assuming that the overall operator matrix cannot be directly applied to the state because it is too large to fit in memory and must be broken down). Is that right or am I missing something ?
@MichaelBroughton , if my understanding is right, the setup is a loss function as the sum of k observables for the same circuit state output $\ket \psi_N$? If so, $\frac{\partial L}{\partial \psi_N} = \sum_i^k \frac{\partial \langle O_i\rangle}{\partial \psi_n}$, and such value can be stored with only the space of size $\psi_N$. If this is the case, the back propagation can be accomplished from gradient $\frac{\partial L}{\partial \psi_N}$ to $\frac{\partial L}{\partial \theta_i}$ in one reverse+backward pass. Therefore, backpropagations are repeated k times only in the last step $\psi_N \rightarrow L_i$, and the backpropagation on the circuit part is still carried out once. I am not sure whether this is the point you are discussing, or if I omitted some points of space restrictions in the argument.
Oops, that was an initial misunderstanding of mine. Thank you for pointing it out @refraction-ray . We are now well on our way to finishing the adjoint differentiator :)
Very excited to see AD feature is about to be supported. I guess this might be the first quantum software supporting constant memory AD in python ecosystem (and even with a high performance cpp backend).
There are some features that are easy to add with such AD engine
The above features may not be so relevant with experiments, but they are very important in theoretical and numerical investigations to help us understand more aspects on quantum computing.
Very exciting! Once #365 gets in , the feature will be in tfq-nightly
(which now depends on tf.2.3.0 and not 2.1.0). Regarding your other two points:
I like the idea of adding this. If the logic is fairly similar to this current adjoint method then it would be relatively low effort to add and would really let people do almost whatever they want with compute graphs. I guess I'm not as worried about adding the ability to differentiate here, since it already is clearly not something people can do on a chip, initially I thought we shouldn't lean into supporting these kinds of features that only serve theory and don't have as much immediate experimental relevance, but I guess I see no harm in adding it if it is what people want. It doesn't interfere with the other ops that ARE actually runnable on real hardware which is nice.... What do you think @zaqqwerty ?
Higher order gradients isn't broken in TFQ, it never existed :P. This would be a more difficult feature to add I see we have an open issue for it #285 . At some point someone (maybe me) should look into roughly what sort of changes might be needed in order to incorporate this into all the differentiators in TFQ.
@refraction-ray @MichaelBroughton , Recently I am working on the higher order gradients in TFQ. I've studied that topic for a while, but it was not at the high priority for launch. Parameter-shift Hessians are easy to add, but for general case like n-th order gradients requires many things to reduce the time complexity because it has lots of duplications. Well, if we support just finite differencing for n-th order gradients, it may be easy, but this case is related to the 1st question like "direct differentiation on the wavefunction", which is only for simulation, not for quantum devices.
While not really viable for NiSQ circuits (outside of simulation), the request has been made a few times to implement the adjoint gradient calculation method for differentiation. This would roughly entail:
differentiate_analytic
.Going to leave this open for discussion for anyone who's interested.