tensorflow / quantum

Hybrid Quantum-Classical Machine Learning in TensorFlow
https://www.tensorflow.org/quantum
Apache License 2.0

Adjoint Differentiation #256

Closed MichaelBroughton closed 3 years ago

MichaelBroughton commented 4 years ago

While not really viable for NISQ circuits (outside of simulation), the request has been made a few times to implement the adjoint gradient calculation method for differentiation. This would roughly entail:

  1. Writing a new C++ op to calculate gradients via the adjoint method
  2. Writing a new differentiator that wraps this op and only defines the operation for differentiate_analytic (a rough skeleton is sketched below this list).
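
Roughly, step 2 might take the shape of the skeleton below. This is only a sketch: `adjoint_gradient_op` is a hypothetical stand-in for the C++ op from step 1, and the `Differentiator` method names and signatures here are my reading of the base class at the time and may not match it exactly.

```python
import tensorflow as tf
import tensorflow_quantum as tfq


def adjoint_gradient_op(programs, symbol_names, symbol_values, pauli_sums):
    # Placeholder for the new C++ op from step 1 (hypothetical name). It would
    # return d<pauli_sums>/d(symbol_values) with shape [batch, n_ops, n_params].
    raise NotImplementedError("to be provided by the C++ kernel")


class AdjointSketch(tfq.differentiators.Differentiator):
    """Adjoint differentiation: analytic (simulator-only) path from step 2."""

    def differentiate_analytic(self, programs, symbol_names, symbol_values,
                               pauli_sums, forward_pass_vals, grad):
        raw = adjoint_gradient_op(programs, symbol_names,
                                  symbol_values, pauli_sums)
        # Chain rule: contract the per-op gradients with the upstream gradient.
        return tf.einsum('bop,bo->bp', raw, grad)

    def differentiate_sampled(self, programs, symbol_names, symbol_values,
                              pauli_sums, num_samples, forward_pass_vals, grad):
        raise NotImplementedError(
            "Adjoint differentiation needs full state access; "
            "only the analytic path is defined.")
```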

Going to leave this open for discussion for anyone who's interested.

refraction-ray commented 4 years ago

This will definitely be a great feature, especially in the context of variational circuit simulation, where gradients can be obtained much more quickly than with finite differences or parameter shift (which is, in a sense, also a finite-difference method) when the number of variational parameters is large.

I also want to mention that backward AD in the circuit case (which is reversible) can be implemented in a more memory-efficient way: the memory required can be reduced to a constant (the width of the circuit), compared to general adjoint methods where memory consumption is proportional to the depth of the network/graph/circuit. Since the computation is reversible, the intermediate values can be recomputed in the backward pass rather than stored during the forward pass. This can in principle enable AD of very deep circuits.
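
To make the constant-memory bookkeeping concrete, here is a small numpy sketch of that backward pass. Gates are taken as U_k(theta_k) = expm(-i * theta_k / 2 * G_k) for Hermitian generators G_k (a simplifying convention, not cirq's exponent convention), and gate matrices are dense only for clarity; only a constant number of state vectors is kept, independent of depth.

```python
import numpy as np
from scipy.linalg import expm

def forward(psi0, generators, thetas):
    """Apply U_k(theta_k) = expm(-i * theta_k / 2 * G_k) in order."""
    psi = psi0
    for G, t in zip(generators, thetas):
        psi = expm(-0.5j * t * G) @ psi
    return psi

def adjoint_gradients(psi0, generators, thetas, observable):
    """d<observable>/d(theta_k) via one reverse sweep, recomputing states."""
    psi = forward(psi0, generators, thetas)       # |psi_N>, forward pass
    lam = observable @ psi                        # bra accumulator O|psi_N>
    grads = np.zeros(len(thetas))
    for k in reversed(range(len(thetas))):
        U = expm(-0.5j * thetas[k] * generators[k])
        psi = U.conj().T @ psi                    # recompute |psi_{k-1}> (reversibility)
        mu = (-0.5j) * generators[k] @ (U @ psi)  # (dU_k/dtheta_k) |psi_{k-1}>
        grads[k] = 2.0 * np.real(np.vdot(lam, mu))
        lam = U.conj().T @ lam                    # move the bra one gate back
    return grads

# Quick one-qubit check: RX(0.3) then RZ(0.7) on |0>, observable X.
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
print(adjoint_gradients(np.array([1, 0], dtype=complex), [X, Z],
                        np.array([0.3, 0.7]), X))
```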

MichaelBroughton commented 4 years ago

> I also want to mention that backward AD in the circuit case (which is reversible) can be implemented in a more memory-efficient way: the memory required can be reduced to a constant (the width of the circuit), compared to general adjoint methods where memory consumption is proportional to the depth of the network/graph/circuit. Since the computation is reversible, the intermediate values can be recomputed in the backward pass rather than stored during the forward pass. This can in principle enable AD of very deep circuits.

This sounds really cool! Would you mind elaborating on the method you are describing? Suppose I have this circuit:

(0, 0): ───H^alpha───@───
                     │
(0, 1): ───X^beta────X───

and I want to compute the gradient of <psi| X(0,0) * X(0,1) |psi> w.r.t. alpha and beta, where, say, alpha = 0.123 and beta = 0.456. Would you mind walking through this example to help me understand what you have in mind, and how it saves memory and makes things more efficient here? (I didn't exactly follow what you described above.)
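
For concreteness, one way to pin down the target numbers for this example is a plain finite-difference check using cirq's own simulator (not the adjoint method, just a baseline any implementation should reproduce). Attribute names such as `final_state_vector` and `expectation_from_state_vector` depend on the cirq version.

```python
import cirq
import sympy

q0, q1 = cirq.GridQubit(0, 0), cirq.GridQubit(0, 1)
alpha, beta = sympy.symbols('alpha beta')
circuit = cirq.Circuit([
    cirq.H(q0) ** alpha,
    cirq.X(q1) ** beta,
    cirq.CNOT(q0, q1),
])
obs = cirq.X(q0) * cirq.X(q1)   # the PauliString X(0,0) * X(0,1)

def expectation(a, b):
    resolver = cirq.ParamResolver({'alpha': a, 'beta': b})
    state = cirq.Simulator().simulate(circuit, resolver).final_state_vector
    return obs.expectation_from_state_vector(
        state, qubit_map={q0: 0, q1: 1}).real

a, b, eps = 0.123, 0.456, 1e-4
d_alpha = (expectation(a + eps, b) - expectation(a - eps, b)) / (2 * eps)
d_beta = (expectation(a, b + eps) - expectation(a, b - eps)) / (2 * eps)
print(d_alpha, d_beta)   # values an adjoint implementation should reproduce
```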

refraction-ray commented 4 years ago

Hi @MichaelBroughton, I have just written some short notes on this, though still in an abstract fashion. I don't know whether they are clear enough or make sense to you. If you still have questions, we can discuss further and maybe go through the detailed example you presented above.

[Two screenshots attached: refraction-ray's notes on reversible backward-mode AD, dated 2020-07-09.]

wangleiphy commented 4 years ago

You can also refer to Section 3, "Reversible Computing and Automatic Differentiation", of the Yao.jl paper. This is the key technique that allows Yao.jl to train, say, a 10,000-layer variational circuit on a laptop.

MichaelBroughton commented 3 years ago

Thanks @refraction-ray and @wangleiphy. I did some more reading and spoke with @jaeyoo about this. I think we are almost ready to go ahead and try to implement this. I just wanted to confirm my understanding of something:

Suppose I have a cirq.PauliSum with k different cirq.PauliStrings (terms) in it, whose overall matrix I cannot store in memory.

It is my understanding that if I have a circuit with N parameters and I add one more parameter, so that I have N + 1 parameters, the cost of differentiation for that circuit will go up by roughly ~O(k) more inner product calculations. This means that adjoint is at its best when N >> k and at its worst when N << k (assuming that the overall operator matrix cannot be applied to the state directly because it is too large to fit in memory and must be broken down). Is that right, or am I missing something?

refraction-ray commented 3 years ago

@MichaelBroughton, if my understanding is right, the setup is a loss function given by the sum of k observables evaluated on the same circuit output state $|\psi_N\rangle$? If so, $\frac{\partial L}{\partial \psi_N} = \sum_{i=1}^k \frac{\partial \langle O_i\rangle}{\partial \psi_N}$, and this value can be stored using only the space of a single state vector of the same size as $|\psi_N\rangle$. In that case, backpropagation from $\frac{\partial L}{\partial \psi_N}$ to $\frac{\partial L}{\partial \theta_i}$ can be accomplished in one reverse+backward pass. Therefore, the backpropagation is repeated k times only in the last step $\psi_N \rightarrow L_i$, while the backpropagation through the circuit part is still carried out once. I am not sure whether this is the point you are discussing, or whether I have omitted some space restrictions in the argument.
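
In the dense-matrix toy setting of the earlier numpy sketch, the accumulation reads as below: the k terms touch only the seeding of the bra, never the backward sweep over gates, and the full PauliSum matrix is never formed (in a real simulator the terms would be applied as sparse Pauli actions rather than dense matrices).

```python
import numpy as np
from scipy.linalg import expm

def adjoint_gradients_paulisum(psi0, generators, thetas, term_matrices):
    """Gradients of sum_i <O_i> with one backward sweep over the circuit."""
    psi = psi0
    for G, t in zip(generators, thetas):              # forward pass
        psi = expm(-0.5j * t * G) @ psi
    lam = sum(T @ psi for T in term_matrices)         # sum_i O_i|psi_N>: k term applications
    grads = np.zeros(len(thetas))
    for k in reversed(range(len(thetas))):            # circuit backward sweep runs once
        U = expm(-0.5j * thetas[k] * generators[k])
        psi = U.conj().T @ psi                        # recompute |psi_{k-1}>
        mu = (-0.5j) * generators[k] @ (U @ psi)      # (dU_k/dtheta_k)|psi_{k-1}>
        grads[k] = 2.0 * np.real(np.vdot(lam, mu))
        lam = U.conj().T @ lam
    return grads
```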

MichaelBroughton commented 3 years ago

Oops, that was an initial misunderstanding of mine. Thank you for pointing it out, @refraction-ray. We are now well on our way to finishing the adjoint differentiator :)

refraction-ray commented 3 years ago

Very excited to see that the AD feature is about to be supported. I guess this might be the first quantum software supporting constant-memory AD in the Python ecosystem (and with a high-performance C++ backend, no less).

There are some features that would be easy to add with such an AD engine:

  1. Gradient support on wavefunction amplitudes: currently the wavefunction is not directly differentiable in tfq, but supporting this level of abstraction has many benefits: one can define more flexible AD-aware objectives beyond expectation values.
  2. Higher-order gradient support: currently higher-order gradients are broken with the parameter-shift differentiator; it would be nice to have this feature at least for the AD differentiator (a plain-TensorFlow sketch of what this would mean from the user side follows below).

The above features may not be so relevant to experiments, but they are very important in theoretical and numerical investigations, helping us understand more aspects of quantum computing.
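
For concreteness on point 2, this is what higher-order gradients usually look like from the user side in plain TensorFlow, with a stand-in function in place of the circuit expectation. Nothing here is TFQ API; the tfq expectation op would take `f`'s place once an AD differentiator supports second derivatives.

```python
import tensorflow as tf

theta = tf.Variable([0.123, 0.456])

def f(t):                       # stand-in for a circuit expectation value
    return tf.cos(t[0]) * tf.sin(t[1])

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        value = f(theta)
    grad = inner.gradient(value, theta)   # first-order gradient
hessian = outer.jacobian(grad, theta)     # second-order via the outer tape
print(grad.numpy(), hessian.numpy())
```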

MichaelBroughton commented 3 years ago

Very exciting! Once #365 gets in, the feature will be in tfq-nightly (which now depends on TF 2.3.0 rather than 2.1.0); a rough usage sketch follows after the two points below. Regarding your other two points:

  1. I like the idea of adding this. If the logic is fairly similar to the current adjoint method, it would be relatively low effort to add and would really let people do almost whatever they want with compute graphs. I'm not too worried about adding the ability to differentiate here, since it is clearly not something people can do on a chip. Initially I thought we shouldn't lean into supporting features that only serve theory and don't have much immediate experimental relevance, but I see no harm in adding it if it is what people want. It doesn't interfere with the other ops that ARE actually runnable on real hardware, which is nice. What do you think @zaqqwerty?

  2. Higher-order gradients aren't broken in TFQ; they never existed :P. This would be a more difficult feature to add; I see we have an open issue for it, #285. At some point someone (maybe me) should look into roughly what sort of changes would be needed to incorporate this into all the differentiators in TFQ.
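
Coming back to the nightly mentioned above, here is a rough usage sketch of the adjoint differentiator once it lands, assuming the tfq.differentiators.Adjoint name and the standard Expectation layer call signature; check the released docs for the exact names.

```python
import cirq
import sympy
import tensorflow as tf
import tensorflow_quantum as tfq

q0, q1 = cirq.GridQubit.rect(1, 2)
alpha, beta = sympy.symbols('alpha beta')
circuit = cirq.Circuit([cirq.H(q0) ** alpha,
                        cirq.X(q1) ** beta,
                        cirq.CNOT(q0, q1)])

# Expectation layer wired to the adjoint differentiator (simulator only).
expectation = tfq.layers.Expectation(
    differentiator=tfq.differentiators.Adjoint())
values = tf.Variable([[0.123, 0.456]])

with tf.GradientTape() as tape:
    out = expectation(tfq.convert_to_tensor([circuit]),
                      symbol_names=['alpha', 'beta'],
                      symbol_values=values,
                      operators=[cirq.X(q0) * cirq.X(q1)])
print(tape.gradient(out, values))   # adjoint gradients w.r.t. alpha, beta
```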

jaeyoo commented 3 years ago

@refraction-ray @MichaelBroughton, I have recently been working on higher-order gradients in TFQ. I studied the topic for a while, but it was not a high priority for launch. Parameter-shift Hessians are easy to add, but the general case of n-th order gradients requires a lot of work to reduce the time complexity, because it involves many duplicated evaluations. If we support just finite differencing for n-th order gradients, it may be easy, but that case is related to the first question, "direct differentiation on the wavefunction", which is only for simulation, not for quantum devices.
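
For reference, a minimal sketch of the finite-difference fallback mentioned above: one Hessian entry via central differences, with `expectation(thetas)` standing in for a hypothetical analytic expectation evaluation of the circuit (simulation only). The four evaluations per entry are exactly the kind of duplication that a smarter scheme would need to reduce.

```python
import numpy as np

def hessian_entry(expectation, thetas, i, j, eps=1e-3):
    """d^2 <O> / (d theta_i d theta_j) via central differences (4 evaluations)."""
    t = np.array(thetas, dtype=float)
    def shifted(di, dj):
        s = t.copy()
        s[i] += di
        s[j] += dj
        return expectation(s)
    return (shifted(eps, eps) - shifted(eps, -eps)
            - shifted(-eps, eps) + shifted(-eps, -eps)) / (4 * eps ** 2)
```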