stale gradients problem

If I understand correctly, there may be a subtle problem in how gradients are applied to FPN's trainable variables.

When optimizing FPN, the gradients w.r.t. FPN's trainable variables are applied in two separate stages: first dW1 (from the 1-Wasserstein loss), then the gradient of the entropy term. After the first update, the trainable variables have already changed. In other words, the entropy gradient is computed from the old variable values but applied to the new ones. I'm not sure, but is this the so-called stale gradients problem?
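To make the concern concrete, here is a minimal NumPy sketch of the two update orders. It is not the FQF code: `grad_a` and `grad_b` are hypothetical quadratic stand-ins for dW1 and the entropy gradient, and plain SGD with a fixed learning rate is assumed.

```python
import numpy as np

# Hypothetical stand-ins (NOT the FQF losses): two simple quadratic objectives.
def grad_a(w):
    """Gradient of (w - 1)^2, standing in for dW1."""
    return 2.0 * (w - 1.0)

def grad_b(w):
    """Gradient of (w + 1)^2, standing in for the entropy gradient."""
    return 2.0 * (w + 1.0)

def two_stage_stale(w, lr):
    """Compute both gradients at the old w, then apply them one after another."""
    ga = grad_a(w)
    gb = grad_b(w)      # evaluated at the OLD w
    w = w - lr * ga     # first update changes w
    w = w - lr * gb     # second update applies the old gradient to the new w
    return w

def two_stage_fresh(w, lr):
    """Recompute the second gradient after the first update."""
    w = w - lr * grad_a(w)
    w = w - lr * grad_b(w)  # evaluated at the UPDATED w
    return w

def joint(w, lr):
    """Single update on the sum of both gradients."""
    return w - lr * (grad_a(w) + grad_b(w))

w0 = np.array([0.5])
print(two_stage_stale(w0, 0.1), two_stage_fresh(w0, 0.1), joint(w0, 0.1))
```

In this toy case the "stale" two-stage update coincides with a single joint SGD step, because both gradients are taken at the same point and the steps simply add up; it differs from recomputing the second gradient after the first update. With an optimizer that keeps per-variable state (momentum, RMSProp, Adam), the two-stage and joint updates would generally no longer match, so which case applies depends on the optimizer actually used for FPN.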

Looking forward to your response.

microsoft / FQF