As I understand it, the Fisher weight of a parameter w is the squared gradient (d/dw log x)^2, where x is some per-sample quantity. (Let me know if this is incorrect.)
In my own implementations I plug the loss term in for x, but in other implementations I see the model's prediction used instead. I can't remember why I chose the loss, though I suspect it was because I needed a value > 0, which the raw predictions of a Q-network would not guarantee.
Should the Fisher matrix be based on the raw output rather than the error? In practice it seems to work when based on the error, which makes sense, though I suspect basing it on the output would work better.
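To make the distinction concrete, here is a minimal NumPy sketch (a toy one-parameter logistic model; all names and values are made up for illustration). The standard definition averages squared gradients of the per-sample log-likelihood, d/dw log p(y|x, w). The "true" Fisher samples labels from the model's own predictive distribution, while the "empirical" Fisher (gradient of the loss at the observed labels) uses the data's labels; these only coincide when the model matches the data-generating distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-parameter model: p(y=1 | x, w) = sigmoid(w * x).
w = 0.7
x = rng.normal(size=1000)
p = sigmoid(w * x)

# For a Bernoulli likelihood, d/dw log p(y | x, w) = (y - p) * x,
# so each sample's Fisher contribution is ((y - p) * x) ** 2.

# True Fisher diagonal: labels drawn from the model's own distribution.
y_model = (rng.random(1000) < p).astype(float)
fisher_true = np.mean(((y_model - p) * x) ** 2)

# Empirical Fisher diagonal: observed labels (here generated by a
# different, hypothetical "true" weight of 1.5, so the model is wrong).
y_obs = (rng.random(1000) < sigmoid(1.5 * x)).astype(float)
fisher_emp = np.mean(((y_obs - p) * x) ** 2)

print(fisher_true, fisher_emp)
```

Note that both quantities are sums of squares, so they are automatically non-negative; the positivity concern only arises if you try to take the log of a raw output (such as an unbounded Q-value) rather than of a proper likelihood.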