If we are reusing weights in a linear layer, can we use the same approximation to compute the covariances, or are there some subtleties?
For example, if the weights W are used 4 times per forward pass, can we compute \Omega as (1/(4M)) A A^T, where M is the batch size and A stacks all 4M input activation vectors?
Deriving from the definition of a Fisher block and assuming spatially uncorrelated derivatives seems to land you in the same place as the convolutional approximation.
The type of approximation is essentially the same. However, you can't use them interchangeably in code, since convolutions involve special ops and also "reuse" the different patches of their input vectors.
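A minimal sketch of the estimate described in the question, assuming the weight is shared across T = 4 application sites and that A simply stacks all T*M input activation vectors as columns (the shapes M, T, d and the variable names here are illustrative, not from any particular codebase):

```python
import numpy as np

# Hypothetical setup: a linear layer whose weight is applied at T = 4 sites
# per example, each site seeing a d-dimensional input activation; batch size M.
M, T, d = 8, 4, 5
rng = np.random.default_rng(0)

# Stack all T*M activation vectors as columns of A, treating each reuse
# of the weight as an extra "batch" element.
A = rng.standard_normal((d, T * M))

# Covariance estimate averaged over both the batch and the reuse sites:
# \Omega = (1/(T M)) A A^T, i.e. (1/(4M)) A A^T when T = 4.
omega = (A @ A.T) / (T * M)

print(omega.shape)  # (d, d)
```

This mirrors how the convolutional approximation treats spatial locations: patches are folded into the batch dimension before forming the covariance, so the only real difference for a shared linear layer is that no patch-extraction op is needed.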