The similarity and consistency losses (as written in Liang+2023) assume that the latents typically have amplitudes of order 1. This is not guaranteed by the fidelity training, and if it doesn't hold, it will screw up the extended training procedure by pushing the sigmoids into their flat regime.
This can be fixed by adding rescaling terms that are computed from the typical latent space amplitude:
The first RHS terms should have a prefactor $1/(\sigma_s^2 S)$ instead of $1/S$, in the same way as $\sigma_s$ already enters the consistency loss in Liang+23.
This ensures that these terms are all of order 1 and thus remain in the active parts of the sigmoids.
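To make the prefactor change concrete, here is a minimal sketch of one such rescaled sigmoid term. The function name and the schematic form $\mathrm{sigmoid}\!\left(\lVert s - s'\rVert^2/(\sigma_s^2 S)\right)$ are illustrative assumptions, not the exact loss expressions from Liang+2023:

```python
import torch

def rescaled_sigmoid_term(s, s_prime, sigma_s=0.1):
    """Sketch of a rescaled sigmoid argument (hypothetical form).

    Replaces the 1/S prefactor with 1/(sigma_s**2 * S) so the argument
    is O(1) when ||s - s'|| is of order sigma_s * sqrt(S).
    """
    S = s.shape[-1]  # latent dimensionality
    x = ((s - s_prime) ** 2).sum(dim=-1) / (sigma_s ** 2 * S)
    # with x of order 1, the sigmoid stays in its active (non-flat) regime
    return torch.sigmoid(x)
```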
In Liang+23, we set $\sigma_s=0.1$ to set the target size of the consistency loss. It's better to make both of these rescaling terms dynamic, i.e. measure the typical value of $\lVert s\rVert$ across the data set and update it during training to account for any shrinking or expansion (see the sketch after the list below).
This also has the advantage of
- preventing the latent distribution collapse from the consistency term, because overall shrinkage does not improve $L_c$,
- making it easier for the autoencoder to achieve redshift invariance by removing the latent shrinking from the consistency term.
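A minimal sketch of such a dynamic estimate, assuming an exponential moving average of the RMS latent amplitude over batches (the class name, update rule, and hyperparameters are hypothetical; any robust running statistic of $\lVert s\rVert$ would serve the same purpose):

```python
import torch

class LatentScaleTracker:
    """Running estimate of the typical latent amplitude (assumed EMA form)."""

    def __init__(self, sigma_init=0.1, momentum=0.01):
        self.sigma_s = sigma_init
        self.momentum = momentum

    @torch.no_grad()
    def update(self, s):
        # RMS amplitude per latent component, averaged over the batch
        batch_sigma = s.pow(2).mean().sqrt().item()
        # smooth update so the estimate follows slow shrinking or expansion
        self.sigma_s = (1 - self.momentum) * self.sigma_s + self.momentum * batch_sigma
        return self.sigma_s
```

The returned `sigma_s` would then replace the fixed value in the $1/(\sigma_s^2 S)$ prefactor above, so the rescaling tracks the current spread of the latent distribution rather than a hard-coded target.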