The robust encoding we've been trying out is hard to train because the mapping from CNN-reduced features to the latents is pre-determined. This might not work as intended because we don't allow the attention heads to choose the features they need for a good estimate. In principle we want something else anyway: each attention head should work with as many features as possible, but change their weights as a function of redshift to compensate for some of them getting redshifted out of the observing window. I suggest we solve this more directly.
Instead of multiple heads, each restricted to a subset of the CNN features, we add a small MLP that maps the 256 CNN features + redshift to the latent space. In other words, we make the final compression conditional on redshift. This mimics what a custom redshifting code would do, namely choose which features can be used at any given redshift. Since we already provide the decoder with the redshift, this adds no extra requirements to the autoencoder, but gives the encoder much more guidance.
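To make the proposal concrete, here's a minimal sketch of that final compression stage (assuming PyTorch; the class name, hidden width, and latent size are placeholders, not settled choices):

```python
import torch
import torch.nn as nn

class RedshiftConditionedEncoder(nn.Module):
    """Final compression: CNN features + redshift -> latents.

    Minimal sketch; n_features=256 matches the CNN output above,
    n_latent and n_hidden are arbitrary placeholders.
    """
    def __init__(self, n_features=256, n_latent=10, n_hidden=64):
        super().__init__()
        # small MLP acting on the concatenation of CNN features and redshift
        self.mlp = nn.Sequential(
            nn.Linear(n_features + 1, n_hidden),
            nn.PReLU(),
            nn.Linear(n_hidden, n_latent),
        )

    def forward(self, features, z):
        # features: (batch, n_features) from the CNN; z: (batch,) redshifts
        h = torch.cat([features, z.unsqueeze(-1)], dim=-1)
        return self.mlp(h)

# usage: s = encoder(cnn_features, z)
# the decoder already receives z, so nothing changes on that side
```

Because z enters the MLP directly, the network can learn to down-weight features that have been redshifted out of the observing window, which is exactly the behavior we wanted the attention heads to discover on their own.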