Robust latents and FP16

This is a cumulative PR comprising:

1. A robust estimation of latent features: the 256 CNN features are split into K attention heads, each of which estimates the same n_latent features. The K estimates are then combined by a softmedoid function to provide outlier control. This is necessary to learn encodings that can tolerate some spectral features being obscured by noise or redshifted out of the observed window.
2. Memory savings from automatic mixed precision with the FP16 datatype (so far only used in the fp16_train.py script).
3. A further extension of 1. that uses the same 256 features multiple times in different attention heads. This is meant to prevent an important observed spectral feature from being "masked" because another feature in the same head is absent or corrupted.
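
To illustrate the outlier control in 1., here is a minimal NumPy sketch of one common soft-medoid formulation: each head's estimate is weighted by a softmax over its negative summed distance to all other heads, so an estimate that disagrees with the rest gets almost no weight. The function signature, the `temperature` parameter, and the Euclidean distance are assumptions made for illustration; the softmedoid in this PR may be defined differently.

```python
import numpy as np


def softmedoid(heads, temperature=1.0):
    """Combine K per-head latent estimates into one robust estimate.

    heads: array of shape (K, n_latent), one estimate per attention head.
    Returns the combined latent vector (n_latent,) and the head weights (K,).
    Illustrative sketch only, not the implementation from this PR.
    """
    heads = np.asarray(heads, dtype=float)
    # Pairwise Euclidean distances between head estimates, shape (K, K).
    diff = heads[:, None, :] - heads[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Summed distance of each head to all other heads; large for outliers.
    total = dist.sum(axis=1)
    # Numerically stable softmax over the negative summed distances.
    logits = -total / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()
    # Weighted average of the head estimates.
    return weights @ heads, weights


# Three heads agree; the fourth is corrupted (e.g. its features were
# obscured by noise or redshifted out of the observed window).
heads = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [0.9, 1.9],
    [10.0, -5.0],
])
latent, weights = softmedoid(heads)
mean_latent = heads.mean(axis=0)  # plain average, pulled toward the outlier
```

In this toy case the plain mean is dragged far from the three consistent heads, while the soft-medoid result stays close to them. A lower `temperature` approaches a hard medoid (a single head is selected); a higher one approaches the plain mean.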

pmelchior / spender