I see that the MIDI Autoencoder encodes the F0 series, loudness series, harmonic distribution, and filtered noise down to a MIDI representation, then decodes it back to an F0 series and loudness series. I understand that the goal is to use the MIDI decoder at inference time, when all you have is MIDI information and you need detailed F0/loudness series.
What I don't understand is how the MIDI Autoencoder is constrained to actually produce MIDI information in its latent space during training. I see that the encoder is forced to bottleneck its input down to a representation with the correct dimensionality for MIDI, and I see how the MIDI decoder maps that representation back to the inputs that DDSP resynthesis requires. However, I don't see what constrains the model to actually produce MIDI information in the middle. As far as I can tell, it could produce any latent representation of that dimensionality.
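To make my question concrete, here is a minimal NumPy sketch of how I currently picture the setup. All shapes and names here are my own illustrative assumptions, not the actual model: the point is only that a bottleneck of the right dimensionality does not by itself force the latent to be MIDI.

```python
import numpy as np

# Hypothetical shapes: 1000 frames; inputs = f0 + loudness + 60 harmonics + 65 noise bands.
T, n_in = 1000, 1 + 1 + 60 + 65
n_midi = 2   # e.g. (pitch, velocity) per frame -- the "MIDI-shaped" bottleneck
n_out = 2    # decoder reconstructs (f0, loudness)

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(n_in, n_midi))   # stand-in for the encoder network
W_dec = rng.normal(size=(n_midi, n_out))  # stand-in for the MIDI decoder

x = rng.normal(size=(T, n_in))  # synthesis features for one clip
z = x @ W_enc                   # latent has MIDI dimensionality...
y = z @ W_dec                   # ...and decodes back to f0/loudness

# Nothing in this picture forces z to contain pitch/velocity values:
# any (T, 2) code that lets the decoder reconstruct y would minimize
# a pure reconstruction loss equally well.
print(z.shape, y.shape)
```

So my question is really: what additional loss, supervision, or discretization (if any) pins the bottleneck to an actual MIDI representation?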
Let me know if I can clarify in any way.