I see that the MIDI Autoencoder encodes the F0 series, loudness series, harmonic distribution, and filtered noise down to a MIDI representation, then decodes it back to an F0 series and loudness series. I understand that the goal is to use the MIDI decoder at inference time, when all you have is MIDI information and you need detailed F0/loudness series.
What I don't understand is how the MIDI Autoencoder is constrained to actually produce MIDI information in its latent space during training. I see that the encoder is forced to bottleneck its input down to a representation with the correct dimensionality for MIDI, and I see how the MIDI decoder maps that representation back to the inputs that DDSP resynthesis requires. However, I don't see what constrains the model to actually produce MIDI information in the middle. As far as I can tell, it could produce any latent representation of that dimensionality.
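To make my question concrete, here is a minimal NumPy sketch of how I currently picture the setup. All shapes and names here are my own illustrative assumptions, not the actual model: the point is only that a bottleneck of the right dimensionality does not by itself force the latent to be MIDI.

```python
import numpy as np

# Hypothetical shapes: 1000 frames; inputs = f0 + loudness + 60 harmonics + 65 noise bands.
T, n_in = 1000, 1 + 1 + 60 + 65
n_midi = 2   # e.g. (pitch, velocity) per frame -- the "MIDI-shaped" bottleneck
n_out = 2    # decoder reconstructs (f0, loudness)

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(n_in, n_midi))   # stand-in for the encoder network
W_dec = rng.normal(size=(n_midi, n_out))  # stand-in for the MIDI decoder

x = rng.normal(size=(T, n_in))  # synthesis features for one clip
z = x @ W_enc                   # latent has MIDI dimensionality...
y = z @ W_dec                   # ...and decodes back to f0/loudness

# Nothing in this picture forces z to contain pitch/velocity values:
# any (T, 2) code that lets the decoder reconstruct y would minimize
# a pure reconstruction loss equally well.
print(z.shape, y.shape)
```

So my question is really: what additional loss, supervision, or discretization (if any) pins the bottleneck to an actual MIDI representation?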
Let me know if I can clarify in any way.