zaptrem opened this issue 2 months ago
Hello! I initially tried using a Vocos-like autoencoder, but my first attempts weren't quite successful. So I switched back to a strong (time-domain) baseline to experiment with different VQ methods.
But recently I experimented with a new encoder (STFT + DWConv) that makes it super fast to tokenize audio. And it works! Now, I have to try training just a Vocos decoder (with the encoder and codebook frozen). If this is successful I might push it to this repo.
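For anyone curious what an "STFT + depthwise conv" front end could look like, here is a rough numpy sketch. Every name, kernel size, and hop length below is my own illustration, not the actual implementation:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Frame the signal, apply a Hann window, take the real FFT per frame.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)          # shape: (frames, n_fft//2 + 1)

def depthwise_conv(feats, kernels):
    # Depthwise = one small 1-D kernel per frequency channel,
    # applied only along the time axis (no cross-channel mixing).
    out = np.empty_like(feats)
    for c in range(feats.shape[1]):
        out[:, c] = np.convolve(feats[:, c], kernels[c], mode="same")
    return out

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)               # 1 s of noise at 16 kHz
spec = np.abs(stft(audio))                       # magnitude spectrogram (122, 257)
kernels = rng.standard_normal((spec.shape[1], 3)) / 3.0
latents = depthwise_conv(spec, kernels)
print(latents.shape)                             # → (122, 257)
```

The speed win would come from the STFT doing the heavy downsampling in one cheap FFT step, so the convolutions that follow operate on far fewer time steps than a time-domain encoder would.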
I believe it makes sense to train these models in two stages. I remember when I was training Vocos on EnCodec tokens, I had to keep the embeddings frozen to achieve the highest quality. My intuition is that discriminator gradients might harm codebook learning. However, somehow the time-domain models seem easier to train end-to-end. Or maybe I'm just too lazy to do a hyperparameter search.
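The second-stage setup described above (decoder-only training) boils down to freezing the encoder and codebook parameters so the discriminator gradients never reach them. A minimal PyTorch sketch, with toy stand-in modules in place of the real architecture:

```python
import torch
from torch import nn

# Toy stand-ins for the real modules; shapes and names are illustrative only.
encoder  = nn.Conv1d(1, 8, kernel_size=4, stride=2)
codebook = nn.Embedding(256, 8)
decoder  = nn.ConvTranspose1d(8, 1, kernel_size=4, stride=2)

# Stage 2: freeze the encoder and codebook, train only the decoder.
for module in (encoder, codebook):
    for p in module.parameters():
        p.requires_grad = False

# Give the optimizer only the parameters that still require gradients,
# so adversarial losses can only update the decoder.
optimizer = torch.optim.AdamW(
    (p for p in decoder.parameters() if p.requires_grad), lr=2e-4
)
```

With this split, the codebook learned in stage 1 stays fixed, which matches the observation that discriminator gradients flowing into the embeddings can hurt quality.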
Thanks! Do you think that in EnCodec/DAC the discriminator gradients push the encoder to encode something useful that it otherwise wouldn't under a pure mel-loss objective? Also, have you had any luck getting your new STFT encoder or Vocos to work natively with stereo outputs? I've been tweaking the (Vocos) architecture but haven't been able to get it to outperform DAC.
Hello, why did you switch back to building on DAC-style waveform outputs for SNAC after showing it's possible to completely get rid of aliasing by generating complex spectrograms with Vocos? I'm also curious about the answers to the questions posed here. My best guess for the NoiseBlock is that it reduces high-frequency artifacts in staticky sounds like cymbals?
Thanks!