hubertsiuzdak / snac

Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate
https://hubertsiuzdak.github.io/snac/
MIT License

Architectural questions (SNAC vs Vocos) #13

Open zaptrem opened 2 months ago

zaptrem commented 2 months ago

Hello, why did you switch back to building on DAC-style waveform outputs for SNAC, after showing with Vocos that it's possible to completely get rid of aliasing by generating complex spectrograms? I'm also curious about the answer to the questions posed here. My best guess for the NoiseBlock is that it reduces high-frequency artifacts on staticky sounds like cymbals?

Thanks!
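
For context, a noise-injection block along the lines guessed at above could take the following shape. This is only a sketch of the idea, not SNAC's confirmed implementation; the layer choice and shapes are assumptions.

```python
import torch
import torch.nn as nn

class NoiseBlock(nn.Module):
    """Sketch: add input-conditioned white noise to the activations.

    A learned 1x1 conv predicts a per-channel, per-timestep gain that
    scales random noise before it is added back to the signal. The idea
    is to give the decoder a stochastic component for noisy content
    (e.g. cymbals) that a purely deterministic path tends to smear.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.gain = nn.Conv1d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, t = x.shape
        noise = torch.randn(b, 1, t, device=x.device, dtype=x.dtype)
        return x + self.gain(x) * noise  # noise broadcasts across channels
```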

hubertsiuzdak commented 2 months ago

Hello! I initially tried using a Vocos-like autoencoder, but my first attempts weren't quite successful. So I switched back to a strong (time-domain) baseline to experiment with different VQ methods.

But recently I experimented with a new encoder (STFT + DWConv) that makes it super fast to tokenize audio. And it works! Now I have to try training just a Vocos decoder (with the encoder and codebook frozen). If this is successful, I might push it to this repo.
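
To make that concrete, here is a minimal sketch of an STFT + depthwise-convolution encoder front end. All hyperparameters and layer choices are illustrative assumptions, not the actual architecture; the point is that the STFT does the downsampling in one cheap step, so only a light conv stack is needed on top.

```python
import torch
import torch.nn as nn

class STFTEncoder(nn.Module):
    """Illustrative STFT + depthwise-conv (DWConv) tokenizer front end."""

    def __init__(self, n_fft: int = 1024, hop: int = 256, dim: int = 512):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # embed concatenated real+imag STFT bins into the model dimension
        self.proj = nn.Conv1d(n_fft + 2, dim, kernel_size=1)
        self.block = nn.Sequential(
            # depthwise conv: one filter per channel (groups=dim), then pointwise mixing
            nn.Conv1d(dim, dim, kernel_size=7, padding=3, groups=dim),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=1),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:  # audio: (B, T)
        window = torch.hann_window(self.n_fft, device=audio.device)
        spec = torch.stft(audio, self.n_fft, self.hop, window=window,
                          return_complex=True)            # (B, n_fft//2 + 1, frames)
        feats = torch.cat([spec.real, spec.imag], dim=1)  # (B, n_fft + 2, frames)
        x = self.proj(feats)
        return x + self.block(x)  # residual block -> latents to be quantized
```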

I believe it makes sense to train these models in two stages. I remember when I was training Vocos on EnCodec tokens, I had to keep the embeddings frozen to achieve the highest quality. My intuition is that discriminator gradients might harm codebook learning. However, the time-domain models somehow seem easier to train end-to-end. Or maybe I'm just too lazy to do a hyperparameter search.
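
In code, the second stage amounts to something like the loop below. This is a sketch under assumed names: a `codec` with `.encoder` / `.quantizer` / `.decoder` submodules, plus placeholder `mel_loss`, `adv_loss`, and `dataloader`; none of these are confirmed APIs of this repo.

```python
import torch

# Stage two: freeze the encoder and codebook and train only the decoder,
# so discriminator gradients never touch the embeddings.
for module in (codec.encoder, codec.quantizer):
    module.requires_grad_(False)
    module.eval()

opt = torch.optim.AdamW(codec.decoder.parameters(), lr=1e-4)

for audio in dataloader:
    with torch.no_grad():                  # tokens/embeddings stay fixed
        z = codec.quantizer(codec.encoder(audio))
    recon = codec.decoder(z)
    loss = mel_loss(recon, audio) + adv_loss(recon)  # placeholder losses
    opt.zero_grad()
    loss.backward()
    opt.step()
```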

zaptrem commented 1 month ago

Thanks! Do you think that in EnCodec/DAC the discriminator gradients push the encoder to capture something useful that it otherwise wouldn't under a pure mel-loss objective? Also, have you had any luck getting your new STFT encoder or Vocos to work natively with stereo outputs? I've been experimenting with the (Vocos) architecture but haven't been able to get it to outperform DAC.