jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License
830 stars · 46 forks

probability density for each index in the codebook #41

Open goforher opened 1 month ago

goforher commented 1 month ago

In Section 3.2 of the paper, you present the probability density for each index in the codebook. Could you explain how this was computed? Also, was this done with the 40 tokens/s or the 75 tokens/s model? In the experiments section, you mention: "We train WavTokenizer-small up to 1 million iterations, with 500 thousand iterations allocated to both the generator and discriminator on 16 NVIDIA A800 80G GPUs." Could you share the GPU utilization during training? If I were to train on a single 40GB GPU, what would the estimated training time be?

jishengpeng commented 1 month ago

> In Section 3.2 of the paper, you present the probability density for each index in the codebook. Could you explain how this was computed? Also, was this done with the 40 tokens/s or the 75 tokens/s model? In the experiments section, you mention: "We train WavTokenizer-small up to 1 million iterations, with 500 thousand iterations allocated to both the generator and discriminator on 16 NVIDIA A800 80G GPUs." Could you share the GPU utilization during training? If I were to train on a single 40GB GPU, what would the estimated training time be?

For the 75 tokens/s version, we counted how often each codebook index was selected when encoding the corpus and normalized these counts to obtain the probability density. GPU utilization during training was close to 100%, and training the LibriTTS version on a single 40GB GPU would likely take several weeks.
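
For reference, a minimal sketch of the frequency-counting step described above, in Python with PyTorch. The `codebook_size` of 4096 and the way token indices are collected are illustrative assumptions, not the repository's actual API:

```python
import torch

def codebook_density(token_ids: torch.Tensor, codebook_size: int) -> torch.Tensor:
    """Normalized usage frequency of each codebook index.

    token_ids: discrete indices gathered by encoding a corpus with the
    codec (any shape); codebook_size: number of entries in the codebook
    (4096 here is an assumed value for illustration).
    """
    counts = torch.bincount(token_ids.flatten(), minlength=codebook_size)
    return counts.float() / counts.sum()  # counts -> probability mass

# Toy usage with random indices; in practice, collect the indices by
# encoding the audio corpus with the trained tokenizer.
density = codebook_density(torch.randint(0, 4096, (100_000,)), codebook_size=4096)
print(density.sum())  # ~1.0
```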