Okay, I just saw that the image in this repo's README is actually correct/fixed.
Hi @crzdg
Thank you for your interest in our work.
The training time for the VQ-VAE is approximately 12 hours on a GeForce RTX 2070 Super. Alternatively, more sophisticated architectures could be employed to quantize the speech audio.
For the VQ-MAE, training takes around 24 hours on four A100 GPUs. Note, however, that all quantized indices for the VoxCeleb audio files are pre-computed and stored in an H5 file to speed up data loading. Without this pre-processing, training could take 3-4 days or more.
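Roughly, the pre-processing works like the sketch below (simplified; `vqvae.encode` is an illustrative stand-in, not the exact API of this repo): each audio file is encoded once with the frozen VQ-VAE, and the resulting discrete indices are cached in a single H5 file so the VQ-MAE dataloader only reads small integer arrays.

```python
import h5py
import numpy as np
import torch
import torchaudio

def cache_indices(vqvae, file_list, out_path="voxceleb_indices.h5", device="cuda"):
    # Encode every audio file once with the frozen VQ-VAE and store the
    # discrete codebook indices in one H5 file (one dataset per file).
    vqvae.eval().to(device)
    with h5py.File(out_path, "w") as h5, torch.no_grad():
        for path in file_list:
            wav, sr = torchaudio.load(path)             # (channels, T)
            wav = wav.mean(0, keepdim=True).to(device)  # mono, (1, T)
            # `vqvae.encode` is a hypothetical stand-in returning a LongTensor
            # of codebook indices; adapt to the repo's actual encoder API.
            indices = vqvae.encode(wav.unsqueeze(0))
            h5.create_dataset(path, data=indices.squeeze(0).cpu().numpy().astype(np.int32))

def load_cached(h5_path, key):
    # At VQ-MAE training time, the dataloader then does cheap integer lookups
    # instead of re-running the VQ-VAE on raw audio.
    with h5py.File(h5_path, "r") as h5:
        return torch.from_numpy(h5[key][...]).long()
```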
Thank you for bringing the error in the figure to our attention; I will rectify it promptly. And yes, the figure in the README is the correct one.
Thank you very much for the answers.
Hi
Interesting and great work! Can you elaborate a bit on the training time? How long did it take to pre-train the VQ-VAE and then the VQ-MAE-S?
Further, in the paper I think there is a misalignment between the text and Figure 1. For the output of the VQ-VAE encoder, the text refers to $\mathbb{Z}$, whereas the figure uses $\mathbb{R}$ throughout. I guess the figure contains a copy-paste error.
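For context, here is my reading in standard VQ-VAE notation (my own sketch, not the paper's exact symbols): the continuous encoder output lives in $\mathbb{R}$, but after quantization each position is a codebook index, i.e. an integer, which is presumably why the text uses $\mathbb{Z}$:

$$z_e = E(x) \in \mathbb{R}^{N \times D}, \qquad q_n = \operatorname*{arg\,min}_{k \in \{1,\dots,K\}} \lVert z_{e,n} - e_k \rVert_2 \;\in\; \mathbb{Z}$$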