Okay, I just saw that the image in this repo's README is actually correct/fixed.
Hi @crzdg
Thank you for your interest in our work.
The training time for the VQ-VAE is approximately 12 hours on a GeForce RTX 2070 Super. Alternatively, more sophisticated architectures could be employed to quantize the speech audio.
For the VQ-MAE, training takes around 24 hours on four A100 GPUs. Note, however, that all quantized indices for the VoxCeleb audio files are pre-computed and stored in an H5 file to speed up data loading. Without this pre-processing, training could take 3-4 days or more.
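Roughly, the pre-processing works like the sketch below (simplified; `vqvae.encode` is an illustrative stand-in, not the exact API of this repo): each audio file is encoded once with the frozen VQ-VAE, and the resulting discrete indices are cached in a single H5 file so the VQ-MAE dataloader only reads small integer arrays.

```python
import h5py
import numpy as np
import torch
import torchaudio

def cache_indices(vqvae, file_list, out_path="voxceleb_indices.h5", device="cuda"):
    # Encode every audio file once with the frozen VQ-VAE and store the
    # discrete codebook indices in one H5 file (one dataset per file).
    vqvae.eval().to(device)
    with h5py.File(out_path, "w") as h5, torch.no_grad():
        for path in file_list:
            wav, sr = torchaudio.load(path)             # (channels, T)
            wav = wav.mean(0, keepdim=True).to(device)  # mono, (1, T)
            # `vqvae.encode` is a hypothetical stand-in returning a LongTensor
            # of codebook indices; adapt to the repo's actual encoder API.
            indices = vqvae.encode(wav.unsqueeze(0))
            h5.create_dataset(path, data=indices.squeeze(0).cpu().numpy().astype(np.int32))

def load_cached(h5_path, key):
    # At VQ-MAE training time, the dataloader then does cheap integer lookups
    # instead of re-running the VQ-VAE on raw audio.
    with h5py.File(h5_path, "r") as h5:
        return torch.from_numpy(h5[key][...]).long()
```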
Thank you for bringing the error in the figure to our attention; I will rectify it promptly. And yes, the figure in the README is the correct one.
Thank you very much for the answers.
Hi
Interesting and great work! Can you elaborate a bit on the training time? How long did it take to pre-train the VQ-VAE and then the VQ-MAE-S?
Further, in the paper I think there is a misalignment between the text and Figure 1. For the output of the VQ-VAE encoder, the text refers to $\mathbb{Z}$, whereas the figure uses $\mathbb{R}$ throughout. I guess the figure contains a copy-paste error.
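For context, here is my reading in standard VQ-VAE notation (my own sketch, not the paper's exact symbols): the continuous encoder output lives in $\mathbb{R}$, but after quantization each position is a codebook index, i.e. an integer, which is presumably why the text uses $\mathbb{Z}$:

$$z_e = E(x) \in \mathbb{R}^{N \times D}, \qquad q_n = \operatorname*{arg\,min}_{k \in \{1,\dots,K\}} \lVert z_{e,n} - e_k \rVert_2 \;\in\; \mathbb{Z}$$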