About VQ and the datasets

Hi, thanks a lot for the great work. Using tokens at different resolutions matches the intuitive picture of an audio signal (something like separate representations of steady-state and transient features). I have a question: doesn't the use of coarse tokens lead to higher latency, since tokens at a lower sampling frequency need more audio buffered before they can be produced?
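
To make the latency concern concrete, here is a rough sketch of the arithmetic I have in mind. The token rates are hypothetical numbers chosen for illustration, not the project's actual settings, and the estimate ignores any model lookahead and compute time.

```python
def min_buffer_ms(tokens_per_second: float) -> float:
    """Audio duration covered by one token, in milliseconds.

    This is the least amount of audio that has to be buffered before
    a single token at this rate can be emitted.
    """
    return 1000.0 / tokens_per_second


# Hypothetical token rates (assumed for illustration only).
rates_hz = {"fine": 100.0, "coarse": 25.0}

buffer_ms = {name: min_buffer_ms(rate) for name, rate in rates_hz.items()}

for name, ms in buffer_ms.items():
    print(f"{name} tokens: at least {ms:.1f} ms of audio per token")
```

With these assumed rates, a coarse token spans four times as much audio as a fine one, so a streaming encoder would seem to need at least that much extra buffering.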

Also, could you tell me which datasets the open pre-trained models were trained on? Much appreciated.