Closed dyyoungg closed 4 days ago
I changed the downsamples to [8, 5, 4, 2], so the token rate is 16000 / (8 * 5 * 4 * 2) = 50 tokens per second, and I changed hop_length to 320 and n_fft to 1280; then everything works. Is this config reasonable?
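As a quick sanity check on the arithmetic (a sketch only; the exact config keys depend on your WavTokenizer setup), the hop length should equal the product of the downsample factors, and the token rate follows from the sampling rate:

```python
# Sanity-check the relationship between downsample rates, hop length,
# and token rate for 16 kHz audio (values assumed from the config above).
import math

sample_rate = 16000
downsamples = [8, 5, 4, 2]

hop_length = math.prod(downsamples)      # product of downsample factors
assert hop_length == 320                 # should match hop_length in the STFT config

tokens_per_second = sample_rate // hop_length
print(tokens_per_second)                 # -> 50 tokens per second
```

This is why hop_length = 320 pairs naturally with downsamples [8, 5, 4, 2] at 16 kHz.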
You may observe that the product of the downsampling factors must divide the sampling rate. This explains why one configuration is correct while the other is not.
Yeah, I noticed that. Thanks for your reply!
Thanks for your great work! I want to train WavTokenizer on my own 16 kHz datasets, but I encountered a tensor shape inconsistency in the following code.
I checked the model output and the original audio shapes, which are 64200 and 64000 respectively. The following is my config.
Is there any mistake or misunderstanding in my settings above that causes the output shape of the model to be inconsistent with the input shape?
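For what it's worth, a minimal sketch of how this kind of mismatch can arise, assuming the decoder reconstructs one hop_length of samples per frame and the STFT pads to a whole number of frames (the hop value of 300 below is illustrative, not taken from the actual config):

```python
import math

# Illustrative: why encode/decode round-tripping can lengthen audio.
# If the hop does not evenly divide the input length, the (padded) STFT
# yields ceil(len / hop) frames, and the decoder then reconstructs
# n_frames * hop samples -- slightly more than the input.
def roundtrip_length(n_samples: int, hop: int) -> int:
    n_frames = math.ceil(n_samples / hop)   # frames after padding
    return n_frames * hop                   # samples after decoding

print(roundtrip_length(64000, 300))  # -> 64200, the mismatch reported above
print(roundtrip_length(64000, 320))  # -> 64000, hop divides the length exactly
```

Note that a hop of 300 reproduces the reported 64200 exactly, while a hop of 320 (matching the product of the downsample factors) round-trips 64000 samples without any extra padding.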