Closed lzl1456 closed 1 year ago
@lzl1456 i'm not sure about this either, as i am far from an audio expert
but my thought was that because hubert (or wav2vec) was trained at a different sampling frequency than soundstream, the dataset needs to resample the input into two waveforms, one per target rate, otherwise the downstream pretrained networks will not function correctly
Yeah, it's good to have but it should default to 16kHz - this is what Hubert operates at. I believe Hubert will resample if the audio is not 16kHz but better to be on the safe side.
@eonglints ohh, speaking of someone who is knowledgeable :pray:
let me go make the changes! thanks!
Actually HuBERT won't automatically resample the input if you don't provide the sample rate. So you should not only change the default target sample rate, but also pass the original input sample rate to Hubert when calling it.
I'm wondering how much the sample rate mismatch will degrade HuBERT's performance. Suppose HuBERT is trained at 16kHz but used at 24kHz: will the semantic information be severely damaged?
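For a rough sense of what such a mismatch does to the signal (back-of-the-envelope arithmetic only, not a measurement of HuBERT quality): if samples recorded at 24kHz are interpreted as 16kHz, the model effectively hears the audio slowed down by 1.5x, with every frequency scaled by 16/24. The helper name `apparent_signal` is made up for this illustration.

```python
def apparent_signal(true_freq_hz, true_duration_s, actual_sample_hz, assumed_sample_hz):
    """What a model perceives when samples recorded at `actual_sample_hz`
    are interpreted as if they had been recorded at `assumed_sample_hz`."""
    ratio = assumed_sample_hz / actual_sample_hz
    num_samples = true_duration_s * actual_sample_hz
    apparent_freq_hz = true_freq_hz * ratio                # pitch shifts by the rate ratio
    apparent_duration_s = num_samples / assumed_sample_hz  # same samples, different clock
    return apparent_freq_hz, apparent_duration_s


# A 2 second, 440 Hz tone recorded at 24kHz but read as 16kHz.
freq, duration = apparent_signal(440.0, 2.0, actual_sample_hz=24000, assumed_sample_hz=16000)
print(f"{freq:.1f} Hz, {duration:.1f} s")  # 293.3 Hz, 3.0 s
```

How much this shift hurts the semantic tokens in practice would still need to be checked empirically.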
@cyanbx ohh, so i actually take care of resampling within the SoundDataset class, so setting the default target sample rate should autoconvert it before it is even passed in
but good to know that HuBERT won't take care of this automatically!
also, correct me if i'm wrong, but when you train on audio files, each file could have different input sampling freqs? or do you typically preprocess the entire dataset first to one before training?
@lucidrains I use a relatively small dataset with a consistent sample rate, but I'm not sure if that applies to others.
@cyanbx ahh got it, yea, this may not apply as we scale up
yea, i'll think about how to best handle this when resampling within the model classes themselves
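One possible shape for resampling inside the model class, as a minimal sketch: the wrapper resamples only when the caller states the input rate, and otherwise assumes the audio is already at its target rate. `SemanticModelWrapper`, `prepare`, and `linear_resample` are hypothetical names, not the library's API, and the linear-interpolation resampler is only a dependency-free stand-in for a proper resampler.

```python
def linear_resample(samples, orig_hz, target_hz):
    """Stand-in resampler (plain linear interpolation, no anti-aliasing).
    A real pipeline would use a proper resampling routine instead."""
    if orig_hz == target_hz or len(samples) == 0:
        return list(samples)
    new_len = max(1, round(len(samples) * target_hz / orig_hz))
    out = []
    for i in range(new_len):
        pos = i * orig_hz / target_hz            # position in the original signal
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


class SemanticModelWrapper:
    """Hypothetical wrapper: resamples only when the caller states the input rate."""

    def __init__(self, target_sample_hz=16000):
        self.target_sample_hz = target_sample_hz

    def prepare(self, samples, input_sample_hz=None):
        # No rate given: the samples are assumed to already be at target_sample_hz.
        if input_sample_hz is None:
            return list(samples)
        return linear_resample(samples, input_sample_hz, self.target_sample_hz)


wrapper = SemanticModelWrapper(target_sample_hz=16000)
one_second_at_24k = [0.0] * 24000

passed_through = wrapper.prepare(one_second_at_24k)                    # rate unknown
resampled = wrapper.prepare(one_second_at_24k, input_sample_hz=24000)  # rate stated
print(len(passed_through), len(resampled))  # 24000 16000
```

Passing the rate per call would also cover datasets where each file has a different sampling frequency, since nothing has to be fixed at construction time.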
class CoarseTransformerTrainer(nn.Module):
    ...
    self.ds = SoundDataset(
        folder,
        max_length = data_max_length,
        target_sample_hz = (
            wav2vec.target_sample_hz,
            soundstream.target_sample_hz
        ), # need 2 waves resampled differently here
        seq_len_multiple_of = soundstream.seq_len_multiple_of
    )
wav2vec.target_sample_hz defaults to 50000.
With this, the input passed to HuBERT would have a sample rate of 50kHz, because the data is resampled here:
data_tuple = tuple((resample(d, sample_hz, target_sample_hz) if exists(target_sample_hz) else d) for d, target_sample_hz in zip(data, self.target_sample_hz))
I'm not sure if there's a problem
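To make the concern concrete, here is a small sketch of the tuple resampling above that tracks only sample counts. `resampled_length` and `lengths_for_targets` are made-up helpers, and the 44100 Hz source rate and 24000 Hz soundstream rate are assumed values for illustration.

```python
def resampled_length(num_samples, sample_hz, target_sample_hz):
    """Sample count after resampling; None means 'leave the waveform as is'."""
    if target_sample_hz is None:
        return num_samples
    return round(num_samples * target_sample_hz / sample_hz)


def lengths_for_targets(num_samples, sample_hz, target_sample_hz_tuple):
    """One output length per target rate, mirroring the tuple of resampled waves."""
    return tuple(
        resampled_length(num_samples, sample_hz, target)
        for target in target_sample_hz_tuple
    )


one_second = 44100  # one second of audio at an assumed 44.1kHz source rate

with_default = lengths_for_targets(one_second, 44100, (50000, 24000))
with_16k = lengths_for_targets(one_second, 44100, (16000, 24000))
print(with_default)  # (50000, 24000)
print(with_16k)      # (16000, 24000)
```

With the 50000 default, the first waveform carries 50000 samples per second, so a model trained on 16kHz audio would not be receiving the rate it expects.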