lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Why is this sampling rate used in training CoarseTransformer? #108

Closed lzl1456 closed 1 year ago

lzl1456 commented 1 year ago

    class CoarseTransformerTrainer(nn.Module):
        # ...
        self.ds = SoundDataset(
            folder,
            max_length = data_max_length,
            target_sample_hz = (
                wav2vec.target_sample_hz,
                soundstream.target_sample_hz
            ), # need 2 waves resampled differently here
            seq_len_multiple_of = soundstream.seq_len_multiple_of
        )

wav2vec.target_sample_hz defaults to 50000.

With this, the input passed to HuBERT has a sample rate of 50 kHz, because the data is resampled here:

    data_tuple = tuple((resample(d, sample_hz, target_sample_hz) if exists(target_sample_hz) else d) for d, target_sample_hz in zip(data, self.target_sample_hz))

I'm not sure whether this is a problem.

lucidrains commented 1 year ago

@lzl1456 i'm not sure about this either, as i am far from an audio expert

but my thought was because hubert (or wav2vec) was trained at a different sampling frequency than soundstream, the dataset needs to resample the input to two waveforms correctly, otherwise the downstream pretrained networks will not function correctly
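The two-rate resampling described here can be sketched in plain Python. This is only an illustration of the idea behind the tuple comprehension quoted in the question, not the repository's code: `resample_linear` and `resample_for_targets` are hypothetical names, naive linear interpolation stands in for a proper band-limited resampler, and the sketch assumes the same waveform is resampled once per target rate.

```python
def resample_linear(samples, orig_hz, target_hz):
    """Naive linear-interpolation resampler (a stand-in, not band-limited)."""
    if orig_hz == target_hz:
        return list(samples)
    n_out = int(len(samples) * target_hz / orig_hz)
    out = []
    for i in range(n_out):
        pos = i * orig_hz / target_hz            # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)       # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def resample_for_targets(wave, sample_hz, target_sample_hz):
    """Return one waveform per target rate; a target of None leaves the wave untouched."""
    return tuple(
        resample_linear(wave, sample_hz, target) if target is not None else wave
        for target in target_sample_hz
    )


# One second of silence at 24 kHz, resampled for a 16 kHz semantic model
# and a 24 kHz acoustic codec (both rates are assumptions for this example).
wave = [0.0] * 24000
semantic_wave, acoustic_wave = resample_for_targets(wave, 24000, (16000, 24000))
print(len(semantic_wave), len(acoustic_wave))
```

Each pretrained network then receives audio at the rate it was trained on, which is why the dataset returns two differently resampled copies rather than one.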

eonglints commented 1 year ago

Yeah, it's good to have, but it should default to 16 kHz, which is what HuBERT operates at. I believe HuBERT will resample if the audio is not 16 kHz, but better to be on the safe side.

lucidrains commented 1 year ago

@eonglints ohh, speaking of someone who is knowledgeable :pray:

let me go make the changes! thanks!

cyanbx commented 1 year ago

Actually HuBERT won't automatically resample the input if you don't provide the sample rate. So you should not only change the default target sample rate, but also pass the original input sample rate to Hubert when calling it.

I'm wondering how much HuBERT's performance degrades under a sample rate mismatch. Suppose HuBERT is trained at 16 kHz but used at 24 kHz: will the semantic information be severely damaged?
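A small arithmetic sketch of what the mismatch does at the signal level, assuming the model simply interprets the samples as 16 kHz audio. It does not measure how much HuBERT's features degrade; it only shows that un-resampled 24 kHz audio looks slowed down and pitch-shifted to a 16 kHz model. The function name `mismatch_effect` is hypothetical.

```python
def mismatch_effect(actual_hz, assumed_hz, tone_hz, num_samples):
    """Describe how audio recorded at `actual_hz` appears to a model
    that assumes its input is sampled at `assumed_hz`."""
    scale = assumed_hz / actual_hz
    return {
        "apparent_tone_hz": tone_hz * scale,             # every frequency is scaled
        "true_duration_s": num_samples / actual_hz,
        "apparent_duration_s": num_samples / assumed_hz,  # audio seems stretched
    }


# One second of a 440 Hz tone recorded at 24 kHz, fed to a 16 kHz model.
effect = mismatch_effect(actual_hz=24000, assumed_hz=16000,
                         tone_hz=440.0, num_samples=24000)
print(effect)
```

In this example every frequency is scaled by 2/3 and the clip appears 1.5 times longer, so speech would sound slower and lower than anything the model saw during training.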

lucidrains commented 1 year ago

@cyanbx ohh, so i actually take care of resampling within the SoundDataset class, so setting the default target sample rate should autoconvert it before it is even passed in

but good to know that HuBERT won't take care of this automatically!

also, correct me if i'm wrong, but when you train on audio files, each file could have different input sampling freqs? or do you typically preprocess the entire dataset first to one before training?

cyanbx commented 1 year ago

@lucidrains I use a relatively small dataset with a consistent sample rate, but I'm not sure if that applies to others.

lucidrains commented 1 year ago

@cyanbx ahh got it, yea, this may not apply as we scale up

yea, i'll think about how to best handle this when resampling within the model classes themselves
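One way to find out whether a dataset mixes sample rates before training is to survey the file headers. The sketch below is an assumption-laden illustration, not part of the repository: it handles only uncompressed WAV files through Python's standard `wave` module, and `survey_sample_rates`, `write_silence`, and `demo` are hypothetical helpers.

```python
import os
import tempfile
import wave
from collections import Counter


def wav_sample_rate(path):
    """Read the sample rate from a WAV header without decoding the audio."""
    with wave.open(path, "rb") as f:
        return f.getframerate()


def survey_sample_rates(folder):
    """Count how many .wav files in `folder` use each sample rate."""
    rates = Counter()
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".wav"):
            rates[wav_sample_rate(os.path.join(folder, name))] += 1
    return rates


def write_silence(path, sample_hz, num_frames=100):
    """Write a short mono 16-bit silent WAV file (used only for the demo)."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_hz)
        f.writeframes(b"\x00\x00" * num_frames)


def demo():
    """Build a tiny mixed-rate dataset in a temp folder and survey it."""
    with tempfile.TemporaryDirectory() as folder:
        write_silence(os.path.join(folder, "a.wav"), 16000)
        write_silence(os.path.join(folder, "b.wav"), 16000)
        write_silence(os.path.join(folder, "c.wav"), 24000)
        return survey_sample_rates(folder)


print(demo())
```

If the survey shows more than one rate, the files can either be converted once up front or resampled per file at load time, as the dataset class discussed above does.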