lucidrains / voicebox-pytorch

Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch
MIT License

Where to get the kmeans_path = './path/to/hubert/kmeans.bin' file? #24

Open furqan4545 opened 11 months ago

furqan4545 commented 11 months ago

I tried to look for this file in the fairseq repo but couldn't find it. Could you please point out where we can download the kmeans.bin file?

kmeans_path = './path/to/hubert/kmeans.bin'

lucasnewman commented 11 months ago

@furqan4545 you can find a pretrained quantizer for HuBERT base here, and there are also utilities to create your own quantizer in the simple_kmeans directory from one of the larger models.
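
For illustration, a minimal sketch of using such a quantizer, assuming the .bin file is a joblib-serialized scikit-learn k-means model (the format fairseq's simple_kmeans scripts produce); the path and the feature array below are placeholders:

```python
import joblib
import numpy as np

# Load the pretrained quantizer (a scikit-learn MiniBatchKMeans under the hood).
km_model = joblib.load("./path/to/hubert/kmeans.bin")

# HuBERT frame features of shape (num_frames, feature_dim) -- e.g. 768 for base,
# 1024 for large. Random values here purely as a placeholder.
features = np.random.randn(100, 768).astype(np.float32)

# Each frame is assigned to its nearest cluster, giving discrete semantic units.
units = km_model.predict(features)  # shape (num_frames,), integer cluster ids
print(units[:10])
```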

itsjamie commented 11 months ago

@lucasnewman, I've written out the .tsv file that follows this format:

```
/path/to/root
audio.flac    nsamples
```

I can successfully extract MFCC features from the MLS dataset if I wanted to train from scratch. For others, the command looks like: python3 dump_mfcc_feature.py ~/hubert dev 1 0 ~/hubert/features.
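
For anyone else doing this, a hedged sketch of writing that manifest with soundfile (the directory layout and paths are placeholders, not the exact MLS structure):

```python
import os
import soundfile as sf

# First line of the manifest is the dataset root; each following line is
# "<relative path>\t<number of samples>".
root = os.path.expanduser("~/mls/dev/audio")
with open(os.path.expanduser("~/hubert/dev.tsv"), "w") as f:
    f.write(root + "\n")
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".flac"):
                continue
            path = os.path.join(dirpath, name)
            nsamples = sf.info(path).frames  # number of audio samples in the file
            f.write(f"{os.path.relpath(path, root)}\t{nsamples}\n")
```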

But my understanding is that I need to use dump_hubert_feature.py instead if I want to do the k-means clustering for a pretrained HuBERT model. I don't fully grasp the reason behind the {layer} command-line argument.

I read through https://jonathanbgn.com/2021/10/30/hubert-visually-explained.html, which seemed to imply that I should be using a layer value of 9 for the large model, but I'm not sure.

Would appreciate any pointers to resources that explain the intent.

I figured out that nshard is for splitting the dataset, and really intended so that you can run one shard per GPU. So if you are on a single-GPU setup, an nshard of 1 is fine.
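
Roughly, the nshard/rank arguments just slice the manifest so each worker handles one chunk. A sketch of the idea (not fairseq's exact code):

```python
# Illustration only: each worker (rank) gets one contiguous slice of the
# manifest, so shards can be dumped on separate GPUs/machines in parallel.
def get_shard(lines, nshard, rank):
    assert 0 <= rank < nshard
    per_shard = (len(lines) + nshard - 1) // nshard  # ceiling division
    return lines[rank * per_shard : (rank + 1) * per_shard]

lines = [f"utt_{i}.flac\t16000" for i in range(10)]
print(get_shard(lines, nshard=1, rank=0))  # single-GPU setup: the whole dataset
print(get_shard(lines, nshard=4, rank=2))  # one of four shards
```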

lucasnewman commented 11 months ago

@itsjamie You're on the right track, but for the pretrained models, you won't need the MFCC features, since they are only used in bootstrapping the base model during the first iteration, and the pretrained models have already gone through 2/3 iterations.

The specific output layer you choose for HuBERT features will affect the semantic representation you get, because the model is transforming the inputs progressively through each layer and building a different latent representation.

The WavLM paper (see figure 2) gives a fairly nice overview of layer selection vs. task that you could use as a reference, but empirically layer (N / 2) + 1, where N is the model depth (layers numbered 1...N), seems to be a common choice. You may want to do some experimentation with different layers, if you have the time, to see what gives you the best performance for your particular task.
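
As a concrete (hedged) sketch of grabbing one specific layer's features, here's the torchaudio route, which is an alternative to fairseq's dump_hubert_feature.py; the layer index and file path are placeholders:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_LARGE  # 24 transformer layers
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("audio.flac")  # (1, num_samples) for a mono file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

layer = 13  # e.g. (24 / 2) + 1 per the heuristic above; worth experimenting
with torch.inference_mode():
    # extract_features returns a list with one tensor per transformer layer
    feats, _ = model.extract_features(waveform, num_layers=layer)
features = feats[layer - 1]  # (batch, num_frames, 1024) from the chosen layer
```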

I have an example of similar k-means extraction from a HuBERT-like model here that imho is a little easier to follow and might help it click!
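
And if you do roll your own quantizer, a minimal sketch in the spirit of the simple_kmeans scripts (the feature file, cluster count, and hyperparameters below are assumptions):

```python
import joblib
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Dumped HuBERT features for the chosen layer, shape (num_frames, feature_dim).
features = np.load("hubert_layer_features.npy")

km = MiniBatchKMeans(
    n_clusters=500,   # number of semantic units; 500 or 1000 are common choices
    batch_size=10000,
    max_iter=100,
    n_init=20,
)
km.fit(features)

# Save in the same joblib format, then load with joblib.load and call .predict().
joblib.dump(km, "kmeans.bin")
```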

itsjamie commented 11 months ago

Thank you for the pointers! I'll take a look at the paper and experiment some.

Unrelated, but if you happen to have a chance: I asked a question in Discussions about the coin-flip code that chooses between the sequence-length mask and the dropout mask. I wasn't sure why a coin flip was chosen, because the paper doesn't seem to define it.

I've emailed one of the authors because I was curious about the ", and otherwise" language. This could also just be my lack of knowledge in this area as I try to get up to speed, but if you happen to know, that would be incredible.

lucasnewman commented 11 months ago

@itsjamie I think the coin flip is just to balance the training objectives and make sure both masking variations are being used during training.
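
Conceptually it's just something like this (an illustrative sketch, not the repository's exact code; the fraction range and drop probability are made-up values):

```python
import torch

def build_mask(seq_len: int, frac_lengths=(0.7, 1.0), drop_prob: float = 0.3):
    # Flip a coin to decide which masking variant this training example gets,
    # so both objectives see gradient during training.
    if torch.rand(1).item() < 0.5:
        # Variant 1: mask a contiguous span covering a random fraction of the sequence.
        frac = torch.empty(1).uniform_(*frac_lengths).item()
        mask_len = int(seq_len * frac)
        start = torch.randint(0, seq_len - mask_len + 1, (1,)).item()
        mask = torch.zeros(seq_len, dtype=torch.bool)
        mask[start : start + mask_len] = True
    else:
        # Variant 2: independently drop random positions with probability drop_prob.
        mask = torch.rand(seq_len) < drop_prob
    return mask

print(build_mask(10))
```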

With regard to the batch dimension, that's just a way of representing multiple inputs in a single tensor (e.g. several audio clips padded and stacked together), which is important for training the network efficiently and utilizing the full computing power of modern GPUs.
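
A tiny illustration of the batch dimension, assuming variable-length mono clips padded into one tensor (placeholder waveforms):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three mono waveforms of different lengths (random placeholders).
clips = [torch.randn(16000), torch.randn(24000), torch.randn(12000)]
lengths = torch.tensor([c.shape[0] for c in clips])

# Pad to the longest clip and stack along a new batch dimension.
batch = pad_sequence(clips, batch_first=True)  # shape (3, 24000)

# Boolean mask marking real samples vs. padding, used so the model ignores padding.
mask = torch.arange(batch.shape[1])[None, :] < lengths[:, None]  # shape (3, 24000)
print(batch.shape, mask.shape)
```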

itsjamie commented 11 months ago

@lucasnewman thanks for the assistance. I've generated my own kmeans and am in the process of generating a TextToSemantic model.

I have the dev set of MLS acting as a parallel dataset to your configured LibriTTS, to see if I understand this at all :)

For others who come after: I decided to experiment and use layer 22 from hubert-large-ll60k.pt; another option I saw was layer 23. From what I've read, the decision came down to this:

> The WavLM paper (see figure 2) gives a fairly nice overview of layer selection vs. task that you could use as a reference, but empirically layer (N / 2) + 1, where N is the model depth (layers numbered 1...N), seems to be a common choice. You may want to do some experimentation with different layers, if you have the time, to see what gives you the best performance for your particular task.

The PR and ASR tasks are heavily weighted in those layers, so I figured they were as good a first test as any.

lucasnewman commented 11 months ago

Nice! There are some additional quantizers for HuBERT base that Meta made available here if you want to compare and contrast. Let us know how it goes!