utter-project / fairseq

This is a fork of the original fairseq repository (version 0.12.2) with added classes for training mHuBERT-147.

get clustering labels #4

Open taalua opened 3 days ago

taalua commented 3 days ago

Hi, thank you for your excellent work.

I want to extract labels from features produced by the mHuBERT-147 checkpoint_best.pt, using the existing k-means model. I tried to follow this script: https://github.com/utter-project/mHuBERT-147-scripts/blob/main/03_faiss_indices/apply_index_per_file.py

However, I am not sure about the *.len file that it expects. Can you explain how to generate this file?

Thank you.

mzboito commented 3 days ago

Hello,

Many thanks for the interest in our work!

Please be aware that the available k-means model (https://huggingface.co/utter-project/mHuBERT-147/blob/main/mhubert147_faiss.index) was trained on the 2nd iteration model (available here: https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter). It was NOT trained on the checkpoint_best.pt from the 3rd iteration.

If you want to generate discrete labels using features from the 3rd iteration model (https://huggingface.co/utter-project/mHuBERT-147/blob/main/checkpoint_best.pt) you will need to train a new k-means model.

The procedure for doing so is the following:

  1. Create manifest files for your data and extract features using this script: https://github.com/utter-project/mHuBERT-147-scripts/blob/main/02_feature_extraction/hubert_feature_extraction.sh (this produces the .npy feature files together with the .len files you asked about)
  2. Train the k-means model on your extracted features: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/03_faiss_indices
  3. Extract the labels using the script you mentioned (https://github.com/utter-project/mHuBERT-147-scripts/blob/main/03_faiss_indices/apply_index_per_file.py)
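
To make the .npy/.len pairing from step 1 more concrete, here is a minimal sketch. It assumes the layout used by the upstream fairseq HuBERT feature dump, where the .npy file holds the frames of all utterances stacked into one (total_frames, dim) array and the .len file lists one frame count per line, in manifest order. Whether the mHuBERT-147 scripts follow exactly this layout is an assumption to check against the extraction script; the helper names `load_lengths` and `split_by_lengths` are hypothetical.

```python
import numpy as np


def load_lengths(len_path):
    """Read a .len file: one integer frame count per non-empty line (assumed layout)."""
    with open(len_path) as f:
        return [int(line) for line in f if line.strip()]


def split_by_lengths(feats, lengths):
    """Split a stacked (total_frames, dim) array into one array per utterance."""
    lengths = np.asarray(lengths, dtype=np.int64)
    if lengths.sum() != feats.shape[0]:
        raise ValueError(
            f"lengths sum to {lengths.sum()} but features have {feats.shape[0]} frames"
        )
    # Cumulative offsets mark where each utterance ends; the last one is implicit.
    offsets = np.cumsum(lengths)[:-1]
    return np.split(feats, offsets, axis=0)
```

With real data this would be called as `split_by_lengths(np.load(npy_path), load_lengths(len_path))`, where both paths are whatever the extraction step wrote for a given shard.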

taalua commented 2 days ago

Thank you for your prompt response. Just to clarify: if I want to get discrete labels from the 2nd iteration model, can I use this script? https://github.com/utter-project/mHuBERT-147-scripts/blob/main/03_faiss_indices/apply_index_per_file.py

Thank you.

mzboito commented 2 days ago

Yes. You just need to extract the features first!
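
For readers wondering what "extracting labels" amounts to once the features exist: conceptually, each frame receives the id of its nearest k-means centroid. The sketch below illustrates that idea with a brute-force NumPy search. It is a stand-in for illustration only and is not how apply_index_per_file.py works; the released model is a faiss index and should be applied with the script above. The function name `assign_labels` and the `centroids` array are hypothetical.

```python
import numpy as np


def assign_labels(feats, centroids):
    """Return, for each frame in feats (n, dim), the index of the nearest centroid (k, dim)."""
    feats = np.asarray(feats, dtype=np.float64)
    centroids = np.asarray(centroids, dtype=np.float64)
    # Squared Euclidean distance expanded as ||x||^2 - 2 x.c + ||c||^2.
    x2 = (feats ** 2).sum(axis=1, keepdims=True)  # shape (n, 1)
    c2 = (centroids ** 2).sum(axis=1)             # shape (k,)
    dists = x2 - 2.0 * feats @ centroids.T + c2   # shape (n, k)
    return dists.argmin(axis=1)
```

The important constraint from this thread still applies: the centroids (or index) must have been trained on features from the same model and layer that produced `feats`, otherwise the labels are not meaningful.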