get clustering labels

taalua commented 3 days ago

Hi, thank you for your excellent work.

I want to extract labels from the features produced by the mHuBERT-147 checkpoint_best.pt, using the existing k-means model. I tried to follow the script at https://github.com/utter-project/mHuBERT-147-scripts/blob/main/03_faiss_indices/apply_index_per_file.py

However, I am not sure about the *.len file. Could you explain how to generate it?

Thank you.

mzboito commented 3 days ago

Hello,

Many thanks for the interest in our work!

Please be aware that the available k-means model (https://huggingface.co/utter-project/mHuBERT-147/blob/main/mhubert147_faiss.index) was trained on the 2nd iteration model (available here: https://huggingface.co/utter-project/mHuBERT-147-base-2nd-iter). It was NOT trained on the checkpoint_best.pt from the 3rd iteration.

If you want to generate discrete labels using features from the 3rd iteration model (https://huggingface.co/utter-project/mHuBERT-147/blob/main/checkpoint_best.pt), you will need to train a new k-means model.

The procedure for doing so is the following:

taalua commented 2 days ago

Thank you for your prompt response. Just to clarify: if I want to get discrete labels from the 2nd iteration model, can I use this script? https://github.com/utter-project/mHuBERT-147-scripts/blob/main/03_faiss_indices/apply_index_per_file.py

Thank you.

mzboito commented 2 days ago

Yes. You just need to extract the features first!
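
For readers unsure what "discrete labels" means here: conceptually, each extracted feature frame is mapped to the id of its closest cluster centroid. Below is a minimal sketch of that idea only, in plain Python with made-up toy values. The function name `nearest_centroid_labels` and the toy data are hypothetical and exist purely for illustration. This is not the repository's `apply_index_per_file.py`, which operates on the faiss index and on real extracted features.

```python
from typing import List, Sequence


def nearest_centroid_labels(
    features: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Return, for each feature frame, the index of the closest centroid."""
    labels = []
    for frame in features:
        # Squared Euclidean distance from this frame to every centroid.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        # The discrete label is the position of the smallest distance.
        labels.append(distances.index(min(distances)))
    return labels


# Toy data (hypothetical values): three 2-dimensional frames, two centroids.
toy_features = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]]
toy_centroids = [[0.0, 0.0], [1.0, 1.0]]
toy_labels = nearest_centroid_labels(toy_features, toy_centroids)
print(toy_labels)
```

The sketch also suggests why the features and the clustering model presumably have to come from the same model iteration: centroids fitted on one feature space are not meaningful distances away from frames produced by another.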
