microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

SpeechT5: How to get speaker embeddings? #16

Closed · Arrivederci closed this 1 year ago

Arrivederci commented 1 year ago

Hi, I found that there must be 3 columns in the audio manifest TSV file. Is there a tutorial or example on how to get the speaker embeddings using my own dataset? Is it possible to pretrain a model on a dataset without speaker labels? Thanks 😊

mechanicalsea commented 1 year ago

Hi, Arrivederci.

We use speechbrain/spkrec-xvect-voxceleb on Hugging Face to get the speaker embeddings.

Here is an example of extracting speaker embeddings: https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py
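
For reference, here is a minimal sketch of loading that model and extracting one embedding (the wav path is a placeholder and the audio is assumed to be 16 kHz mono):

import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the pretrained x-vector speaker encoder from Hugging Face.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

# Encode a 16 kHz mono waveform into a 512-dimensional x-vector.
signal, fs = torchaudio.load("example.wav")  # placeholder path
assert fs == 16000
with torch.no_grad():
    embedding = classifier.encode_batch(signal)  # shape: (1, 1, 512)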

Hope it helps.

Arrivederci commented 1 year ago

That's helpful, thanks.

StephennFernandes commented 1 year ago

@mechanicalsea @Arrivederci hey, did you get this working: https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py

The dataset, i.e. the wav files being used, is not English, by the way; it consists of multiple Indian languages.

This is how the output in the terminal looks:

cmu_us_bdl_arctic 0 utterances.
cmu_us_clb_arctic 0 utterances.
cmu_us_rms_arctic 0 utterances.
cmu_us_slt_arctic 0 utterances.

and the corresponding output directory is empty.

Am I doing something wrong? I'm new to x-vectors and don't know much about them.

mechanicalsea commented 1 year ago

The script as you used it extracts x-vectors from cmu_us_bdl_arctic, cmu_us_clb_arctic, cmu_us_rms_arctic, and cmu_us_slt_arctic.

If you want to extract x-vectors from other speakers, specify the --splits argument accordingly.
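
For example (a sketch only: the split names below are placeholders for your own speaker subdirectories, and any other arguments, such as the input and output roots, should be taken from the script itself):

python prep_cmu_arctic_spkemb.py --splits my_speaker_a my_speaker_b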

StephennFernandes commented 1 year ago

@mechanicalsea just to get some context here, as I am pretty new to this area: how do I need to arrange my files in the directory to get x-vector speaker embeddings? Should I run this script on all of the unsupervised data that I need for pretraining SpeechT5, or is this script only used to get speaker x-vectors on a subset of the data?

Until now I have generated the wav2vec2 manifest, MFCC features, and HuBERT features based on the SpeechT5 pretraining documentation, where I placed all my unsupervised data into a directory: pretrain_data/*.wav (cut to 30 seconds and sampled at 16 kHz).

But I am slightly confused about how to use the speaker embeddings.

mechanicalsea commented 1 year ago

The script is only used for fine-tuning on the voice conversion task, but it is a helpful reference for reimplementing x-vector extraction for any waveforms. Specifically, for pre-training's speaker embeddings, you can use the same speaker model to extract the x-vectors.

StephennFernandes commented 1 year ago

@mechanicalsea okay, I get it. So in my use case, where I need those extracted x-vectors for pretraining SpeechT5 on my wav files, how do I run the script to extract them? What should the --splits argument be? Is there any additional argument that I need to change?

mechanicalsea commented 1 year ago

Running the script as-is may raise errors. I recommend reimplementing x-vector extraction for your pre-training dataset along the lines of the def f2embed function, using the x-vector model.

StephennFernandes commented 1 year ago

hey @mechanicalsea

I built something like this and got all the x-vectors. Could you please confirm that what I did is correct? I don't want to end up with a wrong implementation, as I am new to this.

import os
import glob
import argparse

import numpy
import torch
import torch.nn.functional as F
import torchaudio
from speechbrain.pretrained import EncoderClassifier
from tqdm import tqdm

# Embedding size produced by each supported speaker model.
spk_model = {
    "speechbrain/spkrec-xvect-voxceleb": 512,
    "speechbrain/spkrec-ecapa-voxceleb": 192,
}

def f2embed(wav_file, classifier, size_embed):
    """Extract one L2-normalized speaker embedding from a 16 kHz wav file."""
    signal, fs = torchaudio.load(wav_file)
    assert fs == 16000, fs
    with torch.no_grad():
        embeddings = classifier.encode_batch(signal)
        embeddings = F.normalize(embeddings, dim=2)
        embeddings = embeddings.squeeze().cpu().numpy()
    assert embeddings.shape[0] == size_embed, embeddings.shape[0]
    return embeddings

def process(args):
    # Collect all wav/flac files directly under the input root.
    wavlst = []
    for ext in ["wav", "flac"]:
        wav_dir = os.path.join(args.input_root, "*." + ext)
        wavlst_split = glob.glob(wav_dir)
        print(f"{ext.upper()} {len(wavlst_split)} utterances.")
        wavlst.extend(wavlst_split)

    spkemb_root = args.output_root
    if not os.path.exists(spkemb_root):
        print(f"Create speaker embedding directory: {spkemb_root}")
        os.makedirs(spkemb_root)

    # Load the pretrained speaker encoder, on GPU if available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    classifier = EncoderClassifier.from_hparams(source=args.classifier_model, run_opts={"device": device}, savedir=os.path.join("/tmp", args.classifier_model))
    size_embed = spk_model[args.classifier_model]

    # Save one .npy embedding per utterance, keyed by the wav file's basename.
    for wav_file in tqdm(wavlst, total=len(wavlst), desc="Extracting XVectors"):
        basename = os.path.splitext(os.path.basename(wav_file))[0]
        xvector = f2embed(wav_file, classifier, size_embed)
        numpy.save(os.path.join(spkemb_root, f"{basename}.npy"), xvector)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-root", "-i", required=True, type=str, help="Input directory containing WAV files.")
    parser.add_argument("--output-root", "-o", required=True, type=str, help="Output directory for speaker embeddings.")
    # Not required, so the default model is used when -c is omitted
    # (required=True would have made the default pointless).
    parser.add_argument("--classifier-model", "-c", type=str, default="speechbrain/spkrec-xvect-voxceleb",
                        choices=["speechbrain/spkrec-xvect-voxceleb", "speechbrain/spkrec-ecapa-voxceleb"],
                        help="Pretrained model for extracting speaker embeddings.")
    args = parser.parse_args()
    print(f"Extracting XVectors from {args.input_root}, "
        + f"and saving to {args.output_root}, "
        + f"using the {args.classifier_model} classifier with {spk_model[args.classifier_model]} size.")
    process(args)

if __name__ == "__main__":
    """
    Example usage:
    python get_xvectors.py \
        -i /path/to/wav/files \
        -o /path/to/output/dir \
        -c speechbrain/spkrec-xvect-voxceleb
    """
    main()

mechanicalsea commented 1 year ago

Since you got all the x-vectors using the speechbrain/spkrec-xvect-voxceleb model as well as the def f2embed function, I think it should be good.
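
As a quick sanity check, you could also load one of the saved embeddings back and verify its shape (a minimal sketch; the .npy filename is a placeholder):

import numpy

# Load one saved embedding (placeholder filename).
xvector = numpy.load("/path/to/output/dir/utt_0001.npy")

# speechbrain/spkrec-xvect-voxceleb embeddings should be 512-dimensional,
# matching the assert in f2embed.
assert xvector.shape == (512,)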

StephennFernandes commented 1 year ago

@mechanicalsea as my training dataset is multilingual, consisting of multiple Indian languages, but the model used is speechbrain/spkrec-xvect-voxceleb, will there be any issues with the performance and accuracy of the x-vectors? Will this have any impact on my SpeechT5 pretraining?

mechanicalsea commented 1 year ago

Could you post another issue about extracting multilingual speaker embeddings?