Closed Arrivederci closed 1 year ago
Hi, Arrivederci.
We use speechbrain/spkrec-xvect-voxceleb on Hugging Face to get the speaker embeddings.
Here is an example that extracts speaker embeddings: https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py
Hope it helps.
That's helpful, thanks.
@mechanicalsea @Arrivederci hey, did you get this working: https://huggingface.co/mechanicalsea/speecht5-vc/blob/main/manifest/utils/prep_cmu_arctic_spkemb.py
By the way, the dataset (i.e. the wav files I'm using) is not English; it is multilingual, covering several Indian languages.
This is what the terminal output looks like:
cmu_us_bdl_arctic 0 utterances.
cmu_us_clb_arctic 0 utterances.
cmu_us_rms_arctic 0 utterances.
cmu_us_slt_arctic 0 utterances.
and the corresponding output dir is empty.
Am I doing something wrong? I'm new to xvectors and don't know much about them.
The script, as you ran it, extracts xvectors from cmu_us_bdl_arctic, cmu_us_clb_arctic, cmu_us_rms_arctic, and cmu_us_slt_arctic.
If you want to extract xvectors from other speakers, specify --splits accordingly.
@mechanicalsea just to get some context, as I am pretty new to this area: how should I arrange my files in the directory to get xvector speaker embeddings? Should I run this script on all of the unsupervised data that I need for pretraining SpeechT5, or is it only meant to get speaker xvectors for a subset of the data?
So far I have generated the wav2vec2 manifest, mfcc_feature, and hubert feature based on the SpeechT5 pretraining documentation, with all my unsupervised data placed in one dir: pretrain_data/*.wav (cut to 30 seconds and sampled at 16k),
but I am slightly confused about how to use the speaker embeddings.
The script is only used for fine-tuning on the voice conversion task, but it is a helpful reference for reimplementing xvector extraction for any waveforms. Specifically, for the pre-training speaker embeddings, you can use the same speaker model to extract xvectors.
@mechanicalsea okay, I get it. So in my use case, where I need the extracted xvectors for pretraining SpeechT5 on my own wav files, how do I run the script to extract them? What should the --splits arg be, and are there any additional args that I need to change?
Using the script as-is may raise errors.
I recommend reimplementing xvector extraction for your pre-training dataset, following def f2embed with the xvector model.
Hey @mechanicalsea,
I built something like this and got all the xvectors. Could you please confirm that what I did is correct? I don't want to end up with a wrong implementation, as I am new to this.
import os
import glob
import argparse

import numpy
import torch
import torch.nn.functional as F
import torchaudio
from speechbrain.pretrained import EncoderClassifier
from tqdm import tqdm

# Embedding size produced by each supported speaker model.
spk_model = {
    "speechbrain/spkrec-xvect-voxceleb": 512,
    "speechbrain/spkrec-ecapa-voxceleb": 192,
}


def f2embed(wav_file, classifier, size_embed):
    """Return the L2-normalized speaker embedding of one audio file."""
    signal, fs = torchaudio.load(wav_file)
    assert fs == 16000, fs  # the speaker models expect 16 kHz audio
    with torch.no_grad():
        embeddings = classifier.encode_batch(signal)
        embeddings = F.normalize(embeddings, dim=2)
        embeddings = embeddings.squeeze().cpu().numpy()
    assert embeddings.shape[0] == size_embed, embeddings.shape[0]
    return embeddings


def process(args):
    # Collect every WAV and FLAC file directly under the input directory.
    wavlst = []
    for ext in ["wav", "flac"]:
        wav_dir = os.path.join(args.input_root, "*." + ext)
        wavlst_split = glob.glob(wav_dir)
        print(f"{ext.upper()} {len(wavlst_split)} utterances.")
        wavlst.extend(wavlst_split)

    spkemb_root = args.output_root
    if not os.path.exists(spkemb_root):
        print(f"Create speaker embedding directory: {spkemb_root}")
        os.makedirs(spkemb_root)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    classifier = EncoderClassifier.from_hparams(
        source=args.classifier_model,
        run_opts={"device": device},
        savedir=os.path.join("/tmp", args.classifier_model),
    )
    size_embed = spk_model[args.classifier_model]

    # Save one <basename>.npy embedding per audio file.
    for wav_file in tqdm(wavlst, total=len(wavlst), desc="Extracting XVectors"):
        basename = os.path.splitext(os.path.basename(wav_file))[0]
        xvector = f2embed(wav_file, classifier, size_embed)
        numpy.save(os.path.join(spkemb_root, f"{basename}.npy"), xvector)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-root", "-i", required=True, type=str,
                        help="Input directory containing WAV files.")
    parser.add_argument("--output-root", "-o", required=True, type=str,
                        help="Output directory for speaker embeddings.")
    # Not marked required, so that the default below can take effect.
    parser.add_argument("--classifier-model", "-c", type=str,
                        default="speechbrain/spkrec-xvect-voxceleb",
                        choices=list(spk_model),
                        help="Pretrained model for extracting speaker embeddings.")
    args = parser.parse_args()
    print(f"Extracting XVectors from {args.input_root}, "
          f"and saving to {args.output_root}, "
          f"using the {args.classifier_model} classifier "
          f"with {spk_model[args.classifier_model]} size.")
    process(args)


if __name__ == "__main__":
    # Usage:
    #   python get_xvectors.py \
    #       -i /path/to/wav/files \
    #       -o /path/to/output/dir \
    #       -c speechbrain/spkrec-xvect-voxceleb
    main()
Since you got all the xvectors and used the speechbrain/spkrec-xvect-voxceleb model as well as the def f2embed function, I think it should be fine.
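A quick sanity check on the saved files can be sketched as follows. It assumes the layout produced by the script above (one .npy file per utterance in the output directory) and that each vector was L2-normalized, so it should have 512 entries and a norm close to 1. The name check_embeddings and the tolerance are hypothetical, chosen only for illustration.

```python
import glob
import os

import numpy as np


def check_embeddings(spkemb_root, size_embed=512, tol=1e-3):
    """Return the .npy files whose shape, values, or L2 norm look wrong."""
    bad = []
    for path in sorted(glob.glob(os.path.join(spkemb_root, "*.npy"))):
        emb = np.load(path)
        ok = (
            emb.shape == (size_embed,)          # one flat vector per file
            and bool(np.all(np.isfinite(emb)))  # no NaN or inf entries
            and abs(float(np.linalg.norm(emb)) - 1.0) < tol  # unit length
        )
        if not ok:
            bad.append(path)
    return bad
```

An empty list from check_embeddings("/path/to/output/dir") means every file passed these basic checks; it says nothing about how well the embeddings separate speakers.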
@mechanicalsea my training dataset is multilingual (multiple Indian languages), but the model used is speechbrain/spkrec-xvect-voxceleb. Will there be any issues with the performance and accuracy of the xvectors? Will this have any impact on my SpeechT5 pretraining?
Could you open another issue about extracting speaker embeddings for multilingual data?
Hi, I found that there must be 3 columns in the audio manifest tsv file. Is there a tutorial or example on how to get the speaker embedding using my own dataset? Is it possible to pretrain a model on a dataset without speaker label? Thanks 😊
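For reference, a three-column manifest could be built along these lines. This is only a sketch under assumptions not confirmed in this thread: that the first line of the TSV is the audio root directory (as in wav2vec 2.0 manifests), and that the three columns are the audio file name, its number of samples, and the path to its speaker embedding. The function name write_manifest is hypothetical, and the standard-library wave module used here only reads PCM WAV files. No speaker labels are involved, since the embedding is computed per utterance.

```python
import os
import wave


def write_manifest(audio_root, spkemb_root, manifest_path):
    """Write a TSV manifest and return the number of utterance rows."""
    rows = []
    for name in sorted(os.listdir(audio_root)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".wav":
            continue
        emb_path = os.path.join(spkemb_root, base + ".npy")
        if not os.path.exists(emb_path):
            continue  # skip audio that has no extracted embedding
        with wave.open(os.path.join(audio_root, name), "rb") as w:
            n_samples = w.getnframes()
        # Assumed column order: audio name, sample count, embedding path.
        rows.append(f"{name}\t{n_samples}\t{emb_path}")
    with open(manifest_path, "w") as f:
        f.write(audio_root + "\n")  # header line: audio root directory
        for row in rows:
            f.write(row + "\n")
    return len(rows)
```

It is worth checking the resulting column order against the data loader in the SpeechT5 code before training.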