I am trying to fine-tune models to support one more speaker, but it looks like I am doing something wrong.
I want to use the "dia_hard" pipeline, so I need to fine-tune the models {sad_dihard, scd_dihard, emb_voxceleb}.
For my speaker I have one WAV file with a duration of more than 1 hour.
So, I created a database.yml file:
```yaml
Databases:
  IK: /content/fine/kirilov/{uri}.wav

Protocols:
  IK:
    SpeakerDiarization:
      kirilov:
        train:
          uri: train.lst
          annotation: train.rttm
          annotated: train.uem
```
and put the additional files next to database.yml:

```
kirilov
├── database.yml
├── kirilov.wav
├── train.lst
├── train.rttm
└── train.uem
```
train.lst:

```
kirilov
```

train.rttm:

```
SPEAKER kirilov 1 0.0 3600.0 <NA> <NA> Kirilov <NA> <NA>
```

train.uem:

```
kirilov NA 0.0 3600.0
```
I assume this tells the trainer to use the kirilov.wav file and take 3600 seconds of audio from it for training.

Now I fine-tune the models; the current folder is /content/fine/kirilov, so database.yml is taken from the current directory. The output looks like […], etc. Then I try to run the pipeline with the new .pt weights (quoted below). The result is that, for my new.wav, the whole audio is recognized as one speaker talking without pauses, so I assume the models were broken; it does not matter whether I train for 1 epoch or for 100. In case I use […] or […], everything is ok and the result is similar to […]. Could you please advise what could be wrong with my training/fine-tuning process?
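A quick way to sanity-check that such a custom protocol is picked up correctly is sketched below; it assumes pyannote.database is installed and that `PYANNOTE_DATABASE_CONFIG` points at the database.yml shown above:

```python
# Sketch: verify that the custom protocol resolves files and annotations.
import os
from pyannote.database import get_protocol, FileFinder

os.environ["PYANNOTE_DATABASE_CONFIG"] = "/content/fine/kirilov/database.yml"

protocol = get_protocol(
    "IK.SpeakerDiarization.kirilov",
    preprocessors={"audio": FileFinder()},  # resolves {uri}.wav to an actual path
)

for current_file in protocol.train():
    print(current_file["uri"])          # should print "kirilov"
    print(current_file["annotated"])    # Timeline built from train.uem
    print(current_file["annotation"])   # Annotation built from train.rttm
```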
Your train.rttm tells pyannote.audio that speaker Kirilov speaks for the whole hour of kirilov.wav (only speech, no non-speech). Therefore, fine-tuning speech activity detection (SAD) will most likely lead the model to always return the speech class. You need both speech and non-speech regions for fine-tuning to make sense.

Fine-tuning speaker change detection (SCD) and speaker embedding (EMB) with just one speaker does not really make sense either: with a single speaker there are no speaker changes to detect and no different-speaker pairs to discriminate.
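For illustration (these timestamps are made up, not taken from the thread), an RTTM that gives the SAD model both classes lists only the actual speech turns, so the unlabeled gaps inside the annotated UEM region count as non-speech:

```
SPEAKER kirilov 1 12.30 4.10 <NA> <NA> Kirilov <NA> <NA>
SPEAKER kirilov 1 20.75 7.60 <NA> <NA> Kirilov <NA> <NA>
SPEAKER kirilov 1 35.00 2.40 <NA> <NA> Kirilov <NA> <NA>
```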
And try to run the pipeline with the new .pt's:

```python
import os
import torch
from pyannote.audio.pipeline import SpeakerDiarization

pipeline = SpeakerDiarization(
    embedding="/content/fine/emb/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
    sad_scores="/content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
    scd_scores="/content/fine/scd/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
    method="affinity_propagation",
)
```
The pipeline needs to be adapted to these new SAD, SCD, and EMB models as well. See this tutorial.
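In other words, the pipeline hyper-parameters (clustering thresholds, etc.) have to be re-tuned on top of the new scores. A sketch of what applying the adapted pipeline can look like, where params.yml is assumed to be the output of that tuning step (not the exact commands from the tutorial):

```python
# Sketch: apply the adapted pipeline once its hyper-parameters have been tuned.
from pyannote.audio.pipeline import SpeakerDiarization

pipeline = SpeakerDiarization(
    sad_scores="/content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
    scd_scores="/content/fine/scd/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
    embedding="/content/fine/emb/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
)
pipeline.load_params("params.yml")   # hyper-parameters tuned on a development set

diarization = pipeline({"audio": "new.wav"})
for segment, _, label in diarization.itertracks(yield_label=True):
    print(segment, label)
```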
Ok, now it seems to be clearer :)
Could you please estimate how much audio (speech and non-speech) is required to fine-tune the models for a new speaker's voice?
It sounds like you might be misunderstanding what speaker diarization is.
If you are trying to detect a particular speaker (using an existing recording of this speaker as enrollment), what you want is speaker tracking, not speaker diarization.
Can you please describe precisely what your final task is?
Well, the task is to gather many hours of a particular speaker talking (to feed that data to a TTS such as Tacotron 2 and train it to speak with a new voice). So the idea is to download a lot of video/audio files with that speaker and other people talking, and to detect/extract all audio segments where my speaker is talking. In order to do this I am trying to use pyannote-audio and fine-tune the models to distinguish my speaker's voice from others.
I suggest you have a look at this issue that is very similar to what you are trying to achieve.
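The gist of that approach is to embed an enrollment sample of the target speaker and keep only the diarized segments whose embeddings are close to it. A rough sketch follows; the torch.hub entry points, threshold, and helper code are assumptions for illustration, not an API confirmed in this thread:

```python
# Sketch of enrollment-based speaker tracking on top of pretrained models.
import numpy as np
import torch

emb = torch.hub.load("pyannote/pyannote-audio", "emb")   # speaker embedding model
dia = torch.hub.load("pyannote/pyannote-audio", "dia")   # diarization pipeline

# Enrollment: average frame-level embeddings over a clean sample of the target speaker.
enrollment = emb({"audio": "kirilov_sample.wav"})
target = np.mean(enrollment.data, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

test_file = {"audio": "new.wav"}
diarization = dia(test_file)
embeddings = emb(test_file)

kept = []
for segment, _ in diarization.itertracks():
    vec = np.mean(embeddings.crop(segment, mode="center"), axis=0)
    if cosine(vec, target) > 0.7:     # arbitrary threshold, to be tuned
        kept.append(segment)
print(kept)
```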
For now the problem I see is that when I use a random video, for example https://www.youtube.com/watch?v=5m8SSt4gp7A (sorry for the Russian, but the language itself does not matter here), there are actually 2 people talking (I. Kirillov most of the time and some other guy at the end of the video), but the SpeakerDiarization pipeline returns only one speaker talking all the time (it treats both speakers as the same person). I thought fine-tuning the models for I. Kirillov's voice would let it distinguish his voice from other speakers.
Hi @hbredin, @marlon-br,
I'm trying to fine-tune the dia models using my training data. It works fine for `sad` and `scd`, but when it comes to the `emb` model:
```
$ pyannote-audio emb train --pretrained=emb_voxceleb --subset=train --to=1 --parallel=4 "experiments/train_outputs/emb" ADVANCE.SpeakerDiarization.advComp01
```
I got the following error:
```
/usr/local/lib/python3.7/site-packages/pyannote/database/database.py:51: UserWarning: Ignoring deprecated 'preprocessors' argument in MUSAN.__init__. Pass it to 'get_protocol' instead.
  warnings.warn(msg)
Using cache found in /Users/xx.yy/.cache/torch/hub/pyannote_pyannote-audio_develop
Traceback (most recent call last):
  File "/usr/local/bin/pyannote-audio", line 8, in <module>
  File "/usr/local/lib/python3.7/site-packages/pyannote/audio/applications/pyannote_audio.py", line 366, in main
    app.train(protocol, **params)
  File "/usr/local/lib/python3.7/site-packages/pyannote/audio/applications/base.py", line 198, in train
    protocol_name, progress=True, preprocessors=preprocessors
TypeError: get_protocol() got an unexpected keyword argument 'progress'
```
Knowing that I've installed pyannote.db.voxceleb, am I missing something else?
@marlon-br
For now the problem I see is that when I use a random video, for example https://www.youtube.com/watch?v=5m8SSt4gp7A (sorry for the Russian, but the language itself does not matter here), there are actually 2 people talking (I. Kirillov most of the time and some other guy at the end of the video), but the SpeakerDiarization pipeline returns only one speaker talking all the time (it treats both speakers as the same person). I thought fine-tuning the models for I. Kirillov's voice would let it distinguish his voice from other speakers.
This may happen when you directly apply the dia/ami trained models to your own data, or when you fine-tune them using a very small training set and/or for just a couple of epochs!
`TypeError: get_protocol() got an unexpected keyword argument 'progress'`
Knowing that I've installed pyannote.db.voxceleb, am I missing something else?
The API of pyannote.database.get_protocol has changed recently. Can you try with the latest version of pyannote.audio (develop branch)?
This problem should be fixed in https://github.com/pyannote/pyannote-audio/commit/c3791bc02ce5bc839a559427628592fad62fdf79.
Thanks for your prompt answer!
The problem was resolved but I got a new one:
`ValueError: Missing mandatory 'uri' entry in ADVANCE.SpeakerDiarization.advComp01.train`
Actually, yesterday I got the same problem after cloning the latest dev version of the project, and I thought that it was maybe linked to the GPU server! Now I'm facing the same error on my local machine. Any idea how I can solve this?
Yes, this is due to the latest version of pyannote.database: the syntax for defining custom speaker diarization protocols has also changed a bit.
The data preparation tutorial has been updated accordingly: https://github.com/pyannote/pyannote-audio/tree/develop/tutorials/data_preparation
I've added the .lst file and now it works! Thanks a lot. I have just a few questions regarding the input data and the outputs. The system outputs these warnings:

`Did not load optimizer state (most likely because current training session uses a different loss than the one used for pre-training).`

`Existing precomputed key "annotation" has been modified by a preprocessor. warnings.warn(msg.format(key=key))`

How can I deal with them?
* My data are stereo with a 44 kHz sample rate, should I downsample them to 16 kHz?
Data are downsampled on-the-fly. But it probably does not hurt efficiency to downsample them first.
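If you do want to convert the files up front, here is a minimal sketch (assuming torchaudio is installed; the filenames are placeholders):

```python
# Sketch: convert a stereo 44.1 kHz file to mono 16 kHz before training.
import torchaudio

waveform, sample_rate = torchaudio.load("input_44k_stereo.wav")
waveform = waveform.mean(dim=0, keepdim=True)                  # stereo -> mono
resampler = torchaudio.transforms.Resample(sample_rate, 16000)
torchaudio.save("output_16k_mono.wav", resampler(waveform), 16000)
```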
* the system outputs some warning: ` Did not load optimizer state (most likely because current training session uses a different loss than the one used for pre-training).`
This one happens because your list of speakers used for fine-tuning differs from the one used for the original pretraining. Hence the final classification layer has a different shape... This is just a warning: you can simply ignore this.
`Existing precomputed key "annotation" has been modified by a preprocessor. warnings.warn(msg.format(key=key))` How can I deal with them ?
This is due to an additional safety check that happens in speaker diarization protocols: https://github.com/pyannote/pyannote-database/blob/b6e855710dd8e4336de2d0e1c95361c405852534/pyannote/database/protocol/speaker_diarization.py#L100-L102. It looks like some of the provided RTTM annotations are outside of the actual file extent (or of the provided UEM).
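One way to locate the offending files is to compare, for each training file, the extent of the RTTM annotation with the annotated UEM regions. A sketch, assuming the ADVANCE.SpeakerDiarization.advComp01 protocol mentioned earlier in this thread:

```python
# Sketch: flag files whose RTTM annotation spills outside the UEM "annotated" part.
from pyannote.database import get_protocol, FileFinder

protocol = get_protocol(
    "ADVANCE.SpeakerDiarization.advComp01",
    preprocessors={"audio": FileFinder()},
)

for f in protocol.train():
    annotated = f["annotated"].extent()                     # Segment covering the UEM
    annotation = f["annotation"].get_timeline().extent()    # Segment covering the RTTM
    if annotation.start < annotated.start or annotation.end > annotated.end:
        print(f"{f['uri']}: RTTM extent {annotation} exceeds UEM extent {annotated}")
```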
@hbredin Dear Hervé, I trained the diarization pipeline after fine-tuning the sub-models (on my own data) and extracting the raw scores. The results obtained on the test set are quite good. However, I got a high confusion score on some audios. I'm wondering whether this is linked to the speaker embedding module? Another question: can I use a pretrained model (e.g. for EMB) along with the fine-tuned ones (e.g. for SAD and SCD)? If yes, please tell me how.
I trained the diarization pipeline after fine-tuning the sub-models (on my own data) and extracting the raw scores. The results obtained on the test set are quite good. However, I got a high confusion score on some audios. I'm wondering whether this is linked to the speaker embedding module?
It could be, indeed.
Another question: can I use a pretrained model (e.g. for EMB) along with the fine-tuned ones (e.g. for SAD and SCD)? If yes, please tell me how.
Yes, you can mix pretrained and fine-tuned models. See related issues #439 and #430.
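Concretely, this can look like the snippet below, mixing a pretrained embedding (referenced by name) with fine-tuned SAD/SCD weights. The paths are placeholders, and whether a given name or path is accepted depends on the pyannote.audio version; see the issues linked above:

```python
# Sketch: pretrained speaker embedding + fine-tuned SAD and SCD weights.
from pyannote.audio.pipeline import SpeakerDiarization

pipeline = SpeakerDiarization(
    sad_scores="/path/to/finetuned/sad/weights/0050.pt",   # placeholder path
    scd_scores="/path/to/finetuned/scd/weights/0050.pt",   # placeholder path
    embedding="emb_voxceleb",                              # pretrained model, by name
)
```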
Thanks @hbredin for your reply. My audios have only two speakers, so I'm wondering whether forcing the model to always consider 2 speakers could help improve the DER? If yes, where can I set this parameter? Is it the "number of speakers per batch" (per_fold) parameter in the embedding's config file?
There is currently no way to constrain the number of speakers.
Instead, you should tune the pipeline hyper-parameters so that the clustering thresholds and stopping criterion somehow learn the type of data (here, a limited number of speakers).
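For reference, hyper-parameter tuning of this kind can be done with pyannote.pipeline's optimizer. The sketch below assumes a development subset with reference annotations and is only indicative of the general shape of the API, not of the exact options:

```python
# Sketch: tune the diarization pipeline hyper-parameters on a development set.
from pyannote.audio.pipeline import SpeakerDiarization
from pyannote.database import get_protocol, FileFinder
from pyannote.pipeline import Optimizer

pipeline = SpeakerDiarization(
    sad_scores="/path/to/sad/weights/0050.pt",   # placeholder paths
    scd_scores="/path/to/scd/weights/0050.pt",
    embedding="/path/to/emb/weights/0050.pt",
)

protocol = get_protocol("ADVANCE.SpeakerDiarization.advComp01",
                        preprocessors={"audio": FileFinder()})
dev_files = list(protocol.development())

optimizer = Optimizer(pipeline)
optimizer.tune(dev_files, n_iterations=50)   # minimizes the pipeline loss (DER here)
print(optimizer.best_params)
```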
Closing this issue as it has diverged from the original. Please open a new one if needed.