Closed · zuowanbushiwo closed this issue 1 year ago
It can run after modifying the loss and contrastive_learning. Does this change affect performance? Can it reach the Libri2Mix-clean-16k-max TD-DP-SkiM (causal) result in your README?
Hi,
If you use contrastive_learning, the batch_size option is ignored; the batch size is then set to p_spks x p_utts.
BTW, the repeat option does not affect the batch size; it only affects the total number of batches in each epoch.
Changing the speaker loss or any hyperparameter will certainly change the performance.
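The rule above can be sketched as follows. `effective_batch_size` is a hypothetical helper for illustration, not a PureSound function; the real logic lives in the repo's dataloader setup:

```python
def effective_batch_size(contrastive_learning, batch_size, p_spks, p_utts):
    # With contrastive learning enabled, each batch is built from p_spks
    # speakers x p_utts utterances each, and batch_size is ignored.
    if contrastive_learning:
        return p_spks * p_utts
    return batch_size

print(effective_batch_size(True, 32, p_spks=6, p_utts=4))   # 24
print(effective_batch_size(False, 32, p_spks=6, p_utts=4))  # 32
```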
Hi Wu, thank you very much for your reply. From your experience, if I want better results, should I set contrastive_learning to true and use the GE2E loss at the same time, with p_spks and p_utts set to appropriate values? And can I get better results if I use a pre-trained speaker embedding model (like WeSpeaker)? Thanks!
Result:

```
epoch: 135
lr: 7.8125e-07
loss: -12.406385647825322
best_epoch: 110
best_dev_loss: -9.987813591323
```
Conf:

```yaml
DATASET:
  type: TSE
  sample_rate: 16000
  max_length: 3 # input snippet length
  train: /home/yangjie/PureSound/egs/tse/train-100-clean # train set folder
  dev: /home/yangjie/PureSound/egs/tse/dev-clean # dev set folder
  eval: /home/yangjie/PureSound/egs/tse/test-clean # eval set folder for tensorboard logging
  noise_folder: # if not None, applies noise inject
  rir_folder: # if not None, applies RIR
  rir_mode: # image/direct/early
  vol_perturbed: # Tuple
  speed_perturbed: False
  perturb_frequency_response: False
  single_spk_prob: 0.
  inactive_training: 0.
  enroll_rule: fixed_length
  enroll_augment: False
MODEL:
  type: tse_skim_v0_causal # model-name
LOSS:
  sig_loss: sisnr
  sig_threshold: #-30
  alpha: 1
  cls_loss: aamsoftmax
  cls_loss_other:
  embed_dim: 192 # for aamsoftmax
  n_class: 251 # for aamsoftmax
  margin: 0.2 # for aamsoftmax
  scale: 30 # for aamsoftmax
OPTIMIZER:
  gradiend_clip: 10
  lr: 0.0001
  num_epochs_decay: 0
  lr_scheduler: Plateau
  mode: min
  patience: 5
  gamma: 0.5
  beta1: 0.9
  beta2: 0.999
  weight_decay: 0.
  multi_rate: False
TRAIN:
  num_epochs: 200
  resume_epoch:
  contrastive_learning: False
  p_spks: 6
  p_utts: 4
  repeat: 1
  batch_size: 32
  multi_gpu: False
  num_workers: 2
  model_average:
  use_tensorboard: True
  model_save_dir: models # model save folder
  log_dir: logs # logging save folder
```
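The LOSS section of the config above uses an AAM-Softmax (ArcFace-style) classifier head with margin 0.2 and scale 30. A minimal NumPy sketch of the additive-angular-margin logit computation, with hypothetical random inputs; PureSound's actual implementation is in PyTorch and may differ in details:

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin logits.

    embeddings: (B, embed_dim), weights: (n_class, embed_dim), labels: (B,).
    The margin is added to the angle of the target class before re-scaling,
    which pushes same-class embeddings closer on the hypersphere.
    """
    # Cosine similarity between L2-normalised embeddings and class weights.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # (B, n_class)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))      # angles in [0, pi]
    # Add the margin only on the target-class angle.
    bump = np.zeros_like(cos)
    bump[np.arange(len(labels)), labels] = margin
    return scale * np.cos(theta + bump)

# Toy shapes matching the config: embed_dim=192, n_class=251.
rng = np.random.default_rng(0)
logits = aam_softmax_logits(rng.normal(size=(4, 192)),
                            rng.normal(size=(251, 192)),
                            np.array([0, 1, 2, 3]))
print(logits.shape)  # (4, 251)
```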
Hi,
In my experience, even the best pre-trained speaker identification or speaker verification model does not achieve better performance. My view is that the speaker embedding from a pre-trained model only provides guidance in the TSE task, rather than improving the separation performance itself. There are some reference papers:
But joint training causes OOM issues during training when your batch size is as large as when you train the SID/SV task independently. So it is a trade-off.
Thanks.
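One common way to ease the OOM trade-off mentioned above is gradient accumulation: hold only one micro-batch in memory at a time, sum the gradients, and step the optimizer once. A minimal NumPy sketch under a toy quadratic loss; `train_step_accumulated` and `grad_fn` are hypothetical names for illustration, not PureSound APIs:

```python
import numpy as np

def train_step_accumulated(params, micro_batches, grad_fn, lr=1e-4):
    """Simulate a large batch by accumulating gradients over micro-batches.

    grad_fn(params, batch) returns the mean gradient for one micro-batch;
    averaging over micro-batches reproduces the full-batch gradient while
    only one micro-batch needs to fit in memory at a time.
    """
    acc = np.zeros_like(params)
    for batch in micro_batches:
        acc += grad_fn(params, batch)
    return params - lr * acc / len(micro_batches)

# Toy quadratic loss 0.5 * ||params - mean(batch)||^2, gradient p - mean(b).
grad_fn = lambda p, b: p - b.mean(axis=0)
full = np.arange(8.0).reshape(8, 1)
p0 = np.zeros((1,))

stepped = train_step_accumulated(p0, np.split(full, 4), grad_fn)
big = p0 - 1e-4 * grad_fn(p0, full)   # single full-batch step for comparison
print(np.allclose(stepped, big))      # True
```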
Hi Wu, thanks! Your answer is very helpful to me. I use the skim_causal_460_wNoise_IS_tsdr model in a real environment; most of the time the result is not good, and it is especially poor with a music background. Is it because the training data is too small? Can I use AISHELL, RIRS_NOISES, and MUSAN together with the data augmentation methods you provide to fine-tune your skim_causal_460_wNoise_IS_tsdr? Is it possible to improve the result?
Absolutely.
For real-world use there are many more conditions you should consider.
The first issue is speaker variation (we cannot control how people speak at any moment): far-field conditions, head movement, and even accent will challenge your model's robustness. A related, more general issue is that a speech-only clue such as a speaker embedding is too weak to cover all of this, so visual clues or other modalities have already been adopted to improve the TSE task.
Another fundamental SE problem is environmental effects such as reverberation, noise, microphone frequency response, and so on.
Thanks
Thank you very much for your answer; it is very useful to me, as always. Next, I will try to train the model with more data and more data augmentation methods, hoping to improve the results.
Hi Wu, how do I train the TSE model? I use this command:

```shell
CUDA_VISIBLE_DEVICES=2 python main.py conf/libri2mix_max_2spk_clean_16k_1c.yaml --backend cuda
```

Although I set the batch size to 1, the following error is still reported. Is there any way to solve it?
My GPU is an RTX 3090 Ti. My conf:
The original libri2mix_max_2spk_clean_16k_1c.conf is missing some fields, which causes a crash. I have added them by comparing against TSE.yaml.
There is also something strange: no matter what I set batch_size to, the progress is always 1/1737. As I understand it, different batch sizes should give totals different from 1737. I also printed hparam["TRAIN"]["batch_size"] in the init_dataloader function, and the setting is indeed taking effect.
Thanks!
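For what it's worth, a constant 1737 batches per epoch regardless of batch_size usually means the DataLoader's length does not depend on batch_size at all, e.g. because the dataset already yields pre-batched items or a custom batch sampler overrides batch_size. A small sketch of the relationship one would normally expect (the item counts here are hypothetical):

```python
import math

batch_size = 32

# A standard PyTorch DataLoader reports ceil(len(dataset) / batch_size)
# batches per epoch, so changing batch_size should change the count:
for n_items in (1737, 55584):
    print(n_items, "items ->", math.ceil(n_items / batch_size), "batches")

# If the progress bar shows x/1737 for every batch_size, then
# len(dataloader) is fixed at 1737 -- worth checking whether the dataset's
# __getitem__ already returns a whole batch, or a batch sampler is in use.
```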