mcw519 / PureSound

Make the sound you hear pure and clean by deep learning.

How to train tse? #5

Closed zuowanbushiwo closed 1 year ago

zuowanbushiwo commented 1 year ago

Hi Wu,

How do I train the TSE model? I use this command:

CUDA_VISIBLE_DEVICES=2 python main.py conf/libri2mix_max_2spk_clean_16k_1c.yaml --backend cuda

Although I set the batch size to 1, the following error is still reported. Is there any way to solve it?


init loss function: sisnr
Initialized a multi-task model, sharing a same speech encoder.
Multi-task training has two loss function.
Current task label: 1
---------------Verbose logging---------------
Current training mode is: False
Total params: 6375442
Lookahead(samples): 16
Receptive Fields(samples): infinite
Current training mode is: True
---------------Verbose logging---------------
  0%|                                                                                                                            | 0/1737 [00:00<?, ?it/s]epoch: 0, iter: 1, batch_loss: 44.908958435058594, signal_loss: 39.039031982421875, class_loss: 117.39849853515625
  0%|                                                                                                                    | 1/1737 [00:01<46:59,  1.62s/it]epoch: 0, iter: 2, batch_loss: 29.754398345947266, signal_loss: 23.969154357910156, class_loss: 115.7048568725586
  0%|                                                                                                                  | 1/1737 [00:02<1:17:31,  2.68s/it]
Traceback (most recent call last):
  File "main.py", line 466, in <module>
    main(config)
  File "main.py", line 154, in main
    trainer.train()
  File "/home/yangjie/PureSound/egs/tse/puresound/task/base.py", line 383, in train
    loss = self.train_one_epoch(current_epoch=epoch)
  File "/home/yangjie/PureSound/egs/tse/puresound/task/tse.py", line 597, in train_one_epoch
    loss.backward()
  File "/home/yangjie/miniconda3/envs/pyannote/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/yangjie/miniconda3/envs/pyannote/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 570.00 MiB (GPU 0; 23.70 GiB total capacity; 20.34 GiB already allocated; 380.81 MiB free; 21.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
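
The figures in the error message can be compared directly. The sketch below parses the message quoted above and computes how much memory PyTorch has reserved but not allocated. The helper names `to_gib` and `field` are hypothetical and exist only for this illustration.

```python
import re

# Error text copied from the traceback above.
msg = (
    "CUDA out of memory. Tried to allocate 570.00 MiB (GPU 0; 23.70 GiB total "
    "capacity; 20.34 GiB already allocated; 380.81 MiB free; 21.50 GiB reserved "
    "in total by PyTorch)"
)

def to_gib(value, unit):
    """Convert a (value, unit) pair from the message to GiB."""
    return float(value) / 1024 if unit == "MiB" else float(value)

def field(pattern):
    """Extract one size from the message and return it in GiB."""
    value, unit = re.search(pattern, msg).groups()
    return to_gib(value, unit)

requested = field(r"Tried to allocate ([\d.]+) (MiB|GiB)")
allocated = field(r"([\d.]+) (MiB|GiB) already allocated")
reserved = field(r"([\d.]+) (MiB|GiB) reserved")
free = field(r"([\d.]+) (MiB|GiB) free")

# Memory held by the caching allocator but not backing live tensors.
cached_but_unused = reserved - allocated

print(f"requested {requested:.2f} GiB, free {free:.2f} GiB")
print(f"reserved - allocated = {cached_but_unused:.2f} GiB")
```

Reserved memory exceeds allocated memory by only about 1.16 GiB here, so the fragmentation hint (`max_split_size_mb`) is probably not the main lever; the run appears to be genuinely close to the card's capacity.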

My GPU is an RTX 3090 Ti. My conf:

DATASET:
  type: TSE
  sample_rate: 16000
  max_length: 3 # input snipit length
  train: /home/yangjie/PureSound/egs/tse/train-100-clean # train set folder
  dev: /home/yangjie/PureSound/egs/tse/dev-clean # dev set folder
  eval: /home/yangjie/PureSound/egs/tse/test-clean # eval set folder for tensorboard logging
  noise_folder: # if not None, applies noise inject
  rir_folder: # if not None, applies RIR
  rir_mode: # image/direct/early
  vol_perturbed: # Tuple
  speed_perturbed: False
  perturb_frequency_response: False
  single_spk_prob: 0.
  inactive_training: 0.
  enroll_rule: fixed_length
  enroll_augment: False

MODEL:
  type: tse_skim_v0_causal # model-name

LOSS:
  sig_loss: sisnr
  sig_threshold: #-30
  alpha: 0.05
  cls_loss: ge2e
  cls_loss_other:
  embed_dim: 192 # for aamsoftmax
  n_class: 251 # for aamsoftmax
  margin: 0.2 # for aamsoftmax
  scale: 30 # for aamsoftmax

OPTIMIZER:
  gradiend_clip: 10
  lr: 0.0001
  num_epochs_decay: 0
  lr_scheduler: Plateau
  mode: min
  patience: 5
  gamma: 0.5
  beta1: 0.9
  beta2: 0.999
  weight_decay: 0.
  multi_rate: False

TRAIN:
  num_epochs: 200
  resume_epoch:
  contrastive_learning: True
  p_spks: 12
  p_utts: 4
  repeat: 3
  batch_size: 1
  multi_gpu: False
  num_workers: 2
  model_average:
  use_tensorboard: True
  model_save_dir: models # model save folder
  log_dir: logs # logging save folder

The original libri2mix_max_2spk_clean_16k_1c.yaml is missing some fields, which causes a crash. I added them by comparing it against TSE.yaml:

diff --git a/egs/tse/conf/libri2mix_max_2spk_clean_16k_1c.yaml b/egs/tse/conf/libri2mix_max_2spk_clean_16k_1c.yaml
index 6b8b089..8de0aae 100644
--- a/egs/tse/conf/libri2mix_max_2spk_clean_16k_1c.yaml
+++ b/egs/tse/conf/libri2mix_max_2spk_clean_16k_1c.yaml
@@ -2,14 +2,15 @@ DATASET:
   type: TSE
   sample_rate: 16000
   max_length: 3 # input snipit length
-  train: train # train set folder
-  dev: dev # dev set folder
-  eval: eval # eval set folder for tensorboard logging
+  train: /home/yangjie/PureSound/egs/tse/train-100-clean # train set folder
+  dev: /home/yangjie/PureSound/egs/tse/dev-clean # dev set folder
+  eval: /home/yangjie/PureSound/egs/tse/test-clean # eval set folder for tensorboard logging
   noise_folder: # if not None, applies noise inject
   rir_folder: # if not None, applies RIR
   rir_mode: # image/direct/early
   vol_perturbed: # Tuple
   speed_perturbed: False
+  perturb_frequency_response: False
   single_spk_prob: 0.
   inactive_training: 0.
   enroll_rule: fixed_length
@@ -23,6 +24,7 @@ LOSS:
   sig_threshold: #-30
   alpha: 0.05
   cls_loss: ge2e
+  cls_loss_other:
   embed_dim: 192 # for aamsoftmax
   n_class: 251 # for aamsoftmax
   margin: 0.2 # for aamsoftmax
@@ -39,6 +41,7 @@ OPTIMIZER:
   beta1: 0.9
   beta2: 0.999
   weight_decay: 0.
+  multi_rate: False

 TRAIN:
   num_epochs: 200
@@ -47,9 +50,9 @@ TRAIN:
   p_spks: 12
   p_utts: 4
   repeat: 3
-  batch_size: 12
-  multi_gpu: True
-  num_workers: 10
+  batch_size: 1
+  multi_gpu: False
+  num_workers: 2
   model_average:
   use_tensorboard: True
   model_save_dir: models # model save folder

There is something strange here: no matter what I set batch_size to, the progress bar always shows 1/1737. As I understand it, a different batch size should give a total other than 1737. I also printed hparam["TRAIN"]["batch_size"] in the init_dataloader function, and the setting does take effect.

Thanks!

zuowanbushiwo commented 1 year ago

It runs after I modify the loss and contrastive_learning settings. Does this change affect performance? Can it still reach the result of Libri2Mix-clean-16k-max TD-DP-SkiM(causal) in your README?

mcw519 commented 1 year ago

Hi,

If you use contrastive_learning, the batch_size option is ignored and the batch size is set to p_spks x p_utts.

BTW, the repeat option does not affect the batch size; it only affects the total number of batches in each epoch.

Changing the speaker loss or any hyperparameter will definitely change the performance.
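
The batch-size rule in this reply can be sketched as a small helper. `effective_batch_size` is a hypothetical function written for illustration, not PureSound's actual code; it only restates the rule above using the TRAIN values from the two configs posted in this thread.

```python
def effective_batch_size(train_cfg):
    """Return the number of samples per batch implied by a TRAIN section.

    Hypothetical helper: with contrastive_learning enabled, batch_size is
    ignored and each batch holds p_spks speakers x p_utts utterances.
    """
    if train_cfg["contrastive_learning"]:
        return train_cfg["p_spks"] * train_cfg["p_utts"]
    return train_cfg["batch_size"]

# TRAIN section from the config that ran out of memory.
oom_cfg = {"contrastive_learning": True, "p_spks": 12, "p_utts": 4, "batch_size": 1}

# TRAIN section from the config that trained successfully.
working_cfg = {"contrastive_learning": False, "p_spks": 6, "p_utts": 4, "batch_size": 32}

print(effective_batch_size(oom_cfg))      # 12 * 4 samples, despite batch_size: 1
print(effective_batch_size(working_cfg))  # batch_size is used as written
```

Under this rule, the first config trains on 48 samples per batch even though `batch_size` is 1, which is consistent with both the out-of-memory error and the progress-bar total not changing when `batch_size` is edited.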

zuowanbushiwo commented 1 year ago

Hi Wu, thank you very much for your reply. In your experience, if I want better results, should I set contrastive_learning to True and use the ge2e loss at the same time, with p_spks and p_utts set to appropriate values? Could I get better results with a pre-trained speaker embedding model (like wespeaker)? Thanks!

zuowanbushiwo commented 1 year ago

Result:

epoch: 135
lr: 7.8125e-07
loss: -12.406385647825322
best_epoch: 110
best_dev_loss: -9.987813591323
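
The logged learning rate can be checked against the optimizer settings. This is a minimal sketch, assuming `lr_scheduler: Plateau` multiplies the learning rate by `gamma` each time the dev loss stops improving (the behaviour of PyTorch's `ReduceLROnPlateau`; whether PureSound wraps that class is an assumption). The values `lr: 0.0001` and `gamma: 0.5` are taken from the conf posted in this comment.

```python
import math

base_lr = 1e-4           # OPTIMIZER.lr from the conf
gamma = 0.5              # OPTIMIZER.gamma from the conf
logged_lr = 7.8125e-07   # lr reported at epoch 135

# Solve base_lr * gamma**n == logged_lr for n.
n_reductions = round(math.log(logged_lr / base_lr) / math.log(gamma))

print(n_reductions)
print(base_lr * gamma ** n_reductions)
```

Under that assumption the learning rate has been halved seven times by epoch 135, which fits a run whose best dev loss was reached at epoch 110 and has not improved since.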

Conf:

DATASET:
  type: TSE
  sample_rate: 16000
  max_length: 3 # input snipit length
  train: /home/yangjie/PureSound/egs/tse/train-100-clean # train set folder
  dev: /home/yangjie/PureSound/egs/tse/dev-clean # dev set folder
  eval: /home/yangjie/PureSound/egs/tse/test-clean # eval set folder for tensorboard logging
  noise_folder: # if not None, applies noise inject
  rir_folder: # if not None, applies RIR
  rir_mode: # image/direct/early
  vol_perturbed: # Tuple
  speed_perturbed: False
  perturb_frequency_response: False
  single_spk_prob: 0.
  inactive_training: 0.
  enroll_rule: fixed_length
  enroll_augment: False

MODEL:
  type: tse_skim_v0_causal # model-name

LOSS:
  sig_loss: sisnr
  sig_threshold: #-30
  alpha: 1
  cls_loss: aamsoftmax
  cls_loss_other:
  embed_dim: 192 # for aamsoftmax
  n_class: 251 # for aamsoftmax
  margin: 0.2 # for aamsoftmax
  scale: 30 # for aamsoftmax

OPTIMIZER:
  gradiend_clip: 10
  lr: 0.0001
  num_epochs_decay: 0
  lr_scheduler: Plateau
  mode: min
  patience: 5
  gamma: 0.5
  beta1: 0.9
  beta2: 0.999
  weight_decay: 0.
  multi_rate: False

TRAIN:
  num_epochs: 200
  resume_epoch:
  contrastive_learning: False
  p_spks: 6
  p_utts: 4
  repeat: 1
  batch_size: 32
  multi_gpu: False
  num_workers: 2
  model_average:
  use_tensorboard: True
  model_save_dir: models # model save folder
  log_dir: logs # logging save folder

mcw519 commented 1 year ago

Hi,

In my experience, even the best pre-trained speaker identification or speaker verification model does not give better performance. My thinking is that the speaker embedding from a pre-trained model only provides guidance in the TSE task rather than helping separation performance. Some reference papers:

  1. Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation
  2. Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

But joint training can cause OOM during training when the batch size is as large as the one you would use to train the SID/SV task on its own. So it is a trade-off.

Thanks.

zuowanbushiwo commented 1 year ago

Hi Wu, thanks! Your answer is very helpful to me. I use the skim_causal_460_wNoise_IS_tsdr model in a real environment. Most of the time the results are not good, and they are poor when there is background music. Is that because there is too little training data? Can I use AISHELL, RIRS_NOISES and musan together with the data augmentation methods you provide to fine-tune skim_causal_460_wNoise_IS_tsdr? Would that improve the results?

mcw519 commented 1 year ago

Absolutely.

For real-world use there are many more conditions you should consider.

The first issue is speaker variation (we can't control how someone speaks at any given moment): far-field speech, head movement and even accent all challenge the model's robustness. A general problem is that a speech cue alone, such as a speaker embedding, is too weak to cover these cases, so visual cues and other modalities have been added to improve the TSE task.

Another fundamental SE problem is environmental effects such as reverberation, noise, microphone and device response, and so on.

Thanks

zuowanbushiwo commented 1 year ago

Thank you very much for your answer; it is very useful to me, as always. Next, I will try to train the model with more data and more data augmentation methods, hoping to improve the results.
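
As an illustration of the noise-injection augmentation discussed in this thread, here is a minimal sketch of mixing a noise clip into speech at a target SNR. It is a generic technique written in plain Python, not PureSound's implementation; the function names `power` and `mix_at_snr`, the synthetic signals and the 5 dB target are all assumptions made for the example.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it.

    Returns the mixture and the scaled noise (both lists of floats).
    """
    target_ratio = 10 ** (snr_db / 10)
    scale = math.sqrt(power(speech) / (power(noise) * target_ratio))
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

# Synthetic stand-ins for real audio: one second at 16 kHz.
random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)

# Verify the SNR that was actually achieved.
achieved_snr_db = 10 * math.log10(power(speech) / power(scaled_noise))
print(round(achieved_snr_db, 3))
```

In practice the noise clips would come from a corpus such as musan, and the SNR would be drawn at random per training example so the model sees a range of conditions.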