kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Error in TTS teacher forcing while fine-tuning on a different dataset #315

Closed tongjiyiming closed 2 years ago

tongjiyiming commented 2 years ago

I am trying the fine-tuning part of the single-speaker TTS training described here: https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md

I am a first-time user. Could you help me with this issue?

I ran into an error when I executed the following:

!./run.sh \
    --ngpu 1 \
    --stage 7 \
    --vocoder_file "/ParallelWaveGAN/egs/chineinfocus_single/voc1/exp/train_nodev_parallel_wavegan.v1/checkpoint-400000steps.pkl" \
    --download_model "kan-bayashi/ljspeech_conformer_fastspeech2" \
    --test_sets "tr_no_dev dev" \
    --inference_args "--use_teacher_forcing true"

The log file shows:

# python3 -m espnet2.bin.tts_inference --ngpu 0 --data_path_and_name_and_type dump/raw/tr_no_dev/text,text,text --data_path_and_name_and_type dump/raw/tr_no_dev/wav.scp,speech,sound --key_file exp/kan-bayashi/ljspeech_conformer_fastspeech2/decode_use_teacher_forcingtrue_train.loss.ave/tr_no_dev/log/keys.1.scp --model_file exp/kan-bayashi/ljspeech_conformer_fastspeech2/train.loss.ave_5best.pth --train_config exp/kan-bayashi/ljspeech_conformer_fastspeech2/config.yaml --output_dir exp/kan-bayashi/ljspeech_conformer_fastspeech2/decode_use_teacher_forcingtrue_train.loss.ave/tr_no_dev/log/output.1 --vocoder_file /ParallelWaveGAN/egs/chineinfocus_single/voc1/exp/train_nodev_parallel_wavegan.v1/checkpoint-400000steps.pkl --config conf/decode.yaml --use_teacher_forcing true 
# Started at Sat Dec 18 03:08:37 UTC 2021
#
/opt/miniconda/bin/python3 /opt/miniconda/lib/python3.7/site-packages/espnet2/bin/tts_inference.py --ngpu 0 --data_path_and_name_and_type dump/raw/tr_no_dev/text,text,text --data_path_and_name_and_type dump/raw/tr_no_dev/wav.scp,speech,sound --key_file exp/kan-bayashi/ljspeech_conformer_fastspeech2/decode_use_teacher_forcingtrue_train.loss.ave/tr_no_dev/log/keys.1.scp --model_file exp/kan-bayashi/ljspeech_conformer_fastspeech2/train.loss.ave_5best.pth --train_config exp/kan-bayashi/ljspeech_conformer_fastspeech2/config.yaml --output_dir exp/kan-bayashi/ljspeech_conformer_fastspeech2/decode_use_teacher_forcingtrue_train.loss.ave/tr_no_dev/log/output.1 --vocoder_file /ParallelWaveGAN/egs/chineinfocus_single/voc1/exp/train_nodev_parallel_wavegan.v1/checkpoint-400000steps.pkl --config conf/decode.yaml --use_teacher_forcing true
2021-12-18 03:08:42,518 (tts:285) INFO: Vocabulary size: 78
2021-12-18 03:08:42,673 (fastspeech2:263) WARNING: Fallback to conformer_pos_enc_layer_type = 'legacy_rel_pos' due to the compatibility. If you want to use the new one, please use conformer_pos_enc_layer_type = 'latest'.
2021-12-18 03:08:42,673 (fastspeech2:270) WARNING: Fallback to conformer_self_attn_layer_type = 'legacy_rel_selfattn' due to the compatibility. If you want to use the new one, please use conformer_pos_enc_layer_type = 'latest'.
2021-12-18 03:08:45,370 (parallel_wavegan:230) INFO: Successfully registered stats as buffer.
2021-12-18 03:08:45,410 (tts_inference:125) INFO: Extractor:
LogMelFbank(
  (stft): Stft(n_fft=1024, win_length=1024, hop_length=256, center=True, normalized=False, onesided=True)
  (logmel): LogMel(sr=22050, n_fft=1024, n_mels=80, fmin=80, fmax=7600, htk=False)
)
2021-12-18 03:08:45,411 (tts_inference:126) INFO: Normalizer:
GlobalMVN(stats_file=/opt/miniconda/lib/python3.7/site-packages/espnet_model_zoo/59c43ac0d40b121060bd71dd418f5ece/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_tacotron2_teacher_forcing_train.loss.best/stats/train/feats_stats.npz, norm_means=True, norm_vars=True)
2021-12-18 03:08:45,414 (tts_inference:127) INFO: TTS:
FastSpeech2(
... ignore many lines on model structures
)
Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/miniconda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/bin/tts_inference.py", line 753, in <module>
    main()
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/bin/tts_inference.py", line 749, in main
    inference(**kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/bin/tts_inference.py", line 445, in inference
    output_dict = text2speech(**batch)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/bin/tts_inference.py", line 204, in __call__
    output_dict = self.model.inference(**batch, **cfg)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/tts/espnet_model.py", line 274, in inference
    durations=durations[None],
TypeError: 'NoneType' object is not subscriptable
# Accounting: time=11 threads=1
# Ended (code 1) at Sat Dec 18 03:08:48 UTC 2021, elapsed time 11 seconds

kan-bayashi commented 2 years ago

To perform teacher-forcing decoding with FastSpeech, we need to provide the ground-truth durations. Therefore, you need to add the option --teacher_dumpdir, the same as in training.
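
For readers hitting the same traceback, here is a rough illustration of why missing durations end in this TypeError. It is only a sketch, not the actual ESPnet code: `add_batch_axis` and `reproduce_error` are hypothetical helpers, and a plain list stands in for the tensor.

```python
# Minimal sketch, NOT the actual ESPnet implementation.
# It mirrors the `durations[None]` step from the traceback, where indexing
# with None is used to add a batch axis to the durations.


def reproduce_error():
    """Index a missing durations object the way the traceback does."""
    durations = None  # what arrives when no durations file was passed in
    try:
        durations[None]
    except TypeError as err:
        return str(err)


def add_batch_axis(durations):
    """Hypothetical guard that fails with a clearer message."""
    if durations is None:
        raise ValueError(
            "Teacher forcing needs ground-truth durations, but none were loaded."
        )
    # For a plain list, wrapping it in another list plays the role of x[None].
    return [durations]


if __name__ == "__main__":
    print(reproduce_error())
    print(add_batch_axis([3, 1, 4]))
```

The point is that the error is not about the vocoder or the model weights. The durations input was never loaded, so the fix is on the data side.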

tongjiyiming commented 2 years ago

Thank you! After a few tests, I got another error that I cannot solve:

FileNotFoundError: [Errno 2] No such file or directory: 'dump/raw/tr_no_dev/durations'

I tried to rerun my training stages:

!./run.sh --stage 0 --stop-stage 5 \
    --teacher_dumpdir "dump/raw" \
    --vocoder_file "/ParallelWaveGAN/egs/chineinfocus_single/voc1/exp/train_nodev_parallel_wavegan.v1/checkpoint-400000steps.pkl" \
    --download_model "kan-bayashi/ljspeech_conformer_fastspeech2"

I wonder why I got the same error. It looks like the durations file is not created during the data preparation stage.

[15c6733b68b7] 2021-12-22 03:59:21,992 (abs_task:1157) INFO: Namespace(accum_grad=1, allow_variable_data_keys=False, batch_bins=5120000, batch_size=20, batch_type='numel', best_model_criterion=[['valid', 'loss', 'min'], ['train', 'loss', 'min']], bpemodel=None, chunk_length=500, chunk_shift_ratio=0.5, cleaner='tacotron', collect_stats=True, config='conf/train.yaml', cudnn_benchmark=False, cudnn_deterministic=True, cudnn_enabled=True, detect_anomaly=False, dist_backend='nccl', dist_init_method='env://', dist_launcher=None, dist_master_addr=None, dist_master_port=None, dist_rank=None, dist_world_size=None, distributed=False, dry_run=False, early_stopping_criterion=('valid', 'loss', 'min'), energy_extract=None, energy_extract_conf={'fs': 22050, 'n_fft': 1024, 'hop_length': 256, 'win_length': None}, energy_normalize=None, energy_normalize_conf={}, feats_extract='fbank', feats_extract_conf={'n_fft': 1024, 'hop_length': 256, 'win_length': None, 'fs': 22050, 'fmin': 80, 'fmax': 7600, 'n_mels': 80}, fold_length=[], freeze_param=[], g2p='g2p_en_no_space', grad_clip=1.0, grad_clip_type=2.0, grad_noise=False, ignore_init_mismatch=False, init_param=[], iterator_type='sequence', keep_nbest_models=5, local_rank=None, log_interval=None, log_level='INFO', max_cache_fd=32, max_cache_size=0.0, max_epoch=200, model_conf={}, multiple_iterator=False, multiprocessing_distributed=False, ngpu=0, no_forward_run=False, non_linguistic_symbols=None, normalize=None, normalize_conf={}, num_att_plot=3, num_cache_chunks=1024, num_iters_per_epoch=500, num_workers=1, odim=None, optim='adam', optim_conf={'lr': 0.001, 'eps': 1e-06, 'weight_decay': 0.0}, output_dir='exp/tts_stats_raw_phn_tacotron_g2p_en_no_space/logdir/stats.1', patience=None, pitch_extract=None, pitch_extract_conf={'fs': 22050, 'n_fft': 1024, 'hop_length': 256, 'f0max': 400, 'f0min': 80}, pitch_normalize=None, pitch_normalize_conf={}, pretrain_path=None, print_config=False, required=['output_dir', 'token_list'], resume=False, 
scheduler=None, scheduler_conf={}, seed=0, sharded_ddp=False, sort_batch='descending', sort_in_batch='descending', token_list=['<blank>', '<unk>', 'AH0', 'N', 'T', 'D', 'S', 'R', 'L', 'DH', 'K', 'Z', 'IH1', 'IH0', 'M', 'EH1', 'W', 'P', 'AE1', 'AH1', 'V', 'ER0', 'F', ',', 'AA1', 'B', 'HH', 'IY1', 'UW1', 'IY0', 'AO1', 'EY1', 'AY1', '.', 'OW1', 'SH', 'NG', 'G', 'ER1', 'CH', 'JH', 'Y', 'AW1', 'TH', 'UH1', 'EH2', 'OW0', 'EY2', 'AO0', 'IH2', 'AE2', 'AY2', 'AA2', 'UW0', 'EH0', 'OY1', 'EY0', 'AO2', 'ZH', 'OW2', 'AE0', 'UW2', 'AH2', 'AY0', 'IY2', 'AW2', 'AA0', "'", 'ER2', 'UH2', '?', 'OY2', '!', 'AW0', 'UH0', 'OY0', '..', '<sos/eos>'], token_type='phn', train_data_path_and_name_and_type=[('dump/raw/tr_no_dev/text', 'text', 'text'), ('dump/raw/tr_no_dev/wav.scp', 'speech', 'sound'), ('dump/raw/tr_no_dev/durations', 'durations', 'text_int')], train_dtype='float32', train_shape_file=['exp/tts_stats_raw_phn_tacotron_g2p_en_no_space/logdir/train.1.scp'], tts='tacotron2', tts_conf={'embed_dim': 512, 'elayers': 1, 'eunits': 512, 'econv_layers': 3, 'econv_chans': 512, 'econv_filts': 5, 'atype': 'location', 'adim': 512, 'aconv_chans': 32, 'aconv_filts': 15, 'cumulate_att_w': True, 'dlayers': 2, 'dunits': 1024, 'prenet_layers': 2, 'prenet_units': 256, 'postnet_layers': 5, 'postnet_chans': 512, 'postnet_filts': 5, 'output_activation': None, 'use_batch_norm': True, 'use_concate': True, 'use_residual': False, 'dropout_rate': 0.5, 'zoneout_rate': 0.1, 'reduction_factor': 1, 'spk_embed_dim': None, 'use_masking': True, 'bce_pos_weight': 5.0, 'use_guided_attn_loss': True, 'guided_attn_loss_sigma': 0.4, 'guided_attn_loss_lambda': 1.0}, unused_parameters=False, use_amp=False, use_preprocessor=True, use_tensorboard=True, use_wandb=False, val_scheduler_criterion=('valid', 'loss'), valid_batch_bins=None, valid_batch_size=None, valid_batch_type=None, valid_data_path_and_name_and_type=[('dump/raw/dev/text', 'text', 'text'), ('dump/raw/dev/wav.scp', 'speech', 'sound'), ('dump/raw/dev/durations', 
'durations', 'text_int')], valid_max_cache_size=None, valid_shape_file=['exp/tts_stats_raw_phn_tacotron_g2p_en_no_space/logdir/valid.1.scp'], version='0.10.4a1', wandb_entity=None, wandb_id=None, wandb_model_log_interval=-1, wandb_name=None, wandb_project=None, write_collected_feats=False)
Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/miniconda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/bin/tts_train.py", line 22, in <module>
    main()
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/bin/tts_train.py", line 18, in main
    TTSTask.main(cmd=cmd)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/tasks/abs_task.py", line 994, in main
    cls.main_worker(args)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/tasks/abs_task.py", line 1198, in main_worker
    write_collected_feats=args.write_collected_feats,
  File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/main_funcs/collect_stats.py", line 55, in collect_stats
    for iiter, (keys, batch) in enumerate(itr, 1):
  File "/opt/miniconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/miniconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/miniconda/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/train/iterable_dataset.py", line 155, in __iter__
    files = [open(lis[0], encoding="utf-8") for lis in self.path_name_type_list]
  File "/opt/miniconda/lib/python3.7/site-packages/espnet2/train/iterable_dataset.py", line 155, in <listcomp>
    files = [open(lis[0], encoding="utf-8") for lis in self.path_name_type_list]
FileNotFoundError: [Errno 2] No such file or directory: 'dump/raw/tr_no_dev/durations'

kan-bayashi commented 2 years ago

The durations file is created by the teacher model, e.g., Tacotron 2 or Transformer. Please check how to train FastSpeech / FastSpeech 2: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#fastspeech-training
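
Before launching the stats or training stage, it may help to check that every input listed in the logged `train_data_path_and_name_and_type` actually exists. The sketch below does that. `find_missing_inputs` is a hypothetical helper, not part of ESPnet, and the paths are copied from the log above.

```python
import os


def find_missing_inputs(path_name_type_list):
    """Return the paths from (path, name, type) triples that do not exist."""
    return [
        path
        for path, _name, _type in path_name_type_list
        if not os.path.exists(path)
    ]


if __name__ == "__main__":
    # Triples taken from the Namespace dump in the log; adjust to your recipe.
    train_inputs = [
        ("dump/raw/tr_no_dev/text", "text", "text"),
        ("dump/raw/tr_no_dev/wav.scp", "speech", "sound"),
        ("dump/raw/tr_no_dev/durations", "durations", "text_int"),
    ]
    for path in find_missing_inputs(train_inputs):
        print(f"missing: {path}")
```

If the durations path is reported as missing, the teacher model's decoding output has not been generated or is not where --teacher_dumpdir points.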

tongjiyiming commented 2 years ago

Thank you! I think I do not need to retrain the teacher model, right? Can I just run the preprocessing stage?