p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License

Error when training model: AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown' #33

Closed Subarasheese closed 1 year ago

Subarasheese commented 1 year ago

Greetings,

As seen on https://github.com/p0p4k/vits2_pytorch/pull/10#issuecomment-1682307529, someone successfully trained models, so I decided to try it myself.

I used the following command:

python train.py -c configs/vits2_voice_training.json -m mydataset

However, the following happens:

INFO:mydataset:{'train': {'log_interval': 867, 'eval_interval': 867, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_voice_1_filelist_v4.txt', 'validation_files': 'filelists/val_voice_1_filelist_v4.txt', 'text_cleaners': ['basic_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': False, 'n_speakers': 0, 'cleaned_text': True, 'use_mel_spec_posterior': False}, 'model': {'use_mel_posterior_encoder': False, 'use_transformer_flows': True, 'transformer_flow_type': 'pre_conv', 'use_spk_conditioned_encoder': False, 'use_noise_scaled_mas': True, 'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'max_text_len': 500, 'model_dir': './logs/mydataset'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
Using lin posterior encoder for VITS1
Using transformer flows pre_conv for VITS2
Using normal encoder for VITS1
Using noise scaled MAS for VITS2
NOT using any duration discriminator like VITS1
Loading train data:   0%|                                                                                                                                 | 0/4 [00:00<?, ?it/s]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f0f9a049240>
Traceback (most recent call last):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1397, in _shutdown_workers
    if not self._shutdown:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown'
Traceback (most recent call last):
  File "/vits2_pytorch/train.py", line 417, in <module>
    main()
  File "/vits2_pytorch/train.py", line 54, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/vits2_pytorch/train.py", line 196, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc], [scheduler_g, scheduler_d, scheduler_dur_disc], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/vits2_pytorch/train.py", line 225, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(loader):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 988, in __init__
    super(_MultiProcessingDataLoaderIter, self).__init__(loader)
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 598, in __init__
    self._sampler_iter = iter(self._index_sampler)
  File "/vits2_pytorch/data_utils.py", line 400, in __iter__
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero

What could be the problem here, and what can I try to fix it?

p0p4k commented 1 year ago

Is this your private dataset? I suspect some of the text lengths might be shorter than the minimum value or something like that. You can copy the dataloader part into an .ipynb file and debug the data loading there. Load the hps from the config file using the function in utils.py for the dataloader.
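
A minimal notebook sketch of that idea (the names follow the original VITS code this repo is based on: TextAudioLoader, TextAudioCollate and DistributedBucketSampler in data_utils.py, get_hparams_from_file in utils.py; adjust them if they differ here, and the boundaries list is only illustrative):

# Rough debugging sketch; assumes the VITS-style names TextAudioLoader,
# TextAudioCollate, DistributedBucketSampler and utils.get_hparams_from_file.
from torch.utils.data import DataLoader

import utils
from data_utils import TextAudioLoader, TextAudioCollate, DistributedBucketSampler

hps = utils.get_hparams_from_file("configs/vits2_voice_training.json")

train_dataset = TextAudioLoader(hps.data.training_files, hps.data)
print("items left after length filtering:", len(train_dataset))

# Single-process version of the sampler train.py builds (num_replicas=1, rank=0);
# the boundaries list is illustrative, copy the one from your train.py.
sampler = DistributedBucketSampler(
    train_dataset,
    hps.train.batch_size,
    [32, 300, 400, 500, 600, 700, 800, 900, 1000],
    num_replicas=1,
    rank=0,
    shuffle=True,
)

loader = DataLoader(
    train_dataset,
    num_workers=0,               # keep everything in the main process while debugging
    collate_fn=TextAudioCollate(),
    batch_sampler=sampler,
)

# Iterating once runs the same sampler __iter__ that raised the ZeroDivisionError.
batch = next(iter(loader))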

hildazzz commented 1 year ago

Or you can decrease the boundary values in DistributedBucketSampler and see what happens.
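
For reference, the boundaries are passed where train.py builds the sampler; a sketch based on the original VITS training script (the exact values in this repo may differ):

# Sketch of the sampler construction in train.py (VITS-style). The list holds the
# spectrogram-length bucket edges; try lowering the first value (e.g. 32 -> 16)
# or using fewer, wider buckets.
train_sampler = DistributedBucketSampler(
    train_dataset,
    hps.train.batch_size,
    [32, 300, 400, 500, 600, 700, 800, 900, 1000],
    num_replicas=n_gpus,
    rank=rank,
    shuffle=True,
)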

Subarasheese commented 1 year ago

Is this your private dataset? I suspect some of the text lengths might be shorter than the minimum value or something like that. You can copy the dataloader part into an .ipynb file and debug the data loading there. Load the hps from the config file using the function in utils.py for the dataloader.

Yes, it is a custom dataset. My dataset looks normal, and the texts are pretty long. The code errors out after this line, and I am not sure why, as the error logs are not clear. What do I need to inspect to find out what is wrong?

[screenshots]

Or you can decrease the boundary values in DistributedBucketSampler and see what happens.

Where can I do that?

By the way, this is the config file I am using for training:


{
    "train": {
      "log_interval": 867,
      "eval_interval": 867,
      "seed": 1234,
      "epochs": 20000,
      "learning_rate": 2e-4,
      "betas": [0.8, 0.99],
      "eps": 1e-9,
      "batch_size": 16,
      "fp16_run": false,
      "lr_decay": 0.999875,
      "segment_size": 8192,
      "init_lr_ratio": 1,
      "warmup_epochs": 0,
      "c_mel": 45,
      "c_kl": 1.0
    },
    "data": {
      "training_files":"filelists/train_voice_1_filelist_v4.txt",
      "validation_files":"filelists/val_voice_1_filelist_v4.txt",
      "text_cleaners":["basic_cleaners"],
      "max_wav_value": 32768.0,
      "sampling_rate": 22050,
      "filter_length": 1024,
      "hop_length": 256,
      "win_length": 1024,
      "n_mel_channels": 80,
      "mel_fmin": 0.0,
      "mel_fmax": null,
      "add_blank": false,
      "n_speakers": 0,
      "cleaned_text": true,
      "use_mel_spec_posterior": false
    },
    "model": {
      "use_mel_posterior_encoder": false,
      "use_transformer_flows": true,
      "transformer_flow_type": "pre_conv",
      "use_spk_conditioned_encoder": false,
      "use_noise_scaled_mas": true,
      "inter_channels": 192,
      "hidden_channels": 192,
      "filter_channels": 768,
      "n_heads": 2,
      "n_layers": 6,
      "kernel_size": 3,
      "p_dropout": 0.1,
      "resblock": "1",
      "resblock_kernel_sizes": [3,7,11],
      "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
      "upsample_rates": [8,8,2,2],
      "upsample_initial_channel": 512,
      "upsample_kernel_sizes": [16,16,4,4],
      "n_layers_q": 3,
      "use_spectral_norm": false
    },
    "max_text_len": 500
  }
beqabeqa473 commented 1 year ago

Hi all.

By the way, can I turn off the validation and test lists? I can validate the model myself. My dataset is not large enough to throw away sentences for validation and test.

p0p4k commented 1 year ago

Just pass the same list to both train and validation. For the validation loader, use torch.utils.data.Subset and pass only 4-5 samples so that you still get the evaluation while training. If you want to turn off evaluation completely, just comment out the evaluate() call in train.py.
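
A rough sketch of the Subset idea (eval_dataset, collate_fn and hps follow the VITS-style train.py and may be named differently here):

# Sketch: run evaluation on only the first few samples via torch.utils.data.Subset.
from torch.utils.data import DataLoader, Subset

eval_subset = Subset(eval_dataset, list(range(5)))   # first 5 items only
eval_loader = DataLoader(
    eval_subset,
    num_workers=0,
    shuffle=False,
    batch_size=hps.train.batch_size,
    pin_memory=True,
    drop_last=False,
    collate_fn=collate_fn,
)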

beqabeqa473 commented 1 year ago

And what about test.txt in filelists?

p0p4k commented 1 year ago

Test is the evaluation in this repo.

beqabeqa473 commented 1 year ago

So validation and test are the same?

p0p4k commented 1 year ago

Yes, we just pass two lists in the config: train and val.

Subarasheese commented 1 year ago

Any idea what the problem might be?

What do you suggest I do with the dataset?

p0p4k commented 1 year ago

Print out the "len_bucket" in data_utils and try to debug from there.
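
Something like this inside DistributedBucketSampler.__iter__ in data_utils.py, just above the line from your traceback (the surrounding names follow the original VITS code):

# Debug sketch inside DistributedBucketSampler.__iter__ (data_utils.py);
# variable names follow the original VITS implementation.
for i in range(len(self.buckets)):
    bucket = self.buckets[i]
    len_bucket = len(bucket)
    print(f"bucket {i}: len_bucket={len_bucket}")   # added debug print
    ids_bucket = indices[i]
    num_samples_bucket = self.num_samples_per_bucket[i]

    # add extra samples so the bucket is evenly divisible
    rem = num_samples_bucket - len_bucket
    # this is the line that raises ZeroDivisionError when len_bucket == 0
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]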

Subarasheese commented 1 year ago

Print out the "len_bucket" in data_utils and try to debug from there.

These were the outputs:

buckets from line 371: 0 8 34

[screenshot]

So the first bucket has length 0, the second has length 8, and the last has length 34.

Is there anything wrong with that?

p0p4k commented 1 year ago

So bucket length 0 is causing the issue; you cannot divide by 0. You can just disable this function in the dataloader entirely for now.
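
For example, a sketch of one way to get past it, skipping the empty bucket in the same loop (variable names follow the original VITS __iter__):

# Sketch: skip empty buckets in DistributedBucketSampler.__iter__ (data_utils.py)
# so that rem // len_bucket never divides by zero.
for i in range(len(self.buckets)):
    bucket = self.buckets[i]
    len_bucket = len(bucket)
    if len_bucket == 0:
        continue   # nothing fell into this bucket; skip it instead of crashing
    ids_bucket = indices[i]
    num_samples_bucket = self.num_samples_per_bucket[i]
    rem = num_samples_bucket - len_bucket
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]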

Subarasheese commented 1 year ago

@p0p4k I managed to get the training working by making a few changes, including:

1 - Skipping the "0" bucket in that for-loop to avoid the exception
2 - Editing the symbols file (non-English language)
3 - The mel_processing file had a bug; I needed to replace the mel assignment with this: mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
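
For the third change, a bit of context (a sketch, assuming the failure comes from librosa 0.10+, which made the arguments of librosa.filters.mel keyword-only):

# Sketch of the mel_processing.py change. Example values match the config above.
from librosa.filters import mel as librosa_mel_fn

sampling_rate, n_fft, num_mels, fmin, fmax = 22050, 1024, 80, 0.0, None

# old positional call (only accepted by older librosa):
# mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)

# keyword-argument call required by librosa >= 0.10:
mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax)
print(mel.shape)   # (80, 513)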

Now it has started training. The bucket issue sure is strange, though, and I believe it was not supposed to happen; whether it will compromise the model remains to be seen.

I am going ahead and closing the issue. If you can, please look into what might be causing this, or at least improve the logging to make the problem clearer.