jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Please help me! #72

Open Deerzh opened 1 year ago

Deerzh commented 1 year ago

When I run this command: `python train.py -c configs/ljs_base.json -m ljs_base`, the following error occurs:

```
[INFO] {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 1, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': '/home/zhang/compatibility_analysis/vits/filelists/ljs_audio_text_train_filelist.txt.cleaned', 'validation_files': '/home/zhang/compatibility_analysis/vits/filelists/ljs_audio_text_val_filelist.txt.cleaned', 'text_cleaners': ['english_cleaners2'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/ljs_base'}
./logs/ljs_base/G_5.pth
[INFO] Loaded checkpoint './logs/ljs_base/G_5.pth' (iteration 1)
./logs/ljs_base/D_5.pth
[INFO] Loaded checkpoint './logs/ljs_base/D_5.pth' (iteration 1)
enumerate(train_loader)= <enumerate object at 0x7f6583451e10>
[INFO] Train Epoch: 1 [0%]
[INFO] [3.5478122234344482, 0.7687594294548035, 0.36579176783561707, 91.01992797851562, 1.5905554294586182, 147.99365234375, 0, 0.0002]
[INFO] Saving model and optimizer state at iteration 1 to ./logs/ljs_base/G_0.pth
[INFO] Saving model and optimizer state at iteration 1 to ./logs/ljs_base/D_0.pth
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    main()
  File "train.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
```

```
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/zhang/compatibility_analysis/vits/train.py", line 117, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/home/zhang/compatibility_analysis/vits/train.py", line 138, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(train_loader):
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 971, in _next_data
    return self._process_data(data)
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
EOFError: Caught EOFError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zhang/compatibility_analysis/vits/data_utils.py", line 94, in __getitem__
    return self.get_audio_text_pair(self.audiopaths_and_text[index])
  File "/home/zhang/compatibility_analysis/vits/data_utils.py", line 62, in get_audio_text_pair
    spec, wav = self.get_audio(audiopath)
  File "/home/zhang/compatibility_analysis/vits/data_utils.py", line 74, in get_audio
    spec = torch.load(spec_filename)
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/serialization.py", line 755, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
```

I only changed epochs to 1 in ljs_base.json and reduced the number of .wav files listed in ljs_audio_text_train_filelist.txt.cleaned, because running the train command on the entire dataset gives a CUDA out of memory error. Can you help me fix this problem?
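
For context, the frame that actually raises the EOFError is the spectrogram cache in data_utils.py: get_audio loads a precomputed .spec.pt file sitting next to each wav if one exists. The sketch below is reconstructed from the traceback rather than copied from the repo, so details may differ; the point is that torch.load fails with exactly "EOFError: Ran out of input" when such a cache file exists but is empty or truncated, e.g. after an interrupted run.

```python
import os
import torch
from mel_processing import spectrogram_torch  # helper from the vits repo

def get_spec(audio_norm, wav_path, hps):
    # Rough sketch of the caching done in data_utils.TextAudioLoader.get_audio.
    # The spectrogram is cached next to the wav as <name>.spec.pt.
    spec_filename = wav_path.replace(".wav", ".spec.pt")
    if os.path.exists(spec_filename):
        # If a previous run was interrupted while writing this file, it can be
        # empty/truncated, and torch.load then raises "EOFError: Ran out of input"
        # (the error in the traceback above).
        spec = torch.load(spec_filename)
    else:
        spec = spectrogram_torch(audio_norm, hps.filter_length, hps.sampling_rate,
                                 hps.hop_length, hps.win_length, center=False)
        spec = torch.squeeze(spec, 0)
        torch.save(spec, spec_filename)
    return spec
```

If that is what happened here, deleting any empty *.spec.pt files next to the wavs before restarting lets them be regenerated.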

nikich340 commented 1 year ago

Removing wavs won't help with CUDA OOM (mels are loaded into RAM first); you should reduce the batch size instead (i.e. how many mels are loaded into GPU VRAM at once).
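
For reference, the batch size lives in the "train" section of configs/ljs_base.json (the value 16 appears in the config dump in the first post); the excerpt below shows only that key, and 8 is just an illustrative smaller value, so pick whatever fits your VRAM:

```json
{
  "train": {
    "batch_size": 8
  }
}
```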

nikich340 commented 1 year ago

Update: it seems the "Ran out of input" error comes from RAM overload, caused by too many DataLoader workers each keeping copies of the loaded audio in RAM. https://github.com/jaywalnut310/vits/pull/118/commits/1c6cd68b1287fad7782eec6d88012ea5ce09d614
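
If I recall the upstream code correctly, the worker count is passed straight to the DataLoader inside train.py (some forks read it from the config instead, as the next comment suggests). A sketch of the kind of change being recommended, with illustrative values, assuming train_dataset, collate_fn and train_sampler are already defined at that point in train.py:

```python
from torch.utils.data import DataLoader

# Fewer workers means fewer processes holding dataset state in RAM.
train_loader = DataLoader(
    train_dataset,
    num_workers=4,        # lowered from the default 8; illustrative value
    shuffle=False,
    pin_memory=True,
    collate_fn=collate_fn,
    batch_sampler=train_sampler,
)
```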

vidigal commented 1 year ago

Decreasing "num_workers" to 8 (config/model.json) worked for me; I was using 16. Thanks @nikich340