hayeong0 / Diff-HierVC

Official Pytorch Implementation of "Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation"

python train.py #7

Open lareina-a opened 3 months ago

lareina-a commented 3 months ago

I have found it

lareina-a commented 3 months ago

python train.py -c ckpt/config.json -m mymodel

INFO:mymodel:{'train': {'log_interval': 1000, 'eval_interval': 10000, 'save_interval': 10000, 'seed': 1234, 'epochs': 1000, 'optimizer': 'adamw', 'lr_decay_on': True, 'learning_rate': 5e-05, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 35840, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 1, 'aug': True, 'lambda_commit': 0.02}, 'data': {'sampling_rate': 16000, 'filter_length': 1280, 'hop_length': 320, 'win_length': 1280, 'n_mel_channels': 80, 'mel_fmin': 0, 'mel_fmax': 8000}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [5, 4, 4, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [11, 8, 8, 4, 4], 'mixup_ratio': 0.6, 'n_layers_q': 3, 'use_spectral_norm': False, 'hidden_size': 128}, 'diffusion': {'dec_dim': 64, 'spk_dim': 128, 'beta_min': 0.05, 'beta_max': 20.0}, 'model_dir': '/workspace/raid/ha0/logs_diffhier/mymodel'}
WARNING:mymodel:/root/autodl-tmp/Diff-HierVC-master/utils is not a git repository, therefore hash value comparison will be ignored.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.

Traceback (most recent call last):
  File "train.py", line 275, in <module>
    main()
  File "train.py", line 42, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/miniconda3/envs/diff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/diff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/diff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/diff/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/Diff-HierVC-master/train.py", line 68, in run
    train_dataset = AudioDataset(hps, training=True)
  File "/root/autodl-tmp/Diff-HierVC-master/utils/data_loader.py", line 22, in __init__
    self.filelist_path = config.data.train_filelist_path \
AttributeError: 'HParams' object has no attribute 'train_filelist_path'

Can you help me and tell me how to solve it?

hayeong0 commented 3 months ago

Hello, the filelist path should be specified in the config. The training config (config/config.json) is similar to the one shipped with the checkpoint (ckpt/config.json), but with the file path section added and updated. An example of the file list folder is as follows:

Each text file should include the paths for the wav, F0 and norm F0 files.
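As a quick sanity check before launching training, something like the sketch below can report whether the config has the filelist entry that the data loader reads. Only the key name train_filelist_path (under the data section) is confirmed by the traceback above; the helper name missing_filelist_keys and its required parameter are hypothetical and not part of the repository.

```python
import json


def missing_filelist_keys(config, required=("train_filelist_path",)):
    """Return the required keys that are absent from config["data"].

    Only "train_filelist_path" is known from the traceback; extend
    `required` with whatever other path keys your config/config.json uses.
    """
    data_section = config.get("data", {})
    return [key for key in required if key not in data_section]


def load_config(path):
    """Load a JSON config file into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# The checkpoint config from the log above has no filelist entry,
# which is what triggers the AttributeError.
ckpt_like = {"data": {"sampling_rate": 16000, "hop_length": 320}}
print(missing_filelist_keys(ckpt_like))
```

Running the same check on config/config.json via load_config should give an empty list once the file path section is filled in.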

Thank you.

lareina-a commented 3 months ago

wav, F0 and norm F0 files

Thank you for your response. Could you please let me know which dataset you are using? Is it possible to share it? Additionally, do we need to generate the wav, F0, and normalized F0 files ourselves?

Thank you.

markrmiller commented 2 months ago

Yes, you have to collect the wav files yourself; there are lots of datasets online. Then use something like CREPE to generate the pitch embeddings from them. It seems to expect 2-dimensional pitch embeddings, so I'm just unsqueezing a new zeroth dimension and hoping that works. Then you collect the mean and standard deviation and z-score them to get the normalized pitch embeddings.

All that seems to mostly work, keeping in mind there is a minimum length the wavs have to be based on the segment size.
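A minimal sketch of the normalization step described above, using NumPy. The thread does not say whether the statistics are per utterance or per speaker, or how unvoiced frames are handled; this sketch assumes per-utterance statistics over voiced frames (F0 > 0) and leaves unvoiced frames at zero. The function name zscore_f0 is hypothetical.

```python
import numpy as np


def zscore_f0(f0, eps=1e-8):
    """Z-score a 1-D F0 contour and add a leading dimension to both outputs.

    Assumptions (not confirmed by the repository): statistics are computed
    per utterance over voiced frames only, and unvoiced frames stay at 0.
    Returns (f0, norm_f0), each with shape (1, T).
    """
    f0 = np.asarray(f0, dtype=np.float32)
    voiced = f0 > 0
    norm_f0 = np.zeros_like(f0)
    if voiced.any():
        mean = f0[voiced].mean()
        std = f0[voiced].std()
        norm_f0[voiced] = (f0[voiced] - mean) / (std + eps)
    # Equivalent to torch.unsqueeze(f0, 0): shape (T,) -> (1, T)
    return f0[np.newaxis, :], norm_f0[np.newaxis, :]


f0, norm_f0 = zscore_f0([0.0, 100.0, 200.0, 0.0, 300.0])
print(f0.shape, norm_f0.shape)
```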

I still hit an issue though - during evaluation, it fails in the encoder because the pitch embedding mask somehow ends up being like twice the length of the pitch embedding, and when they are multiplied together it fails. Have not been able to figure out what is wrong yet. Perhaps unsqueezing a dimension for the pitch embedding is not correct, and the pitch embedding is supposed to be some different 2 dimension structure.

lareina-a commented 2 months ago

During evaluation, the length of the pitch embedding mask does not match the length of the pitch embedding itself, leading to a failure. How can this issue be resolved? If you have a solution, could you please share it?

markrmiller commented 2 months ago

I have no clue. All I can guess is that the pitch embeddings are supposed to be in some format that I don't know.

I made some changes that get it past evaluation, but then it dies with a similar issue in training.

Obviously, the code is wrong, or the data is, so I'm guessing it's the pitch embeddings.

This code is based on the GradTTS code, like a couple dozen other voice conversion models, and I typically haven't had much of an issue with the pitch embeddings in those other models, so I don't know what's up.

hayeong0 commented 2 months ago

As described in the paper, we use F0 information with four times higher resolution compared to Mel. Therefore, the F0 mask is four times longer than the Mel segment mask. Since the hop size is 320, we used segment length // 80 in the data loader.
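The arithmetic can be checked with the values from the config logged earlier in this thread (segment_size 35840, hop_length 320). This is only an illustration of the ratio; the helper name frames_per_segment is hypothetical.

```python
def frames_per_segment(segment_size, hop_length, f0_ratio=4):
    """Return (mel_frames, f0_frames, f0_hop) for one training segment.

    f0_ratio=4 reflects the statement that F0 has four times the
    temporal resolution of the Mel-spectrogram.
    """
    f0_hop = hop_length // f0_ratio          # 320 // 4 = 80 samples per F0 frame
    mel_frames = segment_size // hop_length  # Mel frames in the segment
    f0_frames = segment_size // f0_hop       # F0 frames in the segment
    return mel_frames, f0_frames, f0_hop


mel_frames, f0_frames, f0_hop = frames_per_segment(35840, 320)
print(mel_frames, f0_frames, f0_hop)  # 112 448 80
```

So an F0 file extracted at the Mel hop size (320 samples) would be four times too short for the mask the model builds, which matches the mismatch reported above.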

markrmiller commented 2 months ago

Yeah, that's my fault. I haven't read the paper in months. Thank you.

markrmiller commented 2 months ago

Thanks for making the training code available, by the way! I'm really looking forward to playing with this model.

Just have to produce about 700,000 new pitch embeddings. I'm only using 2 RTX 3090s, so I'm sure I have quite a bit of training time to go through.

I converted the diffusion model in GradSVC to use diffusers (discrete time steps) and a latent space to dramatically speed up training, and to use the diffusers schedulers, so I may take a look at doing that here. But if it's not relatively straightforward to reuse that, I'll probably just eat the super long training time.

hayeong0 commented 2 months ago

You're conducting interesting work! I plan to use a diffuser as well. I have used YAAPT, but if you have a large amount of data for training, I recommend using a relatively fast pitch extractor like Parselmouth for real-time extraction!

generalkeno-b commented 1 month ago

I modified the data loader. My F0 files are .npy files, so f0 = torch.load(f0_path) was not working; I used the following instead:

f0 = torch.from_numpy(np.load(f0_path))
f0 = torch.unsqueeze(f0, 0)  # to match the dimensions

Now max_f0_start = f0.shape[-1] - self.segment_length//80 gives a negative value in some cases, and therefore I'm getting this error:

  File "/home/aditya/Diff_Hier_VC/Diff-HierVC/train.py", line 283, in <module>
    main()
  File "/home/aditya/Diff_Hier_VC/Diff-HierVC/train.py", line 41, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/aditya/Diff_Hier_VC/Diff-HierVC/train.py", line 125, in run
    train_and_evaluate(rank, epoch, hps, [model, mel_fn, w2v, aug, net_v], optimizer,
  File "/home/aditya/Diff_Hier_VC/Diff-HierVC/train.py", line 144, in train_and_evaluate
    for batch_idx, (x, norm_f0, x_f0, length) in enumerate(train_loader):
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/dhvc/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/aditya/Diff_Hier_VC/Diff-HierVC/utils/data_loader.py", line 66, in __getitem__
    f0_start = np.random.randint(0, max_f0_start)
  File "mtrand.pyx", line 748, in numpy.random.mtrand.RandomState.randint
  File "_bounded_integers.pyx", line 1247, in numpy.random._bounded_integers._rand_int64
ValueError: high <= 0

any fixes?
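The error means the F0 contour has no more frames than one training segment needs (segment_length // 80), so np.random.randint(0, max_f0_start) gets an upper bound of 0 or less. As noted earlier in the thread, the wavs have a minimum length based on the segment size. One possible workaround, sketched below, is to zero-pad short contours and make the upper bound inclusive; the other is to drop short files from the filelist. This is not the repository's code: crop_f0 and F0_HOP are hypothetical names, and zero-padding is an assumption about what the model tolerates.

```python
import numpy as np

F0_HOP = 80  # samples per F0 frame (hop_length 320 / 4), as stated in the thread


def crop_f0(f0, segment_length, rng=None):
    """Randomly crop a (1, T) F0 array to segment_length // F0_HOP frames.

    Contours shorter than one segment are zero-padded on the right
    (an assumption; filtering such files out is the alternative).
    Returns (cropped_f0, f0_start).
    """
    if rng is None:
        rng = np.random.default_rng()
    n_frames = segment_length // F0_HOP
    short_by = n_frames - f0.shape[-1]
    if short_by > 0:
        # Pad only the last (time) axis.
        pad_width = [(0, 0)] * (f0.ndim - 1) + [(0, short_by)]
        f0 = np.pad(f0, pad_width)
    max_f0_start = f0.shape[-1] - n_frames  # now always >= 0
    # Upper bound is exclusive, so +1 keeps the case max_f0_start == 0 valid.
    f0_start = int(rng.integers(0, max_f0_start + 1))
    return f0[..., f0_start:f0_start + n_frames], f0_start


short = np.ones((1, 100), dtype=np.float32)
segment, start = crop_f0(short, segment_length=35840)
print(segment.shape, start)
```

If you crop this way, the audio and the normalized F0 need the same treatment so they stay aligned: the audio segment would start at sample f0_start * F0_HOP, and the same padding rule would apply to it.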