p0p4k / vits2_pytorch

Unofficial VITS2-TTS implementation in PyTorch
https://arxiv.org/abs/2307.16430
MIT License

add vctk training pipeline #61

Closed choiHkk closed 9 months ago

choiHkk commented 9 months ago

I have added the code for the duration discriminator, residual coupling layer, and training pipeline mentioned in the previous issue (#59).

Here are the changes:

- vctk_test.wav
- vits2_vctk_standard.json
- data_utils.py
- inference.ipynb
- mel_processing.py
- models.py
- train_ms.py

The above sentences were generated by ChatGPT.

p0p4k commented 9 months ago

Hi, thanks for the PR. Really well documented. One more thing: I just fixed a typo in mono_layer_flow; take a look and add it to this commit. Thanks.

choiHkk commented 9 months ago

@p0p4k The ONNX conversion has just completed successfully, and inference runs perfectly. Training stopped at step 91k; I will share the ONNX file via the Google Drive link below.

https://drive.google.com/drive/folders/1cWMiXSVGarHcVLaOzl568ndj4FuHEPUp?usp=sharing

choiHkk commented 9 months ago

@p0p4k I have discovered something amazing. Assuming the ResidualCouplingTransformersLayer2 module is used, voice conversion appears to be possible because the latent variables do not deviate significantly from the distribution designed in the original VITS. I have added the samples inside the resources directory.
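
For reference, voice conversion in the original VITS encodes the source audio into the posterior latent, runs it through the flow into the speaker-independent prior space, and inverts the flow with the target speaker's embedding. A minimal sketch, assuming the model exposes the `emb_g`, `enc_q`, `flow`, and `dec` attributes of the original `SynthesizerTrn` (not tested against this repo):

```python
import torch

@torch.no_grad()
def voice_convert(model, y, y_lengths, sid_src, sid_tgt):
    """Flow-based voice conversion in the style of the original VITS.

    Assumes `model` exposes `emb_g` (speaker embedding table), `enc_q`
    (posterior encoder), `flow` (residual coupling layers), and `dec`
    (HiFi-GAN decoder), as in the original SynthesizerTrn.
    """
    g_src = model.emb_g(sid_src).unsqueeze(-1)  # [b, h, 1]
    g_tgt = model.emb_g(sid_tgt).unsqueeze(-1)
    # Encode the source spectrogram into the posterior latent z.
    z, m_q, logs_q, y_mask = model.enc_q(y, y_lengths, g=g_src)
    # Map z into the prior space, conditioned on the source speaker...
    z_p = model.flow(z, y_mask, g=g_src)
    # ...then invert the flow, conditioned on the target speaker.
    z_hat = model.flow(z_p, y_mask, g=g_tgt, reverse=True)
    # Decode the converted latent to a waveform.
    return model.dec(z_hat * y_mask, g=g_tgt)
```

choiHkk's observation is that even with ResidualCouplingTransformersLayer2, `z_p` appears to stay close to the designed prior, which is what makes this round trip work.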

p0p4k commented 9 months ago

@choiHkk About voice conversion: it is quite possible that the "g" passed to the text_encoder ends up not being used at all. In some of my experiments, the "g" passed to the mel_encoder was being ignored as well. You can try to test these things; if you need help, let me know. Add me on Discord: p0p4k.
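
One cheap way to run that test is to perturb `g` and check whether the encoder output moves at all. A minimal sketch with a hypothetical helper, assuming `encoder(x, x_lengths, g=...)` returns its hidden states as the first element of its output tuple (true of the original VITS encoders):

```python
import torch

@torch.no_grad()
def probe_speaker_conditioning(encoder, x, x_lengths, g):
    """Hypothetical helper: if perturbing `g` barely changes the output,
    the speaker conditioning is effectively being ignored."""
    ref = encoder(x, x_lengths, g=g)[0]
    zeroed = encoder(x, x_lengths, g=torch.zeros_like(g))[0]
    noisy = encoder(x, x_lengths, g=torch.randn_like(g))[0]
    # Near-zero deltas mean the encoder has learned to ignore g.
    print("delta with zeroed g:", (ref - zeroed).abs().mean().item())
    print("delta with random g:", (ref - noisy).abs().mean().item())
```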

chengwuxinlin commented 8 months ago

> @p0p4k The ONNX conversion has just completed successfully, and inference runs perfectly. Training stopped at step 91k; I will share the ONNX file via the Google Drive link below.
>
> https://drive.google.com/drive/folders/1cWMiXSVGarHcVLaOzl568ndj4FuHEPUp?usp=sharing

Hello, thank you for the wonderful work. An error popped up when I ran your ONNX file; could you please help me check it? I downloaded the ONNX and JSON files from the Google Drive link and ran:

```
python infer_onnx.py --model="./pretrained_91k.onnx" --config-path="./vits2_vctk_standard.json" --output-wav-path="./trained_models/output.wav" --text="hello world, how are you?"
```

Then this error appeared:

```
2023-10-26 23:47:59.197006110 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Gather node. Name:'/emb_g/Gather' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/op_kernel_context.h:42 const T* onnxruntime::OpKernelContext::Input(int) const [with T = onnxruntime::Tensor] Missing Input: sid

Traceback (most recent call last):
  File "infer_onnx.py", line 59, in <module>
    main()
  File "infer_onnx.py", line 45, in main
    audio = model.run(
  File "/vits2_pytorch/new_env/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Gather node. Name:'/emb_g/Gather' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/op_kernel_context.h:42 const T* onnxruntime::OpKernelContext::Input(int) const [with T = onnxruntime::Tensor] Missing Input: sid
```

p0p4k commented 8 months ago

I think it needs an extra argument, the speaker id.

choiHkk commented 8 months ago

@chengwuxinlin I think you should pass the 'sid' argument when performing inference with ONNX, as follows:

```python
parser.add_argument("--sid", required=False, type=int, help="Speaker ID to synthesize")
```
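
For context, the `/emb_g/Gather` node in the error is the speaker-embedding lookup, so the ONNX session must be fed a `sid` tensor alongside the text inputs. A minimal sketch of what the session call needs (the input names here are assumptions based on common VITS exports, not confirmed against this repo's graph; check `sess.get_inputs()` for the real ones):

```python
import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession("pretrained_91k.onnx")
print([i.name for i in sess.get_inputs()])  # verify the assumed input names

phoneme_ids = np.array([[0, 20, 0, 33, 0]], dtype=np.int64)  # dummy sequence
audio = sess.run(
    None,
    {
        "input": phoneme_ids,
        "input_lengths": np.array([phoneme_ids.shape[1]], dtype=np.int64),
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
        "sid": np.array([4], dtype=np.int64),  # the previously missing input
    },
)[0]
```

On the command line, this corresponds to appending something like `--sid 4` to the `infer_onnx.py` invocation above.
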
chengwuxinlin commented 8 months ago

> I think it needs an extra argument, the speaker id.

Sorry, my bad. I accidentally deleted the speaker id from the input. It's all good now; thank you for the fast response.