p0p4k / vits2_pytorch

Unofficial VITS2-TTS implementation in PyTorch
https://arxiv.org/abs/2307.16430
MIT License

add vctk training pipeline #61

Closed choiHkk closed 9 months ago

choiHkk commented 9 months ago

I have added the code for the duration discriminator, residual coupling layer, and training pipeline mentioned in the previous issue (#59).

Here are the changes:

- vctk_test.wav
- vits2_vctk_standard.json
- data_utils.py
- inference.ipynb
- mel_processing.py
- models.py
- train_ms.py

The above sentences were generated by ChatGPT.

p0p4k commented 9 months ago

Hi, thanks for the PR. Really well documented. One more thing: I just fixed a typo in mono_layer_flow; take a look and add it to this commit. Thanks.

choiHkk commented 9 months ago

@p0p4k The ONNX conversion has just completed successfully, and inference runs perfectly. Training stopped at step 91k; I will share the ONNX file via the Google Drive link below.

https://drive.google.com/drive/folders/1cWMiXSVGarHcVLaOzl568ndj4FuHEPUp?usp=sharing

choiHkk commented 9 months ago

@p0p4k I have discovered something amazing. Assuming the ResidualCouplingTransformersLayer2 module is used, voice conversion appears to be possible because the latent variables do not deviate significantly from the distribution designed in the original VITS. I have added the samples inside the resources directory.
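
For reference, voice conversion in the original VITS encodes the source audio into the posterior latent, runs it through the flow into the speaker-independent prior space, and inverts the flow with the target speaker's embedding. A minimal sketch, assuming the model exposes the `emb_g`, `enc_q`, `flow`, and `dec` attributes of the original `SynthesizerTrn` (not tested against this repo):

```python
import torch

@torch.no_grad()
def voice_convert(model, y, y_lengths, sid_src, sid_tgt):
    """Flow-based voice conversion in the style of the original VITS.

    Assumes `model` exposes `emb_g` (speaker embedding table), `enc_q`
    (posterior encoder), `flow` (residual coupling layers), and `dec`
    (HiFi-GAN decoder), as in the original SynthesizerTrn.
    """
    g_src = model.emb_g(sid_src).unsqueeze(-1)  # [b, h, 1]
    g_tgt = model.emb_g(sid_tgt).unsqueeze(-1)
    # Encode the source spectrogram into the posterior latent z.
    z, m_q, logs_q, y_mask = model.enc_q(y, y_lengths, g=g_src)
    # Map z into the prior space, conditioned on the source speaker...
    z_p = model.flow(z, y_mask, g=g_src)
    # ...then invert the flow, conditioned on the target speaker.
    z_hat = model.flow(z_p, y_mask, g=g_tgt, reverse=True)
    # Decode the converted latent to a waveform.
    return model.dec(z_hat * y_mask, g=g_tgt)
```

choiHkk's observation is that even with ResidualCouplingTransformersLayer2, `z_p` appears to stay close to the designed prior, which is what makes this round trip work.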

p0p4k commented 9 months ago

@choiHkk About voice conversion: it is quite possible that the "g" passed to the text_encoder ends up not being used at all. In some of my experiments, the "g" passed to the mel_encoder was being ignored as well. You can try to test these things; if you need help, let me know. Add me on Discord: p0p4k.
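
One cheap way to run that test is to perturb `g` and check whether the encoder output moves at all. A minimal sketch with a hypothetical helper, assuming `encoder(x, x_lengths, g=...)` returns its hidden states as the first element of its output tuple (true of the original VITS encoders):

```python
import torch

@torch.no_grad()
def probe_speaker_conditioning(encoder, x, x_lengths, g):
    """Hypothetical helper: if perturbing `g` barely changes the output,
    the speaker conditioning is effectively being ignored."""
    ref = encoder(x, x_lengths, g=g)[0]
    zeroed = encoder(x, x_lengths, g=torch.zeros_like(g))[0]
    noisy = encoder(x, x_lengths, g=torch.randn_like(g))[0]
    # Near-zero deltas mean the encoder has learned to ignore g.
    print("delta with zeroed g:", (ref - zeroed).abs().mean().item())
    print("delta with random g:", (ref - noisy).abs().mean().item())
```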

chengwuxinlin commented 8 months ago

> @p0p4k The ONNX conversion has just completed successfully, and inference runs perfectly. Training stopped at step 91k; I will share the ONNX file via the Google Drive link below.
>
> https://drive.google.com/drive/folders/1cWMiXSVGarHcVLaOzl568ndj4FuHEPUp?usp=sharing

Hello, thank you for the wonderful work. An error popped up when I ran your ONNX file; could you please help me check it? I downloaded the ONNX and JSON files from the Google Drive link and ran:

```
python infer_onnx.py --model="./pretrained_91k.onnx" --config-path="./vits2_vctk_standard.json" --output-wav-path="./trained_models/output.wav" --text="hello world, how are you?"
```

Then this error appeared:

```
2023-10-26 23:47:59.197006110 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Gather node. Name:'/emb_g/Gather' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/op_kernel_context.h:42 const T* onnxruntime::OpKernelContext::Input(int) const [with T = onnxruntime::Tensor] Missing Input: sid

Traceback (most recent call last):
  File "infer_onnx.py", line 59, in <module>
    main()
  File "infer_onnx.py", line 45, in main
    audio = model.run(
  File "/vits2_pytorch/new_env/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Gather node. Name:'/emb_g/Gather' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/op_kernel_context.h:42 const T* onnxruntime::OpKernelContext::Input(int) const [with T = onnxruntime::Tensor] Missing Input: sid
```

p0p4k commented 8 months ago

I think it needs an extra argument, the speaker id.

choiHkk commented 8 months ago

@chengwuxinlin I think you should pass the 'sid' argument when performing inference with ONNX, as follows:

```python
parser.add_argument("--sid", required=False, type=int, help="Speaker ID to synthesize")
```
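
For context, the `/emb_g/Gather` node in the error is the speaker-embedding lookup, so the ONNX session must be fed a `sid` tensor alongside the text inputs. A minimal sketch of what the session call needs (the input names here are assumptions based on common VITS exports, not confirmed against this repo's graph; check `sess.get_inputs()` for the real ones):

```python
import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession("pretrained_91k.onnx")
print([i.name for i in sess.get_inputs()])  # verify the assumed input names

phoneme_ids = np.array([[0, 20, 0, 33, 0]], dtype=np.int64)  # dummy sequence
audio = sess.run(
    None,
    {
        "input": phoneme_ids,
        "input_lengths": np.array([phoneme_ids.shape[1]], dtype=np.int64),
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
        "sid": np.array([4], dtype=np.int64),  # the previously missing input
    },
)[0]
```

On the command line, this corresponds to appending something like `--sid 4` to the `infer_onnx.py` invocation above.
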
chengwuxinlin commented 8 months ago

> I think it needs an extra argument, the speaker id.

Sorry, my bad. I accidentally deleted the speaker id from the input. It's all good now; thank you for the fast response.