unilight / seq2seq-vc

A sequence-to-sequence voice conversion toolkit.

Process for training the three described techniques with any source speaker. #15

Open PAAYAS opened 1 month ago

PAAYAS commented 1 month ago

Hello @unilight, as part of my work on the LSC model, I want to convert the accent of another speaker, say ASI from the L2-ARCTIC dataset, to BDL (the target). The pretrained models you have provided appear to have been trained on the TXHC speaker. Could you give us a thorough training procedure so that we can train the non-parallel frame-based VC model or the vocoders on any source speaker?

I look forward to hearing from you. Thank you.

PAAYAS commented 1 month ago

If you could assist me with training the LSC or cascade model for a different source speaker, that would be greatly appreciated.

unilight commented 4 weeks ago

Hi @PAAYAS, can you try to follow the instructions in the README here: https://github.com/unilight/seq2seq-vc/tree/main/egs/l2-arctic/lsc, and then see if you run into any problems? If you only want to convert "from" a new speaker, it's actually quite simple: you only need to train the seq2seq model. (If you want to convert "to" a new speaker, it's much more troublesome.)
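For readers following this thread: the linked recipe appears to follow the usual kaldi/ESPnet-style staged run.sh layout, so a run for a new source speaker might look roughly like the sketch below. The stage numbers and flag names here are assumptions, not a verified transcription of the recipe; check egs/l2-arctic/lsc/run.sh for the authoritative interface.

```bash
# Hypothetical staged run of the LSC recipe for a new source speaker.
# Stage numbers and flag names are assumptions -- consult run.sh in
# egs/l2-arctic/lsc for the real options.
cd egs/l2-arctic/lsc

# Data preparation for the new source speaker (edit the source
# speaker variable in run.sh, e.g. to ASI, where it is defined).
./run.sh --stage 0 --stop_stage 1

# Feature extraction and statistics computation.
./run.sh --stage 2 --stop_stage 3

# Seq2seq model training: per the comment above, this is the only
# model that must be re-trained when converting FROM a new speaker.
./run.sh --stage 4 --stop_stage 4

# Decoding / conversion.
./run.sh --stage 5 --stop_stage 5
```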

PAAYAS commented 4 weeks ago

Hello @unilight, I trained the LSC model with ASI as the source speaker and BDL as the target. However, the decoded wav files contain the voice of the TXHC speaker, along with artifacts produced during decoding, rather than the voice of ASI. Could you let me know whether I should train any other models with the ASI speaker? Thank you.

PAAYAS commented 4 weeks ago

ASI_BDL_LSC.zip Here I am providing some of the results I obtained during decoding.

It seems that during the conversion stage of the LSC model we are using the pretrained decoder ppg_sxliu_decoder_TXHC and the pretrained vocoder pwg_TXHC. That seems to be the issue when converting from a new speaker.

unilight commented 4 weeks ago

@PAAYAS The current methods (all three) cannot convert from a specific new speaker without re-training (or fine-tuning) using the data from that new speaker.

PAAYAS commented 4 weeks ago

@unilight I see now. Could you assist me with how to fine-tune the models for a new speaker, or how to retrain them?

unilight commented 4 weeks ago

@PAAYAS Please try to follow the instructions in the readme here: https://github.com/unilight/seq2seq-vc/tree/main/egs/l2-arctic/lsc.

PAAYAS commented 4 weeks ago

@unilight Thank you, I will look through it once again.

PAAYAS commented 4 weeks ago

Greetings @unilight. As you indicated in https://github.com/unilight/seq2seq-vc/tree/main/egs/l2-arctic, you employ the S3PRL-VC toolkit for training the non-parallel frame-based VC model. Could you please help me train it on my own dataset?

unilight commented 4 weeks ago

You can try to follow the instructions at https://github.com/unilight/s3prl-vc/tree/main/egs/TEMPLATE/a2o_vc.
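As with the LSC recipe above, here is a hedged sketch of what adapting the TEMPLATE recipe to a custom dataset typically involves. The directory layout, script names, and flags below are assumptions based on common TEMPLATE-style recipes, so defer to the linked README.

```bash
# Hypothetical adaptation of the s3prl-vc a2o_vc TEMPLATE recipe to a
# custom dataset; script and flag names are assumptions -- check the
# TEMPLATE README and run.sh for the real interface.
cd egs/TEMPLATE/a2o_vc

# 1. Prepare your corpus in the layout the recipe's data preparation
#    script expects (wav files plus train/dev/test utterance lists).
# 2. Run the staged pipeline end to end for your target speaker.
./run.sh --stage 0 --stop_stage 5 \
    --trgspk my_target_speaker   # hypothetical flag; see run.sh
```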

PAAYAS commented 4 weeks ago

Hello @unilight, thank you for your time. I was able to train the non-parallel frame-based VC model on my dataset, but the waveforms produced during decoding do not seem to capture the speaker identity. Could you help me with how to train the vocoder model on any source speaker?
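This question is not answered later in the thread, but since the provided vocoder checkpoint is named pwg_TXHC, it is presumably a Parallel WaveGAN model trained with the kan-bayashi/ParallelWaveGAN toolkit. A minimal sketch of training (or fine-tuning) such a vocoder on a new speaker's data might look like the following; all paths are placeholders, and the flags should be double-checked against that toolkit's README.

```bash
# Minimal sketch: training a Parallel WaveGAN vocoder on a new
# speaker with kan-bayashi/ParallelWaveGAN (assumption: the pwg_*
# checkpoints come from that toolkit). All paths are placeholders.
pip install parallel_wavegan

# Extract acoustic features from the new speaker's wav files.
parallel-wavegan-preprocess \
    --config conf/parallel_wavegan.v1.yaml \
    --rootdir wavs/ASI \
    --dumpdir dump/ASI/raw

# (Compute feature statistics and normalize with the toolkit's
#  parallel-wavegan-compute-statistics / parallel-wavegan-normalize
#  commands before training; see its README for the exact flags.)

# Train the vocoder. Warm-starting from the provided pwg_TXHC
# checkpoint via --pretrain may converge faster than from scratch.
parallel-wavegan-train \
    --config conf/parallel_wavegan.v1.yaml \
    --train-dumpdir dump/ASI/norm/train \
    --dev-dumpdir dump/ASI/norm/dev \
    --outdir exp/pwg_ASI \
    --pretrain downloads/pwg_TXHC/checkpoint.pkl  # hypothetical path
```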

PAAYAS commented 2 weeks ago

Greetings @unilight, could you please provide me with instructions on how to convert the accents of multiple source speakers to one target speaker?

unilight commented 2 weeks ago

It's not quite possible with the functionality provided in this repo.