mmorise / World

A high-quality speech analysis, manipulation and synthesis system
http://www.kisc.meiji.ac.jp/~mmorise/world/english

Male, female voice conversion? #81

Closed mrgloom closed 2 years ago

mrgloom commented 5 years ago

How can I modify the fundamental frequency (F0), spectral envelope, and aperiodicity parameters to convert a male voice to a female one, and vice versa?

JeremyCCHsu commented 5 years ago
  1. The easiest thing one can do is to offset f0. For example, one can first extract f0, sp, ap from male speech using the WORLD vocoder, then set f0_new = f0 + 100 to raise its pitch, and finally synthesize a new voice from f0_new, sp, ap (again, using WORLD).

  2. Offsetting f0 is usually not enough (the resulting voices may sound funny), so a transformation of sp is also required. We usually train a machine learning model to obtain such a complicated transformation. See, for example, a demo and its accompanying paper. This is still an active research topic, and you can find more by searching for "voice conversion".

  3. There also exist other voice conversion methods that do not require a predefined vocoder, but I'll stop here because that deviates from your question.
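The f0 offset in step 1 can be sketched as follows. This is a minimal illustration, not code from the WORLD project: the function names and the sample contour are hypothetical. It assumes f0 is a per-frame sequence in Hz in which unvoiced frames are reported as 0, so those frames are left untouched rather than shifted to 100 Hz. With the pyworld Python wrapper (an assumption; the thread does not say which binding is used), f0, sp, ap would typically come from pyworld.wav2world and go back into pyworld.synthesize.

```python
def shift_f0_hz(f0, offset_hz):
    """Add a constant offset in Hz to voiced frames; unvoiced frames stay at 0."""
    return [f + offset_hz if f > 0 else 0.0 for f in f0]


def shift_f0_semitones(f0, semitones):
    """Scale voiced frames by a musical interval; unvoiced frames stay at 0."""
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0]


# Hypothetical f0 contour in Hz, one value per frame; 0.0 marks unvoiced frames.
f0 = [0.0, 110.0, 120.0, 0.0, 100.0]

print(shift_f0_hz(f0, 100.0))      # [0.0, 210.0, 220.0, 0.0, 200.0]
print(shift_f0_semitones(f0, 12))  # [0.0, 220.0, 240.0, 0.0, 200.0]
```

Scaling by a ratio is often the safer variant when lowering pitch, because a negative offset in Hz can push low frames toward or below zero, whereas a ratio keeps voiced frames positive.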

mrgloom commented 5 years ago

I tried adding a constant to F0, and it sounds like a pitch shift (ffmpeg -i man.wav -af asetrate=48000*3/4,atempo=4/3 man_pitch_down.wav), maybe better.
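For reference, the arithmetic behind that ffmpeg command can be worked out as below. This sketch assumes man.wav is sampled at 48 kHz: asetrate reinterprets the samples at 3/4 of that rate, which lowers pitch and slows playback by the same factor, and atempo=4/3 then restores the original duration without changing pitch. The function name is hypothetical.

```python
import math


def asetrate_pitch_shift(rate_factor):
    """Return (pitch shift in semitones, atempo factor that restores duration)."""
    semitones = 12.0 * math.log2(rate_factor)
    restore_tempo = 1.0 / rate_factor
    return semitones, restore_tempo


semitones, tempo = asetrate_pitch_shift(3 / 4)
print(f"{semitones:.2f} semitones, atempo={tempo:.4f}")
# -4.98 semitones, atempo=1.3333
```

So the command lowers the voice by roughly five semitones. Unlike an f0-only change in WORLD, this kind of resampling moves the whole spectrum, formants included, by the same factor.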

Actually, I'm looking for some 'presets' for f0, sp, and ap that work on average and may produce 'funny' voices, something like https://www.voicemod.net
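One way such a 'preset' might be structured, as a rough sketch only: a pitch ratio applied to f0 plus a frequency-axis stretch applied to each frame of sp, with ap passed through unchanged. Everything here is an assumption for illustration: the names warp_envelope, apply_preset and PRESETS are hypothetical, the preset numbers are untuned placeholders, and linear interpolation of envelope values is a simplification. It assumes sp holds one list of frequency bins per frame and that unvoiced frames have f0 = 0.

```python
def warp_envelope(frame, alpha):
    """Stretch one spectral-envelope frame along the frequency axis.

    alpha > 1 moves spectral peaks (formants) up; alpha < 1 moves them down.
    Values between bins are linearly interpolated.
    """
    n = len(frame)
    warped = []
    for k in range(n):
        src = k / alpha              # fractional source bin for output bin k
        lo = int(src)
        if lo >= n - 1:
            warped.append(frame[-1])  # beyond the top bin: hold the last value
        else:
            t = src - lo
            warped.append((1 - t) * frame[lo] + t * frame[lo + 1])
    return warped


# Hypothetical presets: untuned placeholder numbers, for illustration only.
PRESETS = {
    "higher": {"f0_ratio": 1.8, "formant_ratio": 1.15},
    "lower": {"f0_ratio": 0.6, "formant_ratio": 0.85},
}


def apply_preset(f0, sp, preset):
    """Scale voiced f0 frames and warp every envelope frame; ap is left alone."""
    new_f0 = [f * preset["f0_ratio"] if f > 0 else 0.0 for f in f0]
    new_sp = [warp_envelope(frame, preset["formant_ratio"]) for frame in sp]
    return new_f0, new_sp


# Toy envelope with a single peak at bin 2; stretching by 1.5 moves it to bin 3.
frame = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
warped = warp_envelope(frame, 1.5)
print(warped.index(max(warped)))  # 3
```

The modified f0 and sp, together with the original ap, would then be handed back to the vocoder for synthesis. Whether any particular pair of ratios sounds good is something to find by listening.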

JeremyCCHsu commented 5 years ago

Thanks for sharing. Voice morphing can be achieved via signal processing techniques (such as the resampling you mentioned above). A vocoder is not necessary in most cases. You might want to check out PSOLA, WSOLA, etc. I am not familiar with how those presets were made (probably by trial and error with a voice morphing toolkit). If anyone knows, we'd be grateful for comments.

tranctan commented 4 years ago

Hi, can I also ask more about the ap feature? Why don't we need to convert ap along with f0 and sp for voice conversion? Thank you.

neverix commented 4 years ago

@tranctan Hello, my basic understanding is that the aperiodicity is the "noise" in the speech model. It is produced by the air blowing through the mouth, which is the same for all people. In other words, only f0 and sp contain voice characteristics.

tranctan commented 4 years ago

@neverix Wow, thanks for your kind reply!

I actually did some quick research on aperiodicity, and it seems to relate to something called "mixed excitation", but I got stuck on that term too. One more thing: I don't know whether aperiodicity relates to prosody (how we speak) or not.

neverix commented 4 years ago

@tranctan According to Wikipedia, prosody consists of rhythm (the t variable), pitch (f0), loudness, and timbre (entangled inside sp). So if you want to do voice conversion with prosody modification, changing ap itself should be unnecessary. However, note that I don't know the specifics of how WORLD works, so ap might contain something other than the classical harmonics-to-noise ratio and require modification. Off-topic: I'm working on a similar project right now; you can e-mail me if you want to collaborate.
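The terms raised in this exchange (aperiodicity as "noise", mixed excitation, harmonics-to-noise ratio) can be tied together with a small sketch. Mixed excitation generally means the synthesis source is a blend of a periodic pulse train and noise, with the blend allowed to differ across frequency. The code assumes ap is a per-bin amplitude ratio between 0 (fully periodic) and 1 (fully noise) and that sp is a power spectrum, hence the square; whether WORLD weights the two parts exactly this way is an assumption to check against the WORLD source. Function names are hypothetical.

```python
import math


def split_power(sp_bin, ap_bin):
    """Split one power-spectrum bin into (periodic power, noise power).

    Assumes ap_bin is an amplitude ratio in [0, 1], so it is squared
    before being applied to a power value.
    """
    noise = sp_bin * ap_bin ** 2
    return sp_bin - noise, noise


def hnr_db(ap_bin):
    """Harmonics-to-noise ratio in dB implied by that split.

    Only defined for 0 < ap_bin < 1.
    """
    return 10.0 * math.log10((1.0 - ap_bin ** 2) / ap_bin ** 2)


print(split_power(2.0, 0.5))    # (1.5, 0.5)
print(round(hnr_db(0.5), 2))    # 4.77
```

Under this reading, ap describes how breathy or noisy each frequency region is rather than how the utterance is timed or intoned, which fits the view above that prosody lives mainly in timing, f0 and loudness.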

Aksh97 commented 2 years ago

@JeremyCCHsu, can you please let me know of any other ways to convert female to male and vice versa?

JeremyCCHsu commented 2 years ago

For anyone interested in voice conversion, my recommendation is as follows:

  1. Read the summary article from the biennial Voice Conversion Challenge (e.g., 2020's). You'll learn what the main issues (limitations) are, what obstacles current techniques face, and how far the state of the art can go. This should give you a good start, and you can follow the work of the cited teams (Google Scholar and arXiv are your friends). After tracing a few more recently published articles, you'll get a good idea of what has changed between the last VCC and the status quo.

  2. Check out some open-source baseline systems (e.g., VCC2020's) and familiarize yourself with contemporary systems.

The field is changing fast and some of the old suggestions have become obsolete (for example, fewer people choose to model pitch contours nowadays). I hope this message will benefit someone in the future.