Closed — mrgloom closed this issue 2 years ago
The easiest thing one can do is to offset `f0`. For example, one can first extract `f0`, `sp`, `ap` from a male speech using the WORLD vocoder, then do `f0_new = f0 + 100` to raise its pitch, and finally synthesize a new voice using `f0_new`, `sp`, `ap` (again, using WORLD).
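A minimal sketch of the offset step (the WORLD analysis/synthesis calls are shown as comments, assuming the `pyworld` Python bindings; the array manipulation itself runs standalone). One subtlety worth noting: WORLD marks unvoiced frames with `f0 = 0`, and those should stay at zero.

```python
import numpy as np

# With the pyworld bindings one would first run:
#   f0, sp, ap = pyworld.wav2world(x, fs)
# Here we fake a short f0 track; 0.0 marks unvoiced frames.
f0 = np.array([0.0, 110.0, 115.0, 0.0, 120.0])

# Offset only the voiced frames; shifting the 0.0 unvoiced markers
# would wrongly "voice" silent or noisy frames.
f0_new = np.where(f0 > 0, f0 + 100.0, 0.0)

# ...then resynthesize with:
#   y = pyworld.synthesize(f0_new, sp, ap, fs)
print(f0_new)  # [  0. 210. 215.   0. 220.]
```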
Offsetting `f0` is usually not enough (the resulting voice may sound funny), and a transformation of `sp` is thus required.
We usually train a machine learning model to obtain such a complicated transformation.
See, for example, a demo and its accompanying paper.
This is still an active research topic and you can find more by searching voice conversion.
There also exist other voice conversion methods that do not require a predefined vocoder, but I'll stop here because that deviates from your question.
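As a rough, non-learned stand-in for the `sp` transformation, one classic trick is to warp the spectral envelope along the frequency axis (a VTLN-style move that shifts formants). This is only an illustrative sketch; the warping factor below is a guess, not a tuned value:

```python
import numpy as np

def warp_sp(sp, alpha):
    """Warp each spectral-envelope frame along frequency by factor alpha.

    new[k] = old[alpha * k], so alpha < 1 stretches the envelope upward
    (formants move up, toward a more female-sounding voice) and
    alpha > 1 compresses it downward. Out-of-range bins are clamped
    to the last value by np.interp.
    """
    n_bins = sp.shape[1]
    bins = np.arange(n_bins)
    return np.stack([np.interp(bins * alpha, bins, frame) for frame in sp])

# Toy example: 2 frames x 8 frequency bins of a decaying envelope.
sp = np.tile(np.linspace(1.0, 0.1, 8), (2, 1))
sp_warped = warp_sp(sp, 0.85)  # shift formants up (male -> female guess)
```

A learned model effectively discovers a much more flexible, speaker-dependent version of this mapping.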
I have tried adding a constant to F0 and it sounds like a pitch shift (`ffmpeg -i man.wav -af asetrate=48000*3/4,atempo=4/3 man_pitch_down.wav`), maybe better.
Actually I'm looking for some 'presets' for `f0`, `sp`, `ap` that work on average and may produce 'funny' voices, something like https://www.voicemod.net
Thanks for sharing. Voice morphing can be achieved via signal processing techniques (such as the resampling you mentioned above); a vocoder is not necessary in most cases. You might want to check out PSOLA, WSOLA, etc. I am not familiar with how those presets were made (probably by trial and error with a voice morphing toolkit). If anyone knows, we'd be grateful for comments.
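The `asetrate` half of the ffmpeg trick above can be sketched with plain resampling: playing samples back at a different rate changes pitch and duration together (the `atempo=4/3` step, which restores the duration, needs an overlap-add method such as WSOLA and is not shown here). A minimal numpy illustration on a pure tone:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220.0 * t)  # 1 s of a 220 Hz tone

# Resample by 3/4: the waveform is stretched in time, so when played
# back at the original rate the pitch drops to 220 * 3/4 = 165 Hz
# (and the clip becomes 4/3 as long).
ratio = 3 / 4
n_out = int(len(x) / ratio)
x_down = np.interp(np.arange(n_out) * ratio, np.arange(len(x)), x)

# Rough pitch check on the first second: half the sign changes per
# second approximates the tone's frequency.
zc = np.sum(np.diff(np.signbit(x_down[:fs])) != 0)
print(zc / 2)  # roughly 165
```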
Hi, can I also ask more about the `ap` feature: why don't we need to convert `ap` along with `f0` and `sp` for voice conversion? Thank you.
@tranctan Hello, My basic understanding is that the aperiodicity is the "noise" in the speech model. It is produced by the air blowing through the mouth, which is the same for all people. In other words, only f0 and sp contain voice characteristics.
@neverix Wow, thanks for your kind reply!
I actually did some quick research on aperiodicity and it seems to relate to something called "mixed excitation", but I got stuck on that term as well. One more thing: I don't know whether aperiodicity relates to prosody (how we speak) or not.
@tranctan According to Wikipedia, prosody consists of rhythm (the `t` variable), pitch (`f0`), loudness, and timbre (entangled inside `sp`). So if you want to do voice conversion with prosody modification, changing `ap` itself should be unnecessary.
However, note that I don't know the specifics of how WORLD works, so `ap` might contain something other than the classical harmonics-to-noise ratio and require modification.
Off-topic: I'm working on a similar project right now, you can e-mail me if you want to collaborate.
@JeremyCCHsu, can you please let me know any other ways to convert female to male and vice versa?
For anyone interested in voice conversion, my recommendation is as follows:
Read the summary article from the biennial Voice Conversion Challenge (e.g., the 2020 edition). You'll get to know what the main issues (limitations) are, what obstacles current techniques face, and what the state of the art can achieve. This should give you a good start, and you can follow the work of the cited teams (Google Scholar and arXiv are your good friends). After tracing a few more recently published articles, you'll get a good idea of what has changed between the last VCC and the status quo.
Check out some open-sourced baseline systems (e.g., VCC2020's) and familiarize yourself with contemporary systems.
The field is changing fast and some of the old suggestions have become obsolete (for example, fewer people choose to model pitch contours nowadays). I hope this message will benefit someone in the future.
How to modify fundamental frequency (F0), spectral envelope and aperiodicity parameters to convert male voice to female and vice versa?