Open alirezag opened 6 years ago
Hi, I'm new to WORLD. It is obviously awesome software, but I was wondering how I can use it in real time, since that is the main point of the original paper. I'm doing something like this now, but the result is very choppy: I'm applying the function to 1024 bytes of the stream that I get. Any ideas how I can improve?
Hi Alireza, thanks for the question, it's an interesting one. ^^ Would you mind elaborating on how your result sounds? Honestly, I have never tried my version in real time, so I'm not sure how it behaves. ^^ My guess is that it's because WORLD uses pitch-synchronous windows: when the pitch is low, the corresponding window is long, and I'm not sure what happens when the window length exceeds the input length. Another thing is that Python is not as fast as C in for loops, so my Python version is slower than the original C version. The Harvest module for F0 extraction is the slowest one, while the other modules are quite fast, so we can use a faster F0 extraction method instead (e.g. set f0_method='dio'). In short, I haven't thought about real-time processing yet. ^^ I will take a look at the original work to see what they did. ^^
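For reference, here is a rough way to time the extractors against each other. It's only a sketch: the import path is an assumption about the package layout, and main.World / encode are used as in the code later in this thread.

import time
import numpy as np
from world import main  # import path assumed; adjust to your checkout

vocoder = main.World()
fs = 16000
x = np.random.randn(fs)  # one second of noise, just to compare timings

for method in ('harvest', 'dio'):
    t0 = time.perf_counter()
    vocoder.encode(fs, x, f0_method=method)
    print(method, time.perf_counter() - t0)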
Hmm, thanks for the comment. It looks like when I switch to dio it is fast enough for real time, but the quality of the reconstruction is not good. I don't really know much about acoustics and voice, so if you have ideas about how to improve it that would be great. My guess is that running the algorithm on 2048-sample chunks and then piecing them together doesn't work very well, probably because at the boundary of each chunk there are residuals that don't match up nicely with the succeeding chunk (see the overlap sketch after the loop code below).
Here is the new code:
def apply(vocoder, fs, x_int16):
    # fs, x_int16 = wavread(wav_path)
    x = x_int16 / (2 ** 15 - 1)  # scale int16 samples to [-1, 1]
    # analysis
    dat = vocoder.encode(fs, x, f0_method='dio')  # or f0_method='harvest', is_requiem=False for requiem analysis/synthesis
    if 1:  # global pitch scaling
        dat = vocoder.scale_pitch(dat, 1)
    if 0:  # global duration scaling
        dat = vocoder.scale_duration(dat, 2)
    if 0:  # fine-grained duration modification
        vocoder.modify_duration(dat, [1, 1.5], [0, 1, 3, -1])  # TODO: look into this
    dat = vocoder.decode(dat)
    return (dat['out'] * (2 ** 15 - 1)).astype(np.int16)
Here is how I loop over the audio:
while data:
    # unpack the raw int16 data
    x_int16 = np.array(wave.struct.unpack("%dh" % (len(data) // swidth), data))
    # apply the conversion
    rx_int16 = apply(vocoder, fs, x_int16)
    # pack the reconstructed data
    rdata = wave.struct.pack("%dh" % len(rx_int16), *list(rx_int16))
    # write to the audio stream
    stream.write(rdata)
    # write to file
    fw.writeframes(rdata)
    # read the next chunk
    data = f.readframes(chunk)
    x_int16 = np.frombuffer(data, dtype=np.int16)  # np.fromstring is deprecated
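If the boundary residuals are the problem, one fix I may try is to keep a little shared audio between neighbouring chunks and crossfade the overlapping outputs instead of hard-concatenating them. This is only an untested sketch: the 256-sample overlap is arbitrary, and each input chunk would have to be read with `overlap` extra trailing samples so neighbouring chunks actually share audio.

import numpy as np

def stitch(prev_out, cur_out, overlap=256):
    # prev_out and cur_out come from input chunks that share `overlap` samples;
    # blend the shared region and return the samples that are safe to emit
    fade = np.linspace(0.0, 1.0, overlap)
    cur = cur_out.astype(np.float64)
    cur[:overlap] = prev_out[-overlap:] * (1.0 - fade) + cur[:overlap] * fade
    return cur[:-overlap].astype(np.int16)  # hold back the tail for the next blend

This assumes the decoded chunk has roughly the same length as the input chunk, which may not hold exactly for WORLD, so the slicing would need adjusting.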
Note that I'm just using the pitch shifter and not actually changing the pitch (scale factor 1); I'm only trying to see whether the approach of applying this to chunks works at all. I really need to apply this in real time.
Here is the original audio:
Here is the reconstructed audio:
> Hmm, thanks for the comment. It looks like when I switch to dio it is fast enough for real time, but the quality of the reconstruction is not good.
You're right that DIO doesn't work as well as the Harvest algorithm. The reason is that DIO sometimes misclassifies voiced/unvoiced (V/UV) frames; in unvoiced frames, F0 is 0 and the excitation signal is set to noise. Would you mind trying f0_method='swipe'? I realized Harvest is slow and DIO is not as good, so I added support for another algorithm called SWIPE. Hopefully it helps.
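As a quick way to see how differently the extractors classify frames, something like this should work; I'm assuming here that the dictionary returned by encode exposes the F0 track under the 'f0' key, with zero meaning unvoiced:

import numpy as np

for method in ('dio', 'harvest', 'swipe'):
    dat = vocoder.encode(fs, x, f0_method=method)
    f0 = np.asarray(dat['f0'])  # 'f0' key assumed for the analysis dict
    print(method, 'unvoiced fraction:', np.mean(f0 == 0))

A much higher unvoiced fraction for dio than for harvest on the same file would point to exactly the V/UV misclassification described above.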
> My guess is that running the algorithm on 2048-sample chunks and then piecing them together doesn't work very well, probably because at the boundary of each chunk there are residuals that don't match up nicely with the succeeding chunk.
Mmm, I will take a look at the program and come back soon.
Thanks @tuanad121. This is the error I get when I switch to swipe:
File "c:\github-temmp\Python-WORLD\world\cheaptrick.py", line 88, in calculate_windowed_waveform
half_window_length = int(1.5 * fs / f0 + 0.5)
ValueError: cannot convert float NaN to integer
I looked into it: f0 is coming up as NaN. The swipe() function returns an array of NaNs for the f0 field of its return value. The input file is the test-mwm.zip that I included above.
Here is how I'm calling the vocoder:
wav_path = Path('get-samples/test-mwm.wav')
fs, x_int16 = wavread(wav_path)
x = x_int16 / (2 ** 15 - 1)
vocoder = main.World()
# analysis
dat = vocoder.encode(fs, x, f0_method='swipe', is_requiem=False)  # requiem analysis/synthesis disabled
> I looked into it: f0 is coming up as NaN. The swipe() function returns an array of NaNs for the f0 field of its return value. The input file is the test-mwm.zip that I included above.
Thanks @alirezag for pointing it out. I have fixed the problem in SWIPE. Basically, SWIPE uses NaN to mark unvoiced frames, while WORLD uses zeros; I had failed to set the NaNs to zero in the SWIPE output.
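If you are on an older checkout and hit the ValueError above, the fix amounts to zeroing the NaNs in the analysis result before synthesis (assuming the F0 track sits under the 'f0' key of the encode output):

import numpy as np

f0 = np.asarray(dat['f0'], dtype=np.float64)
f0[np.isnan(f0)] = 0.0  # SWIPE marks unvoiced frames with NaN; WORLD expects 0 there
dat['f0'] = f0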