Open alirezag opened 6 years ago
Hi, I'm new to WORLD. It is obviously awesome software, but I was wondering how I can use it in real time, since that is the main point of the original paper. I'm doing something like this now, but the result is very choppy: I'm applying the function to 1024 bytes of the stream that I get. Any ideas how I can improve?
Hi Alireza, thanks for the question, it's an interesting one. ^^ Would you mind elaborating on how your result sounds? Honestly, I have never tried my version in real time, so I'm not sure how it behaves. ^^ My guess is that it's because WORLD uses pitch-synchronous windows: when the pitch is low, the corresponding window is long, and I'm not sure what happens when the window length exceeds the input length. Another thing is that Python is not as fast as C in for loops, so my Python version is slower than the original C version. The Harvest module for F0 extraction is the slowest one, while the other modules are quite fast, so we can use a faster F0 extraction method instead (e.g. set f0_method='dio'). In short, I haven't thought about real-time processing yet. ^^ I will take a look at the original work to see what they did. ^^
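For reference, here is a rough way to time the extractors against each other. It's only a sketch: the import path is an assumption about the package layout, and main.World / encode are used as in the code later in this thread.

import time
import numpy as np
from world import main  # import path assumed; adjust to your checkout

vocoder = main.World()
fs = 16000
x = np.random.randn(fs)  # one second of noise, just to compare timings

for method in ('harvest', 'dio'):
    t0 = time.perf_counter()
    vocoder.encode(fs, x, f0_method=method)
    print(method, time.perf_counter() - t0)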
Hmm, thanks for the comment. It looks like when I switch to dio it is fast enough for real time, but the quality of the reconstruction is not good. I don't really know much about acoustics and voice, so if you have ideas about how to improve it that would be great. My guess is that running the algorithm on 2048-sample chunks and then piecing them together doesn't work very well, probably because at the boundary of each chunk there are residuals that don't match up nicely with the succeeding chunk (see the overlap sketch after the loop code below).
Here is the new code:
def apply(vocoder, fs, x_int16):
    # fs, x_int16 = wavread(wav_path)
    x = x_int16 / (2 ** 15 - 1)  # scale int16 samples to [-1, 1]
    # analysis
    dat = vocoder.encode(fs, x, f0_method='dio')  # or f0_method='harvest', is_requiem=False for requiem analysis/synthesis
    if 1:  # global pitch scaling
        dat = vocoder.scale_pitch(dat, 1)
    if 0:  # global duration scaling
        dat = vocoder.scale_duration(dat, 2)
    if 0:  # fine-grained duration modification
        vocoder.modify_duration(dat, [1, 1.5], [0, 1, 3, -1])  # TODO: look into this
    dat = vocoder.decode(dat)
    return (dat['out'] * (2 ** 15 - 1)).astype(np.int16)
Here is how I loop over the audio:
while data:
    # unpack the raw int16 data
    x_int16 = np.array(wave.struct.unpack("%dh" % (len(data) // swidth), data))
    # apply the conversion
    rx_int16 = apply(vocoder, fs, x_int16)
    # pack the reconstructed data
    rdata = wave.struct.pack("%dh" % len(rx_int16), *list(rx_int16))
    # write to the audio stream
    stream.write(rdata)
    # write to file
    fw.writeframes(rdata)
    # read the next chunk
    data = f.readframes(chunk)
    x_int16 = np.frombuffer(data, dtype=np.int16)  # np.fromstring is deprecated
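If the boundary residuals are the problem, one fix I may try is to keep a little shared audio between neighbouring chunks and crossfade the overlapping outputs instead of hard-concatenating them. This is only an untested sketch: the 256-sample overlap is arbitrary, and each input chunk would have to be read with `overlap` extra trailing samples so neighbouring chunks actually share audio.

import numpy as np

def stitch(prev_out, cur_out, overlap=256):
    # prev_out and cur_out come from input chunks that share `overlap` samples;
    # blend the shared region and return the samples that are safe to emit
    fade = np.linspace(0.0, 1.0, overlap)
    cur = cur_out.astype(np.float64)
    cur[:overlap] = prev_out[-overlap:] * (1.0 - fade) + cur[:overlap] * fade
    return cur[:-overlap].astype(np.int16)  # hold back the tail for the next blend

This assumes the decoded chunk has roughly the same length as the input chunk, which may not hold exactly for WORLD, so the slicing would need adjusting.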
Note that I'm just using the pitch shifter and not actually changing the pitch (scale factor 1); I'm only trying to see whether the approach of applying this to chunks works at all. I really need to apply this in real time.
Here is the original audio:
Here is the reconstructed audio:
> Hmm, thanks for the comment. It looks like when I switch to dio it is fast enough for real time, but the quality of the reconstruction is not good.
You're right that DIO doesn't work as well as the Harvest algorithm. The reason is that DIO sometimes misclassifies voiced/unvoiced (V/UV) frames; in unvoiced frames, F0 is 0 and the excitation signal is set to noise. Would you mind trying f0_method='swipe'? I realized Harvest is slow and DIO is not as good, so I added support for another algorithm called SWIPE. Hopefully it helps.
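As a quick way to see how differently the extractors classify frames, something like this should work; I'm assuming here that the dictionary returned by encode exposes the F0 track under the 'f0' key, with zero meaning unvoiced:

import numpy as np

for method in ('dio', 'harvest', 'swipe'):
    dat = vocoder.encode(fs, x, f0_method=method)
    f0 = np.asarray(dat['f0'])  # 'f0' key assumed for the analysis dict
    print(method, 'unvoiced fraction:', np.mean(f0 == 0))

A much higher unvoiced fraction for dio than for harvest on the same file would point to exactly the V/UV misclassification described above.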
> My guess is that running the algorithm on 2048-sample chunks and then piecing them together doesn't work very well, probably because at the boundary of each chunk there are residuals that don't match up nicely with the succeeding chunk.
Mmm, I will take a look at the program and come back soon.
Thanks @tuanad121. This is the error I get when I switch to swipe:
File "c:\github-temmp\Python-WORLD\world\cheaptrick.py", line 88, in calculate_windowed_waveform
half_window_length = int(1.5 * fs / f0 + 0.5)
ValueError: cannot convert float NaN to integer
I looked into it: f0 is coming up as NaN. The swipe() function returns an array of NaNs for the f0 field of its return value. The input file is the test-mwm.zip that I included above.
Here is how I'm calling the vocoder:
wav_path = Path('get-samples/test-mwm.wav')
fs, x_int16 = wavread(wav_path)
x = x_int16 / (2 ** 15 - 1)
vocoder = main.World()
# analysis
dat = vocoder.encode(fs, x, f0_method='swipe', is_requiem=False)  # requiem analysis/synthesis disabled
> I looked into it: f0 is coming up as NaN. The swipe() function returns an array of NaNs for the f0 field of its return value. The input file is the test-mwm.zip that I included above.
Thanks @alirezag for pointing it out. I have fixed the problem in SWIPE. Basically, SWIPE uses NaN to mark unvoiced frames, while WORLD uses zeros; I had failed to set the NaNs to zero in the SWIPE output.
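If you are on an older checkout and hit the ValueError above, the fix amounts to zeroing the NaNs in the analysis result before synthesis (assuming the F0 track sits under the 'f0' key of the encode output):

import numpy as np

f0 = np.asarray(dat['f0'], dtype=np.float64)
f0[np.isnan(f0)] = 0.0  # SWIPE marks unvoiced frames with NaN; WORLD expects 0 there
dat['f0'] = f0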