pipilapilayu / TargetSpeakerEnhance

Experimental model based on DPTNet, aiming to extract a specific speaker's voice from noisy audio with high quality at 48kHz.

Extract one voice from audio clip #1

Open prasannapattam opened 7 months ago

prasannapattam commented 7 months ago

I am trying to extract the voice of the hero from an audio clip that also contains other voices and background music.

Can your project do this?

I tried the following:

The extracted output contained all the voices (not just the voice I trained on).

Did I miss anything?

med1844 commented 7 months ago

This is still a WIP project (and I currently have little time to work on it), so yes, I would say this is normal behavior.

For comparison, I have not yet succeeded with 5.5 hours of target-speaker audio plus ~10 hours of noise. The noise is not removed very well, especially tonal noise. The average SI-SNR is only -5.9 dB; if you listen to the result, it sounds like it was mixed with some kind of bitcrush effect.
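For reference, SI-SNR (scale-invariant signal-to-noise ratio) is the metric quoted above; higher is better, and a negative value means the estimate is dominated by error. A minimal NumPy sketch (the function name and `eps` value are my own, not from this repo):

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    # Remove DC offset so the metric is invariant to constant shifts
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the projection is the "signal"
    # part, the residual is the "noise" part
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, rescaling the estimate (e.g. doubling its volume) does not change the score, which is what makes the metric "scale-invariant".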

The current hypothesis is that the model needs to see far more data before we can fine-tune it to get high SI-SNR with little data. Training from scratch on a small amount of data doesn't seem to teach the model much about the target, and I don't have the computational resources required to train a good pretrained model.

For now I would recommend using the MVSEP MDX23C model in UVR5 for general-purpose speech/vocal extraction, then manually filtering out the target speaker's segments. Its SDR improvement is insane.

prasannapattam commented 7 months ago

I am looking for an automated way to extract the voice of the lead actor. How well does DPTNet extract a target voice, and what are the alternatives to it?

med1844 commented 7 months ago

I chose DPTNet because of its high performance on TSE tasks with a relatively low parameter count. But on the hybrid task this repository investigates, which mixes speech enhancement and TSE with only very little data available, its performance is poor.

Regarding lead actor extraction: if you want to detect the lead actor automatically, AFAIK there is no out-of-the-box solution. Most TSE research has been done on 8 kHz and 16 kHz datasets, which makes those models impractical for production use.

However, if the lead actor's voice is never mixed with other speakers, you may try this model (sorry, I couldn't find an English version of either the model or the website). I have never used it, but according to the description you should be able to find a speaker ID (or something similar) for each sentence in the inference result. You could then pick the speaker with the longest total speech duration as the lead actor, if that's what "lead" means, and use the start and end timestamps reported by the model to extract that speaker's audio.
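The longest-total-duration heuristic is easy to sketch once you have per-sentence speaker labels. Assuming the diarization output can be massaged into `(speaker_id, start_sec, end_sec)` tuples (a hypothetical format; check the actual model's output):

```python
from collections import defaultdict

def pick_lead_speaker(segments):
    """segments: list of (speaker_id, start_sec, end_sec) tuples from a
    diarization model. Returns (lead_id, that speaker's (start, end) spans)."""
    totals = defaultdict(float)
    for spk, start, end in segments:
        totals[spk] += end - start  # accumulate speech time per speaker
    lead = max(totals, key=totals.get)  # speaker with the most total speech
    return lead, [(s, e) for spk, s, e in segments if spk == lead]
```

The returned spans can then be used to slice the audio and keep only the lead speaker's sentences.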

If the lead actor's voice is mixed with others, and you don't mind the output being resampled to 8 kHz (only information below 4 kHz would be retained), you can take a look at this Conv-TasNet implementation, which comes with a pretrained model you can use.
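The 4 kHz ceiling follows from the Nyquist criterion: at an 8 kHz sample rate, nothing above half that rate survives. A downsampling sketch using SciPy's polyphase resampler (not part of this repo; the function name is my own):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def downsample_to_8k(audio: np.ndarray, orig_sr: int = 48000) -> np.ndarray:
    """Resample a 1-D signal to 8 kHz with anti-aliasing."""
    g = gcd(8000, orig_sr)
    # resample_poly applies an anti-aliasing low-pass filter, so content
    # above the new Nyquist frequency (4 kHz) is discarded, not folded back
    return resample_poly(audio, 8000 // g, orig_sr // g)
```

For a 48 kHz input this reduces to `resample_poly(audio, 1, 6)`, i.e. one output sample per six input samples after filtering.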

I'm not a professional researcher in this area, so please take my suggestions with a grain of salt.

prasannapattam commented 7 months ago

Thanks for your suggestions. I will try these two models.