Minimum length of input audio segment

jcvasquezc / DisVoice

feature extraction from speech signals

https://disvoice.readthedocs.io/en/latest/

MIT License

Minimum length of input audio segment #19

Closed bwang482 closed 4 years ago

bwang482 commented 4 years ago

Hi, this is a really useful library for extracting interpretable speech features! Thanks!

I want to ask about the minimum length of the input audio that goes into each of the feature extraction functions. It seems for the prosody features, the input has to be longer than 0.6 sec?

        pitchON = np.where(F0!=0)[0]
        dchange = np.diff(pitchON)
        change = np.where(dchange>1)[0]
        iniV = pitchON[0]

Is this the same for the phonation features?

Thanks again.

jcvasquezc commented 4 years ago

In theory, the minimum signal length for feature extraction is 200 ms for glottal, 80 ms for phonation, 160 ms for articulation, 250 ms for prosody, and 500 ms for phonological features.

However, at least for articulation and prosody, you should make sure the file contains silence segments, so that the features associated with them can be computed.

From a practical point of view, I would say the minimum length needed to get robust statistics from an utterance is 0.5 seconds.
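
As a rough illustration of the figures above, a pre-check could compare a clip's duration with the quoted minimum for each feature set before calling the extractor. This is a minimal sketch, not part of DisVoice: the `MIN_DURATION_MS` table and the helper functions are hypothetical names, the thresholds are simply the numbers quoted in this thread, and the check says nothing about whether the file contains silence segments.

```python
import wave

# Theoretical minimum lengths quoted in this thread, in milliseconds.
# This table and the helpers below are hypothetical, not DisVoice API.
MIN_DURATION_MS = {
    "glottal": 200,
    "phonation": 80,
    "articulation": 160,
    "prosody": 250,
    "phonological": 500,
}


def duration_ms(n_samples, sample_rate):
    """Return the duration of a signal in milliseconds."""
    return 1000.0 * n_samples / sample_rate


def long_enough(n_samples, sample_rate, feature_set):
    """True if the signal meets the quoted minimum for `feature_set`."""
    return duration_ms(n_samples, sample_rate) >= MIN_DURATION_MS[feature_set]


def wav_long_enough(path, feature_set):
    """Same check, reading the frame count and sample rate from a WAV file."""
    with wave.open(path, "rb") as wav:
        return long_enough(wav.getnframes(), wav.getframerate(), feature_set)
```

For example, a 0.3 s clip sampled at 16 kHz (4800 samples) passes the check for prosody but not for phonological features.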

bwang482 commented 4 years ago

Thanks very much @jcvasquezc !

I have noticed that extracting glottal features is very time-consuming (less so for phonological features, but I still expect it to take days). Is this normal, or am I missing a setting here? (I am running both feature extractions on RTX GPUs.)

happypanda5 commented 4 years ago

@jcvasquezc

In theory the minimum length of the signals for the feature extraction would be of 200ms for glottal, 80 ms for phonation, 160 ms for articulation, 250 ms for prosody and 500 ms for phonological.

That is interesting information. Can you provide references for these numbers? I would love to read more about them. I recall that openSMILE needs at least 100 ms for prosody, but I could be mistaken.

jcvasquezc commented 4 years ago

@bluemonk482

You are right, computing the glottal features can be very time-consuming because there is an iterative adaptive process that reconstructs the residual signal through an inverse filter. This process is not yet GPU-optimized.

In contrast, the phonological features are less time-consuming because the RNNs that extract the phonological posteriors are trained with GPU support.

jcvasquezc commented 4 years ago

@happypanda5 These numbers come mostly from the implementation point of view. For prosody specifically, it is also better to have longer utterances in order to model the pitch contour, speech rate, duration of pauses, etc. more accurately.

You can find additional information about this topic here:

https://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2016/Belalcazar16-GFP.pdf
https://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2018/Vasquez-Correa18-TAA.pdf
https://gita.udea.edu.co/uploads/1405-Phonet.pdf

bwang482 commented 4 years ago

Sorry to keep nagging you with different questions, @jcvasquezc.

Just a quick question: have you come across the warning below? I wonder if it affects the computation of the prosody features.

/mnt/sdb/Tools/DisVoice/prosody/prosody.py:274: RankWarning: Polyfit may be poorly conditioned
  features=self.prosody_dynamic(audio)

jcvasquezc commented 4 years ago

I have also seen that warning several times. It happens because some of the detected voiced segments are too short to compute certain statistics, namely the polynomial regression of the fundamental frequency for that specific segment. Usually the warning does not affect the computation of the features much.
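
The behaviour described above can be reproduced with NumPy alone. The following is a minimal sketch, not the DisVoice code: `np.polyfit` emits a `RankWarning` when a segment has fewer points than the polynomial has coefficients, and a simple length guard avoids the fit in that case. The function names and the default `degree=5` are assumptions made for illustration; the degree DisVoice actually uses is not stated in this thread.

```python
import warnings

import numpy as np

try:  # NumPy 2.x
    from numpy.exceptions import RankWarning
except ImportError:  # NumPy 1.x
    from numpy import RankWarning


def polyfit_warns(f0_segment, degree=5):
    """Return True if fitting this segment triggers a RankWarning."""
    t = np.arange(len(f0_segment))
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        np.polyfit(t, f0_segment, degree)
    return any(issubclass(w.category, RankWarning) for w in caught)


def fit_f0_contour(f0_segment, degree=5):
    """Fit a polynomial to one voiced segment, or return None if it is
    too short to determine degree + 1 coefficients (hypothetical guard)."""
    f0_segment = np.asarray(f0_segment, dtype=float)
    if len(f0_segment) <= degree:
        return None
    t = np.arange(len(f0_segment))
    return np.polyfit(t, f0_segment, degree)
```

With these helpers, a three-frame voiced segment such as `[120.0, 122.0, 121.0]` triggers the warning in `polyfit_warns`, while `fit_f0_contour` skips it and returns `None`.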

bwang482 commented 4 years ago

Thanks!