Closed bwang482 closed 4 years ago
Hi
In theory, the minimum signal length for feature extraction is 200 ms for glottal, 80 ms for phonation, 160 ms for articulation, 250 ms for prosody, and 500 ms for phonological features.
However, at least for articulation and prosody, you should make sure the file contains silence segments, so that the features associated with them can be computed.
From a practical point of view, I would say the minimum length needed to get robust statistics from an utterance is 0.5 seconds.
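For anyone who wants to screen files before extraction, here is a minimal sketch of a duration check based on the numbers above. It is not part of DisVoice: the names `MIN_DURATION_S`, `PRACTICAL_MIN_S`, `feature_sets_supported`, and `long_enough_for_robust_stats` are hypothetical, and the check only looks at total length, not at whether the file contains silence segments.

```python
# Minimum durations in seconds, taken from the reply above.
# Hypothetical helper names; this is not DisVoice API.
MIN_DURATION_S = {
    "glottal": 0.200,
    "phonation": 0.080,
    "articulation": 0.160,
    "prosody": 0.250,
    "phonological": 0.500,
}

# Practical lower bound suggested above for robust utterance statistics.
PRACTICAL_MIN_S = 0.5


def duration_s(n_samples, sample_rate):
    """Return the signal duration in seconds."""
    return n_samples / float(sample_rate)


def feature_sets_supported(n_samples, sample_rate):
    """List the feature sets whose theoretical minimum length is met."""
    dur = duration_s(n_samples, sample_rate)
    return [name for name, min_s in MIN_DURATION_S.items() if dur >= min_s]


def long_enough_for_robust_stats(n_samples, sample_rate):
    """True if the signal reaches the practical 0.5 s recommendation."""
    return duration_s(n_samples, sample_rate) >= PRACTICAL_MIN_S
```

With a 16 kHz recording, `feature_sets_supported(4000, 16000)` covers a 0.25 s signal, which meets the theoretical minimum for everything except the phonological features but still falls short of the practical 0.5 s recommendation.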
Thanks very much @jcvasquezc !
I have noticed that extracting glottal features is very time-consuming (less so for phonological features, but I still expect it to take days). Is this normal, or am I missing a setting here? (I am running both feature extractions on RTX GPUs.)
@jcvasquezc
> In theory the minimum length of the signals for the feature extraction would be of 200ms for glottal, 80 ms for phonation, 160 ms for articulation, 250 ms for prosody and 500 ms for phonological.

That is interesting information. Can you provide references for these numbers? I would love to read more about them. I recall that openSMILE needs at least 100 ms for prosody, but I could be mistaken.
@bluemonk482
You are right, computing the glottal features can be very time-consuming because an iterative adaptive process reconstructs the residual signal through an inverse filter. This process is not yet GPU-optimized.
In contrast, the phonological features are less time-consuming because the RNNs that extract the phonological posteriors have GPU support.
@happypanda5 These numbers come more from the implementation point of view. For prosody specifically, longer utterances are better for modeling the pitch contour, speech rate, duration of pauses, etc.
You can find additional information about this topic here:
https://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2016/Belalcazar16-GFP.pdf
https://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2018/Vasquez-Correa18-TAA.pdf
https://gita.udea.edu.co/uploads/1405-Phonet.pdf
Sorry to keep nagging you with different questions, @jcvasquezc.
Just a quick question: have you come across the warning below? I wonder whether it affects the computation of the prosody features.
/mnt/sdb/Tools/DisVoice/prosody/prosody.py:274: RankWarning: Polyfit may be poorly conditioned
features=self.prosody_dynamic(audio)
Hi
I have also seen that warning several times. It happens because some of the detected voiced segments are too short to compute certain statistics, specifically the polynomial regression of the fundamental frequency for that segment. Usually the warning does not significantly affect the computation of the features.
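To illustrate the mechanism, here is a minimal sketch that reproduces the warning with plain NumPy. It is not the DisVoice code: the function name `fit_f0_contour`, the polynomial degree of 5, and the F0 values are assumptions chosen for illustration. The only library behavior relied on is that `np.polyfit` emits a `RankWarning` when a segment has fewer points than the fit has coefficients.

```python
import warnings

import numpy as np

try:  # NumPy >= 2.0 keeps RankWarning in numpy.exceptions
    from numpy.exceptions import RankWarning
except ImportError:  # older NumPy versions expose it at the top level
    from numpy import RankWarning


def fit_f0_contour(f0_segment, degree=5):
    """Fit a polynomial to one voiced segment (hypothetical helper).

    Returns (coefficients, warned), where `warned` is True if NumPy
    reported that the fit was poorly conditioned.
    """
    x = np.arange(len(f0_segment))
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is recorded
        coeffs = np.polyfit(x, f0_segment, degree)
    warned = any(issubclass(w.category, RankWarning) for w in caught)
    return coeffs, warned


# Made-up F0 values in Hz: 3 frames cannot determine 6 coefficients.
short_segment = [120.0, 122.0, 121.0]
# 40 frames are plenty for a degree-5 fit.
long_segment = [120.0 + 0.5 * i for i in range(40)]

_, short_warned = fit_f0_contour(short_segment)
_, long_warned = fit_f0_contour(long_segment)
print("short segment warned:", short_warned)
print("long segment warned:", long_warned)
```

A fit still comes back for the short segment, which is consistent with the features being computed despite the warning, but the coefficients for such segments are not uniquely determined.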
Thanks!
Hi, this is a really useful library for extracting interpretable speech features! Thanks!!
I want to ask about the minimum length of the input audio for each of the feature extraction functions. It seems that for the prosody features the input has to be longer than 0.6 sec?
Is this the same for the phonation features?
Thanks again.