alexis-michaud opened this issue 6 years ago
So lately I have been thinking about something very similar in a completely different context (I want to detect the sound of a billiards shot), and one thing that comes to mind is that there's a possible filtering step here. Now that I hear about these phonemic clicks, I'm wondering whether such a preprocessing step can be run in a language-independent manner, or whether additional work would be needed to distinguish phonemic clicks from sounds like this one?
Amanda Miller (Ohio State University) did automatic detection of clicks. The method is reported in section 4 of this paper:
> Because clicks are relatively short in duration and high in amplitude, the tool searches the acoustic signal in 1 ms frames. At each frame, a potential click is detected if the raw signal amplitude exceeds 0.3 Pascal and the Center of Gravity exceeds 1500 Hz. If the region of consecutive frames which passes these filters has a duration less than 20 ms, it is labeled as a click. For MK, the center of gravity cutoff is changed to 1000 Hz and the durations allowed to extend to 25 ms. (These parameters were tuned impressionistically.)
Accuracy is about 65%. My impression is that with a signal processing specialist on board it should be possible to get much higher recall & precision with more elaborate detection, distinguishing the various clicks, even distinguishing billiards shots (and clinking glasses, clinking bracelets...) from linguistic clicks.
The Yongning Na data set does not have billiards shots, and I don't think it contains any clicks. But it has clinking bracelets: the main consultant wears jade bracelets, and they clink quite often in the recordings. This sound should be identifiable: it's easy to pick out by ear, and it stands out visually in the signal.
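For what it's worth, here is a minimal sketch of that frame-based detection in Python (numpy only). The thresholds are the ones quoted above; the 1 ms framing details and the centre-of-gravity computation are my own assumptions about the implementation, and the amplitude test presupposes a signal calibrated in Pascal.

```python
import numpy as np

def detect_clicks(signal, sample_rate,
                  amp_threshold=0.3,     # Pascal (assumes a calibrated signal)
                  cog_threshold=1500.0,  # Hz; 1000.0 for the MK settings
                  max_duration_ms=20.0,  # 25.0 for the MK settings
                  frame_ms=1.0):
    """Flag short, high-amplitude, high-centre-of-gravity regions as clicks.

    Follows the frame-based procedure quoted above; the framing and the
    centre-of-gravity estimate are guesses at the implementation details.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(signal) // frame_len
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)

    passes = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        if np.max(np.abs(frame)) <= amp_threshold:
            continue
        spectrum = np.abs(np.fft.rfft(frame))
        if spectrum.sum() == 0:
            continue
        # Spectral centre of gravity: amplitude-weighted mean frequency.
        cog = np.sum(freqs * spectrum) / np.sum(spectrum)
        passes[i] = cog > cog_threshold

    # Group consecutive passing frames; only short runs count as clicks.
    clicks, start = [], None
    for i, flag in enumerate(np.append(passes, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if (i - start) * frame_ms < max_duration_ms:
                clicks.append((start * frame_len / sample_rate,
                               i * frame_len / sample_rate))
            start = None
    return clicks  # list of (start_time_s, end_time_s) pairs
```

Bracelet clinks and billiards shots would presumably pass the same amplitude/CoG test, so telling them apart from linguistic clicks would need extra cues (duration, spectral shape, position relative to speech).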
Just a thought on this: how feasible would it be to train a model with some of the background noises included as training data for non-speech output tokens? This could be a way to work around cases where there's a very specific background non-speech noise.
Yes, this is a standard approach in ASR. Typically there is a group of non-language symbols that represent different noises; the model learns to transcribe them, and they are then simply removed from the final transcription (a small sketch of this follows the list below).
Examples of such symbols in a Kaldi recipe I was recently working with:
- `<hes>` (hesitations)
- `<noise>`
- `<silence>`
- `<unk>` (unknown word)
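A minimal sketch of how such symbols might be handled downstream (the token names are the Kaldi-recipe examples above; the function and the example sequence are invented for illustration):

```python
# Non-language symbols the model is trained to emit (from the list above).
NON_SPEECH = {"<hes>", "<noise>", "<silence>", "<unk>"}

def strip_non_speech(tokens):
    """Remove non-language symbols from a decoded token sequence."""
    return [t for t in tokens if t not in NON_SPEECH]

# e.g. a decoded token sequence with noise markers interspersed:
decoded = ["<noise>", "b", "o", "<hes>", "n", "<silence>"]
print(strip_non_speech(decoded))  # -> ['b', 'o', 'n']
```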
This issue has some similarities with #39 "Team up with signal processing expert for tests on different audio qualities?" in that it's about a topic that requires work on the signal.
Non-speech events include clicks, filled pauses, sighs, breathing, grunting, whistling (in admiration or surprise)... Their frequency in human communication depends on the setting: from near-zero in carefully read speech (just the slightest sound of breath intake, as inconspicuous as possible), to perhaps as much as half of the information in certain genres / contexts. For Na, filled pauses are indicated in the transcription and included in the training. For other data sets (of other languages) in which non-speech events are not transcribed, a tool that identifies those events could be useful, ideally operating in 'language-independent' mode (i.e. without language-specific training).
It could serve various purposes; just a couple of ideas among many possible topics:
- most languages don't have phonemic clicks (IPA: ʘ ǀ ǃ ǂ ǁ) but all speakers use clicks with readily identifiable meanings in (presumably) all cultures in the world. If those were included in the output of Persephone, it would be a hint to researchers that there's something that can usefully be added to the transcription. If there existed many data sets with clicks transcribed, someone might want to compare the use of clicks in the world's languages using this as part of the empirical basis.
- the main consultant for Yongning Na was recorded from age 56 to the present = over a span of 11 years & counting; colleagues interested in 'voice ageing' (changes in the human voice over the life span) could compare productions recorded every year since 2006 and see how breathing patterns in speech change over the years.
This would be a setting left to the user's choice: a "clean transcription" mode (discarding everything that is not phonemic) or a "full-communicational" mode (including recognition of clicks, giggling/laughing, sighs...). Which events Persephone would recognize, and how they would be represented in the output, requires some thinking, of course.
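Purely as a sketch of what such a user-facing switch could look like (none of these names exist in Persephone; the event tokens are invented placeholders):

```python
from enum import Enum

class TranscriptionMode(Enum):
    CLEAN = "clean"                # phonemic material only
    FULL_COMMUNICATIONAL = "full"  # also clicks, laughter, sighs, ...

# Invented placeholder tokens for non-phonemic events.
NON_PHONEMIC = {"<click>", "<laugh>", "<sigh>", "<breath>"}

def postprocess(tokens, mode=TranscriptionMode.CLEAN):
    """Drop non-phonemic event tokens in 'clean' mode; keep them in 'full' mode."""
    if mode is TranscriptionMode.CLEAN:
        return [t for t in tokens if t not in NON_PHONEMIC]
    return list(tokens)
```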
(Note: as I understand it, the idea of looking at 'non-phonemic' events in the acoustic signal is widespread in the Automatic Speech Recognition literature; specifically, Laurent Besacier drew my attention to this around 2014, and Martine Adda-Decker pointed it out as a possible path for further improvement in her comments on the LREC paper.)