persephone-tools / persephone

A tool for automatic phoneme transcription
Apache License 2.0

Point out problematic passages in the training transcriptions, for exclusion from training set? #113

Open alexis-michaud opened 6 years ago

alexis-michaud commented 6 years ago

One possible scenario for using Persephone (though not the only one) is to use a linguist's field transcriptions to train an acoustic model, then transcribe new (previously untranscribed) audio files.

The quality of the training corpus (the audio files together with their transcriptions) is key to obtaining good results, and some general recommendations can be given to linguists in this respect.

But since the materials were initially not devised for training ASR tools, there is bound to remain a certain portion of data that is not really suitable for use in Persephone.

So wouldn't it be cool if the tool could point out problematic passages in the training corpus? The linguist's attention would be drawn to those passages, and they could then make a decision: either attend to the issue, if a correction is needed (for instance if a time code is wrong or a word is mistranscribed), or exclude the passage, if there is no easy way to fix the problem (for instance if there is much more code-switching between languages than usual, if a speaker whispers for a few seconds, chokes or has a very hoarse voice for a while, or departs in some other way from the 'norm' that the tool needs to assume).

Or those problematic parts could be excluded automagically from the training set on the basis of automatic evaluation & selection of the best data bits.

This would seem especially useful for data sets that reach the 'comfortable' zone in terms of amount of data, allowing the user of Persephone to be picky in the choice of training materials. For instance, Odette Ambouroue has about 10 hours of phonetically transcribed Myene data with homogeneous audio (she did 'respeaking' for about half of it: about 5 hours are thus single-speaker, with overall better audio quality than the field recordings), plus tens of hours of untranscribed materials. So it would make great sense for her to try out Persephone. One could try training Persephone over the whole 10 hours of available training data; but it would also be possible to select just the 5 hours of re-spoken data, and even within that subset to 'pick and choose', excluding those passages that prove to be the hardest nuts to crack (presumably because they differ most from the rest in some respect: speaking rate, expression of emotion...).

'Specifications' for selecting the best bits in a corpus (for inclusion in the optimal training set) could be devised either in 'manual checking' mode or in 'automatic' mode.

'Manual checking' mode would involve automatically listing the problematic passages for the linguist to take action on (this is akin to error analysis, but instead of listing all errors, the cross-validation materials passed on to the linguist would be ranked by Tone Error Rate/Phoneme Error Rate at the sentence level, with the sentences with the highest TER/PER at the top). There would then need to be some not-too-complicated way for the user to mark certain portions of the training corpus for exclusion. (Maybe not so easy to implement in the software.)
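As a purely illustrative sketch (none of this is existing Persephone functionality; the `(utterance_id, reference, hypothesis)` triples are assumed to come from a cross-validation run), ranking sentences worst-first by PER could look something like this in Python:

```python
# Hypothetical sketch: rank cross-validation utterances by per-utterance
# phoneme error rate (PER) so the linguist can review the worst ones first.

def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def per(reference, hypothesis):
    """Per-utterance phoneme error rate."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def rank_by_per(utterances):
    """Return (PER, utterance_id) pairs, worst sentences first."""
    scored = [(per(ref, hyp), utt_id) for utt_id, ref, hyp in utterances]
    return sorted(scored, reverse=True)
```

For example, `rank_by_per([("sent_001", list("ato"), list("ao"))])` scores that sentence at a PER of 1/3 (one deletion out of three reference phonemes).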

'Automatic checking' mode would exclude materials from the training set when TER/PER at cross-validation is above a certain threshold. Some users may prefer this option, leaving it to the software to figure out which bits of the data set are most suitable for training the acoustic model; but it would still be useful to provide the user with a list of the excluded materials, to allow the linguist to think about why those materials were problematic. (There could be various reasons: audio quality, change in speaking style... One man [or software]'s error is another man's 'food for thought' :smile: )
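Continuing the sketch above (the 0.5 threshold is only a placeholder; as discussed below, suitable values would have to be determined experimentally), the automatic mode might boil down to something like:

```python
# Hypothetical continuation of the sketch above: keep or exclude utterances
# based on a per-utterance PER threshold, and return the excluded IDs so the
# linguist can still inspect them.

def split_by_threshold(utterances, per_threshold=0.5):
    """Return (keep_ids, excluded_ids); per() is defined in the sketch above."""
    keep, excluded = [], []
    for utt_id, ref, hyp in utterances:
        (excluded if per(ref, hyp) > per_threshold else keep).append(utt_id)
    return keep, excluded
```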

oadams commented 6 years ago

This is a good idea! There'd have to be some experimentation to determine thresholds for the automatic checking mode. Ideally this should be done across a diverse array of datasets. Sometimes it might be useful to remove sentences with a 50% PER, while at other times it might not. This is an interesting research question in its own right.

I feel a good first step in this direction is to offer the cross-validation functionality as part of the interface, so the linguist can see an ordered list of the most erroneous automatic transcriptions. Analysis from there could guide us towards automatic checking.
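Just to make the idea concrete (again only a sketch, not existing Persephone code): the cross-validation bookkeeping could be as simple as assigning utterances to k folds, training k models that each leave one fold out, and scoring every utterance against the transcription produced by the model that did not see it.

```python
import random

def k_fold_splits(utterance_ids, k=5, seed=0):
    """Assign utterances to k held-out folds. Each fold is transcribed by a
    model trained on the other k - 1 folds, so every utterance ends up with
    an automatic transcription it can be scored (and ranked) against."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]
```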

shuttle1987 commented 6 years ago

While this is out of scope for my current chunk of work, it does remind me of some of the text editor development work I was doing last year: there's some interesting potential to highlight sections of the text based on various filters directly in the GUI. You could have set thresholds for triggering an alternative display of the data, or a more continuous approach that highlights based directly on the probability of error, in much the same way heat maps help visualize data. How this would best fit into the workflow of a linguist is something that would be good to discuss.
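As a toy illustration of the 'heat map' idea (the colour scheme and the assumption of an error score in [0, 1] are mine, not anything from the existing editor work), the continuous approach could map an error estimate to a highlight colour:

```python
def highlight_colour(error_score):
    """Map an error score in [0, 1] to a green-to-red hex colour string,
    heat-map style, for highlighting a span of text in the GUI."""
    error_score = min(max(error_score, 0.0), 1.0)
    red = int(round(255 * error_score))
    green = int(round(255 * (1.0 - error_score)))
    return "#{:02x}{:02x}00".format(red, green)
```

So an utterance with no detected errors would be tinted green (`#00ff00`), one judged entirely wrong red (`#ff0000`), and everything else would shade in between.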