A related issue is: could a collection of acoustic models (with a versioning system) be made available from the Persephone repository?
I remember being told (about 5 years ago) by a colleague working in Natural Language Processing that some companies developing software that requires ASR were looking for acoustic models they could use, without having to go through the expensive task of creating language resources to train models. So maybe a 'gallery' of acoustic models generated by Persephone (with a versioning system) would find users? The models won't be those for 'big' languages (the ones most in demand for commercial applications), but they could find users nonetheless: developers who aim to provide a truly language-independent acoustic model need to cover the entire International Phonetic Alphabet. It seems they are not quite there yet; for instance, the MAUS aligner in 'IPA' mode does not cover clicks & some other sounds. So they might be interested in existing acoustic models that would allow them to extend their coverage of the IPA.
I'm not too hopeful about the precision that can be reached in off-the-shelf 'language-independent' mode, because patterns of articulation and coarticulation are known to differ from language to language: a good acoustic model is a language-specific (& dialect-specific) acoustic model. But a possibility would be to have a language-independent model that could adjust to a dialect & a speaker on the basis of just a few minutes or even seconds of speech, in the same way as humans adjust to the unfamiliar pronunciations of speakers from other dialects in a matter of minutes: almost in 'real time'. Seen from this perspective, trying to put together a big acoustic model covering all the sounds of the world's languages could make sense, no? And acoustic models generated by Persephone could serve as a stepping stone.
Hi Alexis, thanks for the question.
Regarding your original post: the model Persephone uses is fundamentally different from the type of model a tool such as MAUS uses, so it won't be possible to export it to those tools.
There is the possibility of developing a technique to get time alignments from Persephone's models. I'm not sure how precise it would be (hopefully good enough), and while it's something I'd like to do, it's not really something I have time to explore at the moment unfortunately :(. The best course of action for now is probably to use the transcripts and feed them into existing forced aligners.
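For what it's worth, here is a minimal sketch of one way such rough alignments could be extracted: take the frame-level posteriors that a CTC model produces, pick the best symbol per frame, and collapse blanks and repeats into timed segments. Everything here (the `posteriors` array, `labels`, `frame_shift_s`) is a hypothetical interface, not part of Persephone's API, and CTC posteriors are known to be 'peaky' (a phoneme often fires on a single frame), so the boundaries would be rough at best.

```python
import numpy as np

def ctc_rough_alignment(posteriors, labels, frame_shift_s=0.01, blank_id=0):
    """Turn per-frame CTC posteriors into (label, start_s, end_s) segments.

    posteriors: (num_frames, num_symbols) array of per-frame probabilities.
    labels: list mapping symbol indices to phoneme strings.
    """
    best_path = posteriors.argmax(axis=1)  # greedy frame-wise decoding
    segments = []
    prev = blank_id
    for t, sym in enumerate(best_path):
        if sym != blank_id and sym != prev:
            segments.append([labels[sym], t, t + 1])  # open a new segment
        elif sym != blank_id and sym == prev:
            segments[-1][2] = t + 1                   # extend the current one
        prev = sym
    # convert frame indices to seconds
    return [(lab, start * frame_shift_s, end * frame_shift_s)
            for lab, start, end in segments]
```

A more principled route would be forced alignment through the CTC lattice (constraining the best path to the known transcript), but the greedy version above gives the general idea.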
On the other hand, something I can allocate more focus to is what you are talking about in your second post: multilingual models that one can adapt to unseen languages. I'm going to be exploring this in my postdoc work. In doing so I will endeavor to make any such models available online for use by others.
Thanks for the info. I'll try MAUS as well as the Prosodylab-Aligner (from McGill), then. MAUS requires Linux and involves various tasks that I'm not good at 🐢, so I may wait until NIE Ning (Felix), the engineering student who has volunteered for an internship at LACITO, comes over to work with us (late Autumn).
Great to hear about your postdoc plans!
A novice question: How could the acoustic models created with Persephone be exported to other software?
For instance, what kind of software development would it take to be able to export a Na acoustic model to a tool that does forced alignment of transcriptions to audio?
This is something I'd like to do with Na data, for phonetic studies. Doing forced alignment to obtain fairly large corpora aligned at the phoneme level allows for phonetic studies that I find very interesting, for example this one. I'm a newcomer to this field, in which Frank Seifart & others are very active.
There are various tools to do this. Clearly, what I'd like to have is a language-specific, speaker-specific tool, for the highest precision of phoneme-level alignment.
The MAUS team (LMU, Munich) kindly explained:
I should try this with the Na data, but it requires tasks which, although computationally trivial, are time-consuming for me (preprocessing, converting from IPA to SAMPA, finding out about the limitations of the tool & dealing with them...), and since the Persephone / CTC model performs so well on Na data, I was wondering: would there be a way to export it to MAUS, or to McGill's Prosodylab-Aligner, rather than re-train a different model with a different tool (and maybe with a less wonderful level of accuracy)?
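For the record, the IPA-to-SAMPA step at least can be scripted along the following lines: a minimal sketch, assuming a hand-made mapping table. The few X-SAMPA correspondences shown are illustrative only and would need checking & extending to cover the full Na inventory; multi-character IPA symbols must be matched longest-first.

```python
# Illustrative mapping; entries must be verified & extended for the Na inventory.
IPA_TO_XSAMPA = {
    "tɕ": "ts\\",  # alveolo-palatal affricate: two IPA characters, one unit
    "ɲ": "J",
    "ŋ": "N",
    "ɤ": "7",
    "æ": "{",
    "ʈ": "t`",
}

def ipa_to_xsampa(ipa_string, table=IPA_TO_XSAMPA):
    """Greedy longest-match conversion of an IPA string to X-SAMPA."""
    keys = sorted(table, key=len, reverse=True)  # try longest symbols first
    out, i = [], 0
    while i < len(ipa_string):
        for k in keys:
            if ipa_string.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            out.append(ipa_string[i])  # pass unmapped characters through
            i += 1
    return "".join(out)

print(ipa_to_xsampa("ɲæŋ"))  # -> J{N
```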