persephone-tools / persephone

A tool for automatic phoneme transcription
Apache License 2.0

Facilitate use of acoustic model with other tools? #156

Closed alexis-michaud closed 6 years ago

alexis-michaud commented 6 years ago

A novice question: How could the acoustic models created with Persephone be exported to other software?

For instance, what kind of software development would it take to be able to export a Na acoustic model to a tool that does forced alignment of transcriptions to audio?

This is something I'd like to do with Na data, for phonetic studies. Forced alignment yields fairly large corpora aligned at the phoneme level, which enables phonetic studies that I find very interesting, for example this one. I'm a newcomer to this field, in which Frank Seifart & others are very active.

There are various tools to do this. Clearly, what I'd like to have is a language-specific, speaker-specific tool, for highest precision of phoneme-level alignment.

The MAUS team (LMU, Munich) kindly explained:

You'll need a Linux system; download the MAUS software package from the BAS software server and install it (and helper software) on your local system; there is no webservice for the iterative MAUS method. Then you'll need to define the phoneme set of your language and clone a starting HMM for each phoneme; these can be taken from any language, e.g. English. Then you start the iterative MAUS process maus.iter with these initial HMMs and run it until no change regarding the time alignment is seen. The outcome of this process is then an improved time-alignment of the phonetic transcript to your dataset, and an adapted HMM set for future usage (on the same data type, e.g. speaker).
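If it helps to see the shape of the procedure they describe: it boils down to a fixed-point loop, i.e. re-estimate the HMMs, re-align, and stop once the time alignment no longer changes between iterations. A toy sketch of that loop (the `realign` function below is a stand-in stub that merely simulates convergence; it is not the real maus.iter program, and the boundary values are made up):

```python
# Toy sketch of the iterative alignment loop described above.
# `realign` is a hypothetical stub standing in for one round of
# HMM re-estimation + forced alignment; the target boundaries
# (in arbitrary time units) are invented for illustration.
def realign(alignment, target=(10.0, 25.0, 40.0)):
    # Simulate re-estimation pulling each phoneme boundary
    # halfway toward its eventual (converged) position.
    return [b + 0.5 * (t - b) for b, t in zip(alignment, target)]

def iterate_until_stable(alignment, tol=1e-3, max_iters=100):
    """Repeat the align step until no boundary moves more than `tol`."""
    for i in range(1, max_iters + 1):
        new = realign(alignment)
        if max(abs(a - b) for a, b in zip(new, alignment)) < tol:
            return new, i  # alignment stopped changing: converged
        alignment = new
    return alignment, max_iters

final, n_iters = iterate_until_stable([0.0, 0.0, 0.0])
print(final, n_iters)
```

The interesting part is only the stopping criterion ("run it until no change regarding the time alignment is seen"); everything model-specific hides inside the re-alignment step.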

I should try this with the Na data, but it requires tasks which, although computationally trivial, are time-consuming for me (preprocessing, converting from IPA to SAMPA, finding out about the limitations of the tool & dealing with them...) and since the Persephone / CTC model performs so well on Na data, I was wondering: Would there be a way to export it to MAUS, or to McGill's ProsodyLab Aligner, rather than re-train a different model with a different tool (and maybe with a less wonderful level of accuracy)?
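On the IPA-to-SAMPA step: the conversion is mostly a symbol-table lookup, with the one subtlety that multi-character IPA symbols (e.g. aspirated stops) must be matched before their single-character prefixes. A minimal sketch, assuming a tiny illustrative table (the real table would have to cover the full Na phoneme inventory):

```python
# Minimal IPA -> X-SAMPA converter sketch. The mapping table here is a
# small illustrative subset, not the full Na inventory.
IPA_TO_XSAMPA = {
    "tʰ": "t_h",  # aspirated stop: two characters, must match before "t"
    "ʃ": "S",
    "ŋ": "N",
    "ɑ": "A",
    "ɛ": "E",
    "ə": "@",
}

def ipa_to_xsampa(text):
    """Greedy longest-match-first tokenization of an IPA string."""
    symbols = sorted(IPA_TO_XSAMPA, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for s in symbols:
            if text.startswith(s, i):
                out.append(IPA_TO_XSAMPA[s])
                i += len(s)
                break
        else:
            out.append(text[i])  # pass unknown characters through unchanged
            i += 1
    return " ".join(out)

print(ipa_to_xsampa("tʰɑ"))  # -> t_h A
```

Passing unknown characters through (rather than raising) makes it easy to spot which symbols still need mapping entries in the output.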

alexis-michaud commented 6 years ago

A related issue is: could a collection of acoustic models (with a versioning system) be made available from the Persephone repository?

I remember being told (about 5 years ago) by a colleague doing Natural Language Processing that some companies developing software that requires ASR were looking for acoustic models they could use (without having to go through the expensive task of creating language resources to train models). So maybe a 'gallery' of acoustic models generated by Persephone (with a versioning system) would find users? The models won't be those for 'big' languages (most in demand for commercial applications), but could find users nonetheless: developers who aim to provide a truly language-independent acoustic model need to cover the entire International Phonetic Alphabet. It seems that they are not quite there yet; for instance the MAUS aligner in 'IPA' mode does not cover clicks & some other sounds. So they might be interested in existing acoustic models that allow them to extend their coverage of the IPA.

I'm not too hopeful about the precision that can be reached in off-the-shelf 'language-independent' mode, because patterns of articulation and coarticulation are known to differ from language to language: a good acoustic model is a language-specific (& dialect-specific) acoustic model. But a possibility would be to have a language-independent model that could adjust to a dialect & a speaker on the basis of just a few minutes or even seconds of speech, in the same way as humans adjust to the new pronunciations of speakers from other dialects in a matter of minutes: almost 'real-time'. Seen from this perspective, trying to put together a big acoustic model covering all the sounds of the world's languages could make sense, no? And acoustic models generated by Persephone could serve as a stepping-stone.

oadams commented 6 years ago

Hi Alexis, thanks for the question.

Regarding your original post, the model Persephone uses is fundamentally different to the type a tool such as MAUS uses, so it won't be possible to export the model to those tools.

There is the possibility of developing a technique to get time alignments from Persephone's models. I'm not sure how precise it would be (hopefully good enough) and while it's something I'd like to do, it's not really something I have time to explore at the moment unfortunately :(. The best course of action for now is probably to use the transcripts and feed them into existing forced aligners.
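To give a sense of what such a technique might look like (this is a sketch, not an existing Persephone feature): standard CTC greedy decoding takes the per-frame argmax path, merges repeated labels, and drops blanks. Keeping the frame indices of each merged run yields a rough alignment, with the caveat that CTC label peaks are known to drift from true segment boundaries, which is exactly why the precision is uncertain. The blank symbol and frame shift below are assumptions for illustration:

```python
# Sketch: recovering rough time spans from a CTC model's per-frame
# argmax path. Hypothetical; not a Persephone API.
BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_segments(best_path):
    """Collapse a per-frame argmax path into (label, start_frame, end_frame).

    Merges runs of repeated labels and drops blank runs, as in CTC
    greedy decoding, but keeps each run's frame indices.
    """
    segments = []
    start = 0
    for i in range(1, len(best_path) + 1):
        if i == len(best_path) or best_path[i] != best_path[start]:
            if best_path[start] != BLANK:
                segments.append((best_path[start], start, i))
            start = i
    return segments

# Multiply frame indices by the model's frame shift (10 ms is a common
# setting) to get times in seconds.
path = ["_", "n", "n", "_", "_", "a", "a", "a", "_"]
print(ctc_segments(path))  # -> [('n', 1, 3), ('a', 5, 8)]
```

Blank frames between labels are simply discarded here, which is one source of boundary imprecision: the true segment edge may fall anywhere inside the neighbouring blank run.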

On the other hand, something I can allocate more focus to is what you are talking about in your second post: multilingual models that one can adapt to unseen languages. I'm going to be exploring this in my postdoc work. In doing so I will endeavor to make any such models available online for use by others.

alexis-michaud commented 6 years ago

Thanks for the info. I'll try MAUS as well as the Prosodylab-Aligner (from McGill), then. MAUS requires Linux and involves various tasks that I'm not good at 🐢, so I may wait until NIE Ning (Felix), the engineering student who has volunteered for an internship at LACITO, comes over to work with us (late Autumn).

Great to hear about your postdoc plans!