Pynini integration

Hello, I'm the maintainer of the Montreal Forced Aligner (MFA), and I'm currently working on a new Japanese model for speech-to-text alignment. My current prototype uses sudachipy to generate morphemes, post-processes them to create phonological words (e.g., "しちゃって" -> "しちゃって"), and then runs the rest of the forced alignment pipeline as if the generated transcript were ground truth (i.e., it generates utterance FSTs for phone sequences from pronunciation dictionary lookup).

Given that a morphological parser has its own lattice from which the best path is extracted, it'd be nice to use that lattice as the starting point, compose it with an FST that does the post-processing for phonological words, and compose the result with the dictionary. The latest versions of sudachipy don't return lattices or expose any internal methods to Python, so I'm still looking for a permanent solution.
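To make the idea concrete, here is a rough sketch of that pipeline in plain Python (standard library only, deliberately not pynini, so it doesn't claim anything about either library's API). Everything in it is hypothetical: the lattice arcs, the part-of-speech tags, the costs, the "attach auxiliaries and particles to the preceding morpheme" rule, and the phone labels are made up for illustration. In a real implementation each stage would be an FST and the stages would be chained by composition.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    src: int
    dst: int
    morpheme: str
    pos: str     # coarse part-of-speech tag (hypothetical labels)
    cost: float  # lower is better (hypothetical values)


# Hypothetical morpheme lattice for "しちゃって" with competing segmentations.
LATTICE = [
    Arc(0, 1, "し", "verb", 1.0),
    Arc(1, 2, "ちゃっ", "aux", 1.0),
    Arc(2, 3, "て", "particle", 0.5),
    Arc(1, 3, "ちゃって", "noun", 0.2),  # cheap but mis-tagged arc
    Arc(0, 2, "しちゃっ", "verb", 3.0),
]
FINAL = 3

# Tags that attach to the preceding morpheme (toy post-processing rule).
ATTACHING = {"aux", "particle"}

# Toy pronunciation dictionary; the phone labels are illustrative only.
LEXICON = {"しちゃって": ["sh", "i", "ch", "a", "t", "t", "e"]}


def paths(lattice, node, final):
    """Yield every arc sequence from `node` to `final`."""
    if node == final:
        yield []
        return
    for arc in lattice:
        if arc.src == node:
            for rest in paths(lattice, arc.dst, final):
                yield [arc] + rest


def merge(path):
    """Merge morphemes into phonological words using the toy rule."""
    words = []
    for arc in path:
        if words and arc.pos in ATTACHING:
            words[-1] += arc.morpheme
        else:
            words.append(arc.morpheme)
    return words


def best_pronounceable(lattice, final, lexicon):
    """Return (cost, words, phones) for the cheapest path whose merged
    words are all in the lexicon, or None if no path qualifies."""
    best = None
    for path in paths(lattice, 0, final):
        words = merge(path)
        if not all(word in lexicon for word in words):
            continue
        cost = sum(arc.cost for arc in path)
        if best is None or cost < best[0]:
            best = (cost, words, [lexicon[word] for word in words])
    return best


result = best_pronounceable(LATTICE, FINAL, LEXICON)
```

In this toy lattice the cheapest path ends in a word the dictionary doesn't know, so a 1-best pipeline would fail at lookup, while searching the whole lattice falls back to the next-cheapest path that can be pronounced. That is the benefit I'm hoping to get from composing the lattice with the dictionary instead of committing to the best path up front.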

For all of its FSTs, MFA uses pynini, which provides Python bindings for OpenFst (like here). I saw that janome has a pure Python implementation of FSTs, and I was curious whether there's interest in adding a pynini implementation or migrating to one, which should simplify things a lot and allow MFA to use any lattices directly.

If there is interest, I'm happy to put together an initial PR for it!

mocobeta / janome

Pynini integration #121