mocobeta / janome

Japanese morphological analysis engine written in pure Python
https://mocobeta.github.io/janome/en/
Apache License 2.0

Pynini integration #121

Open mmcauliffe opened 11 months ago

mmcauliffe commented 11 months ago

Hello, I'm the maintainer of the Montreal Forced Aligner (MFA), and I'm currently working on a new Japanese model for speech-to-text alignment. My current prototype uses sudachipy to generate morphemes, post-processes these to create phonological words (e.g., "し ちゃっ て" -> "しちゃって"), and then runs the rest of the forced alignment pipeline as if this generated transcript were ground truth (i.e., it generates utterance FSTs for phone sequences from pronunciation dictionary lookup).
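To make the post-processing step concrete, here is a minimal sketch of merging morphemes into phonological words. It is not MFA's actual code: the function name `merge_morphemes`, the `ATTACHING_POS` rule, and the part-of-speech tags in the example are assumptions chosen for illustration, and they are not claimed to match sudachipy's real output.

```python
from typing import List, Tuple

# Hypothetical rule: morphemes with these part-of-speech tags attach to the
# preceding morpheme (auxiliary verbs and particles).
ATTACHING_POS = {"助動詞", "助詞"}


def merge_morphemes(morphemes: List[Tuple[str, str]]) -> List[str]:
    """Merge (surface, pos) pairs into phonological words."""
    words: List[str] = []
    for surface, pos in morphemes:
        if words and pos in ATTACHING_POS:
            # Glue the dependent morpheme onto the word being built.
            words[-1] += surface
        else:
            # Any other morpheme starts a new phonological word.
            words.append(surface)
    return words


# Illustrative input only; the tags are assumed, not taken from a real analysis.
example = [("し", "動詞"), ("ちゃっ", "助動詞"), ("て", "助詞")]
merged = merge_morphemes(example)
```

With the assumed tags above, the three morphemes collapse into the single word "しちゃって". A real rule set would likely need more conditions than a flat tag list.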

Given that a morphological parser has its own lattice from which the best path is extracted, it'd be nice to use that lattice as the starting point, compose it with an FST that does the post-processing for phonological words, and compose the result with the dictionary. The latest versions of sudachipy don't return lattices or expose any internal methods to Python, so I'm still looking for a permanent solution.
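As a rough illustration of why the lattice matters, the pure-Python sketch below contrasts extracting only the best path with keeping every segmentation available for later composition. The edge list, the costs, and the helper names `best_path` and `all_paths` are invented for this example; they do not reflect the internals of sudachipy or Janome, and a real pipeline would represent the lattice as a weighted FST instead.

```python
from typing import Dict, List, Tuple

# One lattice edge: (start, end, surface, cost). Positions are character
# boundaries in the input, and every edge moves forward (end > start).
Edge = Tuple[int, int, str, int]

# Toy lattice over "しちゃって" with made-up costs.
EDGES: List[Edge] = [
    (0, 1, "し", 3),
    (1, 4, "ちゃっ", 4),
    (4, 5, "て", 2),
    (1, 5, "ちゃって", 8),
    (0, 5, "しちゃって", 12),
]


def best_path(edges: List[Edge], n: int) -> Tuple[int, List[str]]:
    """Return the lowest-cost segmentation reaching position n."""
    best: Dict[int, Tuple[int, List[str]]] = {0: (0, [])}
    for pos in range(n):  # positions are already in topological order
        if pos not in best:
            continue
        cost, path = best[pos]
        for start, end, surface, edge_cost in edges:
            if start != pos:
                continue
            total = cost + edge_cost
            if end not in best or total < best[end][0]:
                best[end] = (total, path + [surface])
    return best[n]


def all_paths(edges: List[Edge], n: int, pos: int = 0) -> List[List[str]]:
    """Enumerate every segmentation the lattice encodes."""
    if pos == n:
        return [[]]
    paths: List[List[str]] = []
    for start, end, surface, _ in edges:
        if start == pos:
            paths.extend([surface] + rest for rest in all_paths(edges, n, end))
    return paths


cost, segmentation = best_path(EDGES, 5)
alternatives = all_paths(EDGES, 5)
```

Here `best_path` commits to one segmentation, while `alternatives` still holds the other candidates that a downstream composition step could rescore.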

For all of its FSTs, MFA uses pynini, which provides Python bindings for OpenFst. I saw that janome has a pure Python implementation of FSTs, and I was curious whether there's interest in adding a pynini implementation or migrating to one, which should simplify things a lot and allow MFA to use any lattices directly.

If there is interest, I'm happy to put together an initial PR for it!

mocobeta commented 10 months ago

Hi @mmcauliffe, sorry for the late reply. I've been too busy recently to get involved in this issue.

I'm not very familiar with the speech-to-text domain, but it sounds exciting! Janome has a "no-dependencies" policy for flexibility and future maintenance. I'm just curious: is it possible to re-implement Pynini in Janome? Or do you think it'd be better to have a fork of Janome for MFA (a variant that integrates Pynini as the string matching engine)?