kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

Composition failure in wikipron-modeling examples #34

Closed mmcauliffe closed 3 years ago

mmcauliffe commented 3 years ago

Hi Kyle,

I've been playing around with getting a Pynini-based G2P architecture working for MFA, based on https://github.com/kylebgorman/wikipron-modeling. I'm running into an issue with the code here: https://github.com/kylebgorman/wikipron-modeling/blob/master/fst/predict.py#L40, where it reports a composition failure. I've had to correct some instances of 2.1.0 vs. 2.1.2 differences (cross instead of transducer, one instead of One), so I'm wondering if that's the issue here as well, though I don't see anything particularly relevant in the NEWS file, other than acceptor being renamed to accep, but accep isn't found while acceptor is.
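
A minimal compatibility sketch for those renames (covering only the cross/transducer and accep/acceptor pairs; not actual MFA code):

import pynini

# Hedged shim: alias whichever names the installed Pynini actually provides
# (cross/accep in 2.1.2+, transducer/acceptor in 2.1.0).
cross = getattr(pynini, "cross", None) or getattr(pynini, "transducer")
accep = getattr(pynini, "accep", None) or getattr(pynini, "acceptor")

g2p_pair = cross("cat", "k ae t")  # works under either set of names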

It might also be a case of data sparsity, since this is just a toy lexicon that I'm training on that's about 80 words. I can attach the relevant files and any intermediate files if that's helpful too.

Thanks for any insight you can provide!

kylebgorman commented 3 years ago

Hi Michael,

K

mmcauliffe commented 3 years ago

Awesome, thanks for this. I don't think it's caused by an OOV item, since I did a test where training used the full dictionary and validation took random items from it, and it still showed the same behavior. They all have the same composition failure, so it seems unlikely that it would be caused by a single character.

I'll play around with the updated version soon to see if I can get a working baseline. In case you want to take a look at the code I've been working with, I've uploaded the very hacky code to https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/tree/v2.0/montreal_forced_aligner/g2p, with the dictionary here: https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/blob/v2.0/tests/data/dictionaries/sick.txt; it can be run via pytest tests/test_g2p.py -x if you want to play around with it in the meantime. The dictionary there has some weird idiosyncrasies, like a '{cg}' word, but I don't think that's what's causing it to fail for other words.

mmcauliffe commented 3 years ago

Ok, so it turns out that the composition failures were my own fault: the FST I was loading was originally generated using Phonetisaurus. I was hoping to support using previously trained models; do you have any ideas about what changes would be needed for FSTs generated by Phonetisaurus to work under Pynini?

Also, as part of updating the example for newer versions, I ran into an issue with baumwelchtrain: in 0.3.4 it's missing a lot of the arguments that the example supplies (--expectation_table=ilabel --seed={seed} --remove_zero_arcs=false --flat_start=false --random_starts=1). I'm assuming these have been baked in and it's OK to drop them going forward, but I just want to double-check that.

kylebgorman commented 3 years ago

On Sun, Nov 22, 2020 at 7:03 AM Michael McAuliffe notifications@github.com wrote:

> Ok, so it turns out that the composition failures were my own fault: the FST I was loading was originally generated using Phonetisaurus. I was hoping to support using previously trained models; do you have any ideas about what changes would be needed for FSTs generated by Phonetisaurus to work under Pynini?

I don't know: don't they use OpenFst as a backend? If so, there's some simple homomorphism from their OpenFst FSTs to the kind expected by the Pynini setup. I'm using what I think is the trivial one; they must be doing something fancier, and you'd have to read their docs and/or look at the topology to figure it out.

> Also, as part of updating the example for newer versions, I ran into an issue with baumwelchtrain: in 0.3.4 it's missing a lot of the arguments that the example supplies (--expectation_table=ilabel --seed={seed} --remove_zero_arcs=false --flat_start=false --random_starts=1). I'm assuming these have been baked in and it's OK to drop them going forward, but I just want to double-check that.

In the last baumwelch release a lot of code was deleted, so several of those flags no longer exist.

I think most of these things should be covered in the included README, but if you see something out of date let me know or send a PR.

You're welcome also to use the old pinned version though: there's absolutely nothing wrong with it.

K

mmcauliffe commented 3 years ago

Yeah, I might stick to the current setup since it's been tested, though I certainly have a bias towards trying to get the latest versions working. It does look like ngram is having some issues with the latest OpenFst (expecting a different .so version, 1.7.x vs. 1.8.0).

For Phonetisaurus, they do use OpenFst, but the trained models have slightly different output from fstinfo:

fst type                                          vector
arc type                                          standard
input symbol table                                isyms
output symbol table                               osyms
# of states                                       1109844
# of arcs                                         2509839
initial state                                     0
# of final states                                 393489
# of input/output epsilons                        1109843
# of input epsilons                               1109843
# of output epsilons                              1109843
input label multiplicity                          1.18499
output label multiplicity                         1.11522
# of accessible states                            1109844
# of coaccessible states                          1109844
# of connected states                             1109844
# of connected components                         1
# of strongly conn components                     205463
input matcher                                     n
output matcher                                    n
input lookahead                                   n
output lookahead                                  n
expanded                                          y
mutable                                           y
error                                             n
acceptor                                          n
input deterministic                               n
output deterministic                              n
input/output epsilons                             y
input epsilons                                    y
output epsilons                                   y
input label sorted                                n
output label sorted                               n
weighted                                          y
cyclic                                            y
cyclic at initial state                           n
top sorted                                        n
accessible                                        y
coaccessible                                      y
string                                            n
weighted cycles                                   y

compared to the toy FST trained for pynini:

fst type                                          vector
arc type                                          standard
input symbol table                                none
output symbol table                               none
# of states                                       1195
# of arcs                                         2461
initial state                                     1
# of final states                                 356
# of input/output epsilons                        1194
# of input epsilons                               1252
# of output epsilons                              1539
input label multiplicity                          1.17391
output label multiplicity                         1.40065
# of accessible states                            1195
# of coaccessible states                          1195
# of connected states                             1195
# of connected components                         1
# of strongly conn components                     279
input matcher                                     n
output matcher                                    n
input lookahead                                   n
output lookahead                                  n
expanded                                          y
mutable                                           y
error                                             n
acceptor                                          n
input deterministic                               n
output deterministic                              n
input/output epsilons                             y
input epsilons                                    y
output epsilons                                   y
input label sorted                                n
output label sorted                               n
weighted                                          y
cyclic                                            y
cyclic at initial state                           n
top sorted                                        n
accessible                                        y
coaccessible                                      y
string                                            n
weighted cycles                                   y

So the main differences are in the input/output symbol tables and in the initial state (which I think is what was throwing the error).

kylebgorman commented 3 years ago

Okay, so that tells me they're using symbol tables for both inputs and outputs, whereas the output of my procedure is just an FST where each arc label is a byte. To make things compatible, you'd want to take their symbol table and create a transducer that maps each input symbol label in their table to the corresponding byte string (this is pretty easy with pynini.string_map), then take the closure. Then you'll want to do the same thing for the output side. Then if you compose

your_input_transducer @ phonetisaurus_model @ your_output_transducer

and optimize, you should have something byte-based that is roughly compatible.
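
A rough sketch of that recipe (not the actual wikipron-modeling code: the file names here are hypothetical, and it assumes every non-epsilon symbol in the Phonetisaurus tables is a plain whitespace-free string without pynini's special [ ] characters):

import pynini

def byte_mapper(table_path):
    # Byte strings on the input side, single Phonetisaurus labels on the
    # output side, closed over the whole table. table_path is the text form
    # of the symbol table ("<symbol> <label>" per line).
    table = pynini.SymbolTable.read_text(table_path)
    pairs = []
    with open(table_path) as source:
        for line in source:
            fields = line.split()
            if len(fields) >= 2 and fields[1] != "0":  # skip <epsilon>
                pairs.append((fields[0], fields[0]))
    mapper = pynini.string_map(
        pairs, input_token_type="byte", output_token_type=table)
    return mapper.closure().optimize()

model = pynini.Fst.read("phonetisaurus.fst")
graphemes = byte_mapper("isyms.txt")        # bytes -> grapheme labels
phones = byte_mapper("osyms.txt").invert()  # phone labels -> bytes
byte_model = (graphemes @ model @ phones).optimize()

If the result still carries the Phonetisaurus symbol tables, clearing them (byte_model.set_input_symbols(None), and likewise for the output side) should make it behave like the other byte-based FSTs.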

The start-state fact isn't really relevant. Usually 0 is the start state, but there are various bookkeeping reasons why, in the so-called "canonical n-gram topology", you want to use 0 as the absolute ("zero-gram") backoff state and 1 as the start state. Every possible reordering of states gives an equivalent automaton, though, because state IDs are nothing more than unique IDs drawn from a dense integer range and have no further semantics.
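
For example (hypothetical file names), the difference shows up only in the value returned by start():

import pynini

toy = pynini.Fst.read("toy_pynini.fst")
phonetisaurus = pynini.Fst.read("phonetisaurus.fst")
print(toy.start())            # 1 under the canonical n-gram topology
print(phonetisaurus.start())  # 0 here; either way it is just an ID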

kylebgorman commented 3 years ago

PS: the new Pynini 2.1.3 should have a fixed and up-to-date NEWS.

mmcauliffe commented 3 years ago

So I've gotten it mostly working with pynini=2.1.0, openfst=1.7.6, and ngram=1.3.9. I did some tests on a virtual-machine Mac install, and it looks like SymbolTables are read differently between my WSL Ubuntu install and the Mac VM.

Installed via conda create -n aligner -c conda-forge python=3.8 openfst=1.7.6 pynini=2.1.0 ngram=1.3.9 baumwelch=0.3.1. Using phones.sym (https://github.com/kylebgorman/pynini/files/5907528/phones.sym.txt):

import pynini
path = "/path/to/phones.sym"
test_table = pynini.SymbolTable.read_text(path)

test_table.find('ae') # gives 20 on Ubuntu, -1 on Mac
test_table.find(20) # gives 'ae' on Ubuntu, 'ae' on Mac

Any ideas why it's not working? Is this something fixed in more recent versions?

kylebgorman commented 3 years ago

I am aware that symbol table reading on Macs is broken in 2.1.3 (I didn't know it was broken earlier). This is hard to debug because we don't have CI resources for Macs, and I don't personally own a Mac either.

We're rewriting some of the internals in the hope that it fixes something there, but we honestly don't know why it happens. The next release (which is probably not far off: a few weeks?) will hopefully address this, though. Would you be willing to test a release candidate on your Mac, by the way?
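
In the meantime, one possible workaround sketch, assuming you have the table's text file handy, is to sidestep find() with a plain Python dict:

# Stopgap only: read the "<symbol> <label>" text file directly.
sym2label = {}
with open("/path/to/phones.sym") as source:
    for line in source:
        fields = line.split()
        if len(fields) >= 2:
            sym2label[fields[0]] = int(fields[1])

print(sym2label.get("ae", -1))  # 20, regardless of platform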

mmcauliffe commented 3 years ago

Sure, I can do that! Relatedly, do you know if there are any plans to support Windows natively for OpenFst/NGram/Pynini? I know there's some work by the Kaldi people on a Windows version of OpenFst, and I've hacked together some Visual Studio solutions for ngram in the past; it wasn't too bad from what I recall, but it's a little out of my wheelhouse and I have no idea about Pynini. It'd just be nice to be able to support all platforms with conda installs.

kylebgorman commented 3 years ago

I'm not interested in that for my own purposes, but if somebody else wants to hack it together, I'm willing to review PRs and can enable conda installs elsewhere.

(I'm also trying to get manylinux wheels working for Pynini, but I don't really know how to use Docker.)

Neither Pynini nor OpenGrm-NGram should be significantly harder than OpenFst, IMO. As it says in the OpenFst README: "It is expected to work wherever adequate POSIX (dlopen, ssize_t, basename), C99 (snprintf, strtoll, <stdint.h>), and C++17 (<unordered_set>, <unordered_map>, <forward_list>, constexpr-if) support is available."

K