Closed: mmcauliffe closed this issue 3 years ago.
Hi Michael,
The `acceptor` -> `accep` rename somehow didn't make it into the 2.1.2 release (so that's an error in the NEWS), but it will be in the (very soon) 2.1.3 release. I think I have also corrected the NEWS in the 2.1.3 release candidate.

K
Awesome, thanks for this. I don't think that it's caused by an OOV item, since I did a test where the training used the full dictionary and the validation took random items from there, and it still showed the same behavior. They all have the same composition-failure issue, so it seems unlikely to be caused by a single character.
I'll play around with the updated version soon to see if I can get a working baseline, but in case you want to take a look at the code I've been working with, I've uploaded the very hacky code to https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/tree/v2.0/montreal_forced_aligner/g2p with the dictionary here: https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/blob/v2.0/tests/data/dictionaries/sick.txt (run via `pytest tests/test_g2p.py -x`), if you want to play around with it in the meantime. The dictionary there has some weird idiosyncrasies with a '{cg}' word, but I don't think that's what's causing it to fail for other words.
Ok so it turns out that the composition failures were my own fault, the FST I was loading was generated originally using Phonetisaurus. I was hoping to support using previously trained models, do you have any ideas for what changes to FSTs generated by Phonetisaurus would be needed to work under Pynini?
Also, as part of updating the example for newer versions, I ran into an issue with baumwelchtrain where, in 0.3.4, it's missing a lot of arguments that the example supplies (`--expectation_table=ilabel --seed={seed} --remove_zero_arcs=false --flat_start=false --random_starts=1`). I'm assuming these have become baked in, so it's OK to remove them moving forward, but I just want to double-check that.
On Sun, Nov 22, 2020 at 7:03 AM Michael McAuliffe notifications@github.com wrote:
I don't know: don't they use OpenFst as a backend? If so there's some simple homomorphism from their OpenFst FSTs to the kind expected by the Pynini setup. I'm using what I think the trivial one is: they must be doing something fancy and you'd have to read their docs and/or look at the topology to figure out.
In the last baumwelch release a lot of code was deleted, including `baumwelchrandomize` and the `--decipherment` flag: it now infers that from whether the first argument is an FST (that's the decipherment case) or a FAR (that's the pair case). I think most of these things should be covered in the included README, but if you see something out of date, let me know or send a PR.

You're welcome also to use the old pinned version, though: there's absolutely nothing wrong with it.
K
Yeah, I might stick to the current setup since it's been tested, though I certainly have a bias towards trying to get the latest versions working. It does look like ngram is having some issues with the latest OpenFst (expecting a different .so version from 1.7.x vs. 1.8.0).
For Phonetisaurus, they do use openfst, but the trained models have slightly different output for fstinfo:
```
fst type                      vector
arc type                      standard
input symbol table            isyms
output symbol table           osyms
# of states                   1109844
# of arcs                     2509839
initial state                 0
# of final states             393489
# of input/output epsilons    1109843
# of input epsilons           1109843
# of output epsilons          1109843
input label multiplicity      1.18499
output label multiplicity     1.11522
# of accessible states        1109844
# of coaccessible states      1109844
# of connected states         1109844
# of connected components     1
# of strongly conn components 205463
input matcher                 n
output matcher                n
input lookahead               n
output lookahead              n
expanded                      y
mutable                       y
error                         n
acceptor                      n
input deterministic           n
output deterministic          n
input/output epsilons         y
input epsilons                y
output epsilons               y
input label sorted            n
output label sorted           n
weighted                      y
cyclic                        y
cyclic at initial state       n
top sorted                    n
accessible                    y
coaccessible                  y
string                        n
weighted cycles               y
```
compared to the toy FST trained for pynini:
```
fst type                      vector
arc type                      standard
input symbol table            none
output symbol table           none
# of states                   1195
# of arcs                     2461
initial state                 1
# of final states             356
# of input/output epsilons    1194
# of input epsilons           1252
# of output epsilons          1539
input label multiplicity      1.17391
output label multiplicity     1.40065
# of accessible states        1195
# of coaccessible states      1195
# of connected states         1195
# of connected components     1
# of strongly conn components 279
input matcher                 n
output matcher                n
input lookahead               n
output lookahead              n
expanded                      y
mutable                       y
error                         n
acceptor                      n
input deterministic           n
output deterministic          n
input/output epsilons         y
input epsilons                y
output epsilons               y
input label sorted            n
output label sorted           n
weighted                      y
cyclic                        y
cyclic at initial state       n
top sorted                    n
accessible                    y
coaccessible                  y
string                        n
weighted cycles               y
```
So the main differences are in the input/output symbol tables and in the initial state (which I think is what was throwing the error).
Okay, so that tells me they're using symbol tables for both inputs and outputs, whereas the output of my procedure is just an FST where each arc is a byte. To make things compatible you'd want to take their symbol table and create an automaton that maps each input symbol label in their symbol table to the corresponding string (this is pretty easy with `pynini.string_map`), then take the closure. Then you'll want to do the same thing for the output side. Then if you compose

`your_input_transducer @ phonetisaurus_model @ your_output_transducer`

and optimize, you should have something byte-based that is roughly compatible.
The start state fact isn't exactly relevant. Usually 0 is the start state, but there are various bookkeeping reasons why, in the so-called "canonical n-gram topology", you want to use 0 as the absolute ("zero-gram") backoff state and 1 as the start state. Every possible reordering of states is an equivalent automaton, though, because state IDs are just unique IDs drawn from a dense integer range and have no further semantics.
On Sat, Nov 21, 2020 at 6:36 PM Michael McAuliffe notifications@github.com wrote:
PS: the new Pynini 2.1.3 should have a fixed and up-to-date NEWS.
So I've gotten it mostly working with pynini=2.1.0, openfst=1.7.6, ngram=1.3.9. I did some tests on a virtual machine Mac install, and it looks to be reading SymbolTables differently between my WSL Ubuntu install and the VM Mac.
Installed via `conda create -n aligner -c conda-forge python=3.8 openfst=1.7.6 pynini=2.1.0 ngram=1.3.9 baumwelch=0.3.1`, using phones.sym (https://github.com/kylebgorman/pynini/files/5907528/phones.sym.txt):
```python
import pynini

path = "/path/to/phones.sym"
test_table = pynini.SymbolTable.read_text(path)
test_table.find('ae')  # gives 20 on Ubuntu, -1 on Mac
test_table.find(20)    # gives 'ae' on Ubuntu and on Mac
```
Any ideas why it's not working? Is this something fixed in more recent versions?
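As a sanity check that is independent of the compiled SymbolTable reader, the text format can be parsed in a few lines of plain Python. This is a hypothetical helper, not part of pynini; it assumes the usual OpenFst text layout of one whitespace-separated `symbol id` pair per line:

```python
def read_sym_text(path):
    """Parses an OpenFst-style text symbol table (one whitespace-separated
    `symbol id` pair per line) into forward and reverse lookup dicts."""
    sym2id, id2sym = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            symbol, idx = fields[0], int(fields[1])
            sym2id[symbol] = idx
            id2sym[idx] = symbol
    return sym2id, id2sym
```

Comparing `sym2id.get('ae', -1)` from this parser against `test_table.find('ae')` on each platform would show whether the file itself, or pynini's reader, is at fault on the Mac.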
I am aware that symbol table reading on Macs is broken in 2.1.3 (I didn't know it was broken earlier). This is hard to debug because we don't have CI resources for Macs, and I don't personally own a Mac either.
We're rewriting some of the internals in the hope that that fixes something there, but we honestly don't know why it happens. The next release (which is probably not far off: a few weeks?) will hopefully address this, though. Would you be willing to test a release candidate on your Mac, BTW?
On Mon, Feb 1, 2021 at 6:54 PM Michael McAuliffe notifications@github.com wrote:
Sure, I can do that! Relatedly, do you know if there are any plans to support Windows natively for OpenFst/Ngram/Pynini? I know there's some work by the Kaldi people on a Windows version of OpenFst, and I've hacked together some Visual Studio solutions for ngram in the past; it wasn't too bad from what I recall, but it is a little out of my wheelhouse, and I have no idea about Pynini. It'd just be nice to be able to support all platforms with conda installs.
On Mon, Feb 1, 2021 at 7:06 PM Michael McAuliffe notifications@github.com wrote:
Not interested in that for my own purposes, but if somebody else wants to hack it together I'm willing to review PRs and can enable Conda install elsewhere.
(I'm also trying to get a manylinux wheel working for Pynini, but I don't really know how to use Docker.)
Neither Pynini nor OpenGrm-NGram should be significantly harder than OpenFst, IMO. As it says in the OpenFst README: "It is expected to work wherever adequate POSIX (dlopen, ssize_t, basename), C99 (snprintf, strtoll, ..."
Hi Kyle,
Been playing around with getting a Pynini-based G2P architecture working for MFA based on https://github.com/kylebgorman/wikipron-modeling. I'm running into an issue with the code here: https://github.com/kylebgorman/wikipron-modeling/blob/master/fst/predict.py#L40, where it's reporting composition failure. I've had to correct some instances of 2.1.0 vs. 2.1.2 issues (`cross` instead of `transducer`, `one` instead of `One`), so I'm wondering if that's the issue here as well, though I don't see anything particularly relevant in the NEWS.md, other than `acceptor` being renamed to `accep`; but `accep` isn't found while `acceptor` is. It might also be a case of data sparsity, since this is just a toy lexicon that I'm training on, about 80 words. I can attach the relevant files and any intermediate files if that's helpful too.
Thanks for any insight you can provide!