Closed: david-waterworth closed this issue 2 years ago (https://github.com/kylebgorman/pynini/issues/56)
The LM doesn't assign any probability to `<unk>` by default. OpenGrm-NGram generates an `<unk>` symbol, but the model has to be trained so that the symbol actually receives probability mass.
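One way to check this against the trained model (a minimal sketch; the `lm.fst` path and the `<unk>` spelling are assumptions) is to look for arcs that carry the `<unk>` label:

```python
# Sketch: does the compiled LM have any arc labelled <unk>?
# "lm.fst" and the "<unk>" spelling are placeholders.
import pynini

lm = pynini.Fst.read("lm.fst")
syms = lm.input_symbols()
if syms is None or not syms.member("<unk>"):
    print("<unk> is not in the LM's symbol table")
else:
    unk = syms.find("<unk>")
    has_mass = any(arc.ilabel == unk for s in lm.states() for arc in lm.arcs(s))
    print("LM has <unk> arcs:", has_mass)
```

If no such arcs exist, no amount of lattice plumbing on the pynini side will help.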
On Fri, Aug 5, 2022 at 3:00 AM David Waterworth wrote:
I'm trying to adapt the ChatSpeak model, but I noticed that it fails on out-of-vocab tokens. There's an issue with the way the model builds the lattice by processing each token in turn:
```python
it = iter(sentence.split())
token = next(it)
lattice = self.token_lattice(token)
for token in it:
    lattice.concat(" ")
    lattice.concat(self.token_lattice(token))
```
The issue is that if `token` is OOV then `self.token_lattice` returns an empty `pynini.Fst()`. The `bytes_to_lm_mapper` is defined as `expr (sep expr)^*`, where `expr` is `lm_mapper` and `sep` is `<space>`, so the code above can generate a lattice that isn't accepted by this pattern if any one token cannot be expanded. I can easily work around this by modifying the code above (in fact, I think if you concat anything to an empty `pynini.Fst()`, the result is an empty `pynini.Fst()`).
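A minimal sketch of one such modification (assumptions: skipping the offending token is acceptable, and an empty `pynini.Fst()` reliably signals an OOV token; `self.token_lattice` and `sentence` are as in the ChatSpeak decoder):

```python
# Sketch of the workaround: skip tokens whose lattice comes back empty, so a
# single OOV token no longer empties the whole sentence lattice.
# (What to do with the skipped token is a separate question.)
def sentence_lattice(self, sentence):
    lattice = None
    for token in sentence.split():
        piece = self.token_lattice(token)
        if piece.num_states() == 0:  # empty FST: the token could not be expanded
            continue
        if lattice is None:
            lattice = piece
        else:
            lattice.concat(" ")
            lattice.concat(piece)
    return lattice
```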
What I thought about trying, but have so far been unsuccessful at, is using the language model's `<unk>` token. The language model's symbols contain `<unk>`, and this code generates what I think should be valid as far as the LM is concerned: it produces an FST with two nodes and one arc carrying the correct id for the `<unk>` token:

```python
lattice = rewrite.rewrite_lattice('<unk>', bytes_to_lm_mapper)
```

But the next step fails with a composition error:
```python
lattice = rewrite.rewrite_lattice(lattice, lm)
```
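One way to narrow down whether the problem is the lattice construction or the LM itself (a sketch, not a confirmed fix; it assumes `lm` carries its input symbol table) is to build the `<unk>` acceptor directly over the LM's own symbols and compose the two:

```python
# Sketch: test the LM directly, bypassing bytes_to_lm_mapper.
# Assumes `lm` has its input symbol table attached; accep() raises if "<unk>"
# is not in that table.
import pynini

unk = pynini.accep("<unk>", token_type=lm.input_symbols())
direct = pynini.compose(unk, lm)
print(direct.connect().num_states())  # 0 means the LM itself gives <unk> no path
```

If this composition is empty as well, the fix is probably on the training side rather than in pynini.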
I'm not sure whether this is something I can solve with pynini, or whether I haven't trained the language model properly (i.e. so that it assigns a non-zero probability to the `<unk>` token). Any suggestions appreciated.
Thanks, I thought perhaps smoothing might assign a low probability to `<unk>`. I'll give your suggestions a go.
I also noticed that despite the OpenGrm-NGram documentation saying the default is `<unk>`, it was actually `<UNK>`, but I've been unable to activate my account on their forum to raise this.
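A quick way to confirm which spelling the trained model actually ended up with (a sketch; `lm.fst` is a placeholder path, and it assumes the symbol table was written into the FST):

```python
# Sketch: check which OOV spelling the model's symbol table contains.
import pynini

syms = pynini.Fst.read("lm.fst").input_symbols()
for spelling in ("<unk>", "<UNK>"):
    print(spelling, "in symbol table:", syms.member(spelling))
```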