kylebgorman / pynini

Read-only mirror of Pynini
http://pynini.opengrm.org
Apache License 2.0

Handling OOV in ChatSpeak #56

Closed: david-waterworth closed this 2 years ago

david-waterworth commented 2 years ago

I'm trying to adapt the ChatSpeak model, but I noticed that it fails on out-of-vocabulary (OOV) tokens. There's an issue with the way the model builds the lattice by processing each token in turn:

it = iter(sentence.split())
token = next(it)
lattice = self.token_lattice(token)          # lattice for the first token
for token in it:
  lattice.concat(" ")                        # space separator between tokens
  lattice.concat(self.token_lattice(token))  # append the next token's lattice

The issue is that if a token is OOV then self.token_lattice returns an empty pynini.Fst(). The bytes_to_lm_mapper is defined as expr (sep expr)^* where expr is lm_mapper and sep is <space>. So the code above can generate a lattice that isn't accepted by this pattern if any one token cannot be expanded. I can easily work around this by modifying the code above. (In fact I think if you concat anything onto an empty pynini.Fst() the result is an empty pynini.Fst().)
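For example, something along these lines avoids the problem; skipping OOV tokens entirely, rather than passing them through unchanged, is just one possible policy:

lattice = None
for token in sentence.split():
  token_lattice = self.token_lattice(token)
  if token_lattice.num_states() == 0:
    continue  # OOV token: token_lattice is the empty FST, so leave it out
  if lattice is None:
    lattice = token_lattice
  else:
    lattice.concat(" ")
    lattice.concat(token_lattice)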

What I've tried, so far unsuccessfully, is to use the language model's <unk> token. The language model's symbol table contains <unk>, and this code generates what I think should be valid as far as the LM is concerned: it produces an FST with two states and one arc carrying the correct id for the <unk> token:

lattice = rewrite.rewrite_lattice('<unk>', bytes_to_lm_mapper)

But the next step fails with a composition error

lattice = rewrite.rewrite_lattice(lattice, lm)

I'm not sure whether this is something I can solve with pynini, or whether I haven't trained the language model properly (i.e., trained it to assign a non-zero probability to the <unk> token). Any suggestions appreciated.
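For what it's worth, a quick check like this (assuming the LM FST carries its symbol table) should show whether <unk> actually appears on any arc of the LM, which I suspect is the reason the composition fails:

unk_label = lm.input_symbols().find("<unk>")
has_unk_arc = any(
  arc.ilabel == unk_label
  for state in lm.states()
  for arc in lm.arcs(state)
)
print("LM has an <unk> arc:", has_unk_arc)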

kylebgorman commented 2 years ago

The LM doesn't assign any probability to unless you "unk" the data ahead of time (i.e., replace very low frequency tokens with ), or if you go into the LM and insert an self-arc from the zerogram state (this will be the one state such that there is an epsilon/0 transition from the start state to it) with some small probability. I have done both at various times.
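The first option is just a text preprocessing step before you train the LM. A minimal sketch, assuming one whitespace-tokenized sentence per line in train.txt and an arbitrary frequency cutoff of 2:

from collections import Counter

counts = Counter()
with open("train.txt") as source:
  for line in source:
    counts.update(line.split())

with open("train.txt") as source, open("train.unked.txt", "w") as sink:
  for line in source:
    tokens = [t if counts[t] > 2 else "<unk>" for t in line.split()]
    print(" ".join(tokens), file=sink)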

OpenGrm-NGram generates an <unk> symbol when you auto-generate a symbol table, but it doesn't add it to the LM itself automatically unless it occurs in the data.
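As for the second option (the zerogram self-arc), something along these lines should work. This is a rough sketch: the file name lm.fst and the cost 13.8 (roughly a probability of 1e-6) are placeholders, and <unk> is assumed to already be in the LM's symbol table.

import pynini

lm = pynini.Fst.read("lm.fst")
unk_label = lm.input_symbols().find("<unk>")

# The backoff (epsilon/0) arc out of the start state points at the zerogram state.
zerogram_state = None
for arc in lm.arcs(lm.start()):
  if arc.ilabel == 0:
    zerogram_state = arc.nextstate
    break

# Add an <unk> self-arc there with a small, hand-picked probability.
unk_cost = pynini.Weight(lm.weight_type(), 13.8)
lm.add_arc(zerogram_state, pynini.Arc(unk_label, unk_label, unk_cost, zerogram_state))
lm.write("lm_with_unk.fst")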

david-waterworth commented 2 years ago

Thanks, I thought perhaps smoothing might assign it a low probability. I'll give your suggestions a go.

I also noticed that although the OpenGrm-NGram documentation says the default is <unk>, it is actually <UNK>, but I've been unable to activate my account on their forum to raise this.