kylebgorman / pynini


Pairs Language Model #61

Closed · david-waterworth closed this issue 1 year ago

david-waterworth commented 1 year ago

Hi Kyle

I'm wondering if you could provide me with a bit of guidance on how to apply a pairs/joint language model; I'm using your paper "Structured Abbreviation Expansion in Context" for inspiration.

Based on Roark et al. 2014, I think what I should do is train a conventional n-gram model on the pairs, i.e., treat each pair in b:b, r:r, ε:e, ε:a, d:d as a separate symbol? I've done this. It looks like what Roark then did was convert the arcs of the pairs FST by splitting each pair symbol (e.g., ε:e) and assigning ε to the input label and e to the output label of the arc that replaces the original; is this correct? If so, what's the best way to do this in pynini? I'm assuming you need to create a new FST containing all the original states and regenerate all the arcs, as in the sketch below?
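
To make the question concrete, here is roughly what I'm picturing. I'm assuming pynini's `EncodeMapper` with `encode_labels=True` is the intended way to pack label pairs into acceptor symbols, and that `decode` with the same mapper splits them back; the string pair is just a toy example:

```python
import pynini

# Toy aligned transducer standing in for one training pair
# ("brd" -> "bread", with epsilons marking the insertions).
pair_fst = pynini.cross("brd", "bread")

# Pack each (input, output) label pair into a single new label so the
# transducer can be treated as an acceptor over pair symbols.
mapper = pynini.EncodeMapper("standard", encode_labels=True)
pair_fst.encode(mapper)  # in place; pair_fst is now an acceptor

# ... this is where the n-gram model would be trained over the
# encoded acceptors ...

# Decoding with the same mapper splits each pair symbol back into
# separate input/output labels, so (if I understand correctly) no
# manual arc surgery is needed.
pair_fst.decode(mapper)
```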

And then, given that the pairs model yields a joint distribution P(a_i, e_i) while the requirement is a conditional distribution P(a_i | e_i), I'm assuming you compute P(a_i, e_i) / P(e_i). My initial thought is that P(e_i) is basically a unigram model over the expanded vocabulary: if I created an FST containing each word in the vocabulary with weight equal to the negative log of its unigram probability, then applying this after the joint-probability model would give the conditional probability. Am I on the right track?
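
Spelling out the arithmetic I have in mind (so, if I'm reading this right, the unigram costs would have to enter with flipped sign, since we're dividing):

$$P(a_i \mid e_i) = \frac{P(a_i, e_i)}{P(e_i)}, \qquad -\log P(a_i \mid e_i) = -\log P(a_i, e_i) + \log P(e_i)$$

That is, in negative-log weights the unigram acceptor would carry +log P(e_i) on each word, and composing it with the joint model adds those weights to the joint costs.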

kylebgorman commented 1 year ago

Yes, you align the strings (using epsilon on one side for insertion or deletion), encode the pair machines as acceptors (this is a lossless representation: each input/output pair of labels is a new acceptor label), train an LM off the encoded pair machines, and then decode the LM back to a transducer. We have a full implementation here:

https://github.com/google-research/google-research/tree/master/pair_ngram
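
In outline, the pipeline looks something like the sketch below. This is just a sketch, not a drop-in replacement for the linked code: the toy pairs, file names, and the OpenGrm NGram command-line invocations (assumed installed; flags may differ across versions) are all illustrative.

```python
import subprocess
import pynini

# 1. Align each (input, output) pair and encode its label pairs as
#    acceptor labels; collect the results in a FAR.
mapper = pynini.EncodeMapper("standard", encode_labels=True)
far = pynini.Far("pairs.far", mode="w")
for i, (inp, out) in enumerate([("brd", "bread"), ("thx", "thanks")]):
    fst = pynini.cross(inp, out)  # stand-in for a proper alignment
    fst.encode(mapper)
    far.add(f"{i:08d}", fst)  # FAR keys must be added in sorted order
far.close()

# 2. Train the LM over the encoded acceptors with OpenGrm NGram
#    (--require_symbols=false because the encoded labels carry no
#    symbol table).
subprocess.check_call(
    ["ngramcount", "--order=6", "--require_symbols=false",
     "pairs.far", "pairs.cnts"])
subprocess.check_call(
    ["ngrammake", "--method=kneser_ney", "pairs.cnts", "pairs.lm"])

# 3. Decode the LM back into a pair transducer with the same mapper.
lm = pynini.Fst.read("pairs.lm")
lm.decode(mapper)
lm.write("pairs.decoded.fst")
```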

This is then a joint distribution. But I actually don't recommend converting to the conditional distribution; it's extremely hard to implement and I've found it performance-negative. (And we don't have an implementation to give you here.)