This proves to be quite a difficult problem and the implementation I now have still delivers poor quality. Let me describe my current implementation for future reference (i.e. so I don't forget it myself). I had hoped for a solution inspired by phrase-based statistical machine translation, where I combine a variant model and a language model and attempt to optimize their joint probability. All possible segmentations of an input sentence are represented in a Finite State Transducer such as the following for the simple input sentence "het is een huys", which should be corrected to "het is een huis":
The labels are of the form input:output (id)/score, and the scores here are logprobs representing the emission scores from the variant model. The logprobs are > 0 because they are passed through abs() to reformulate the maximisation problem as a minimisation problem (needed for the underlying implementation). This simple example works nicely; heavy pruning has already taken place to discard many variants. The main aim of the FST solution is to handle multiple possible (overlapping) segmentations and find the best one, which is an inherent part of the problem as soon as we move from simple words to n-grams. The system dynamically selects either unigrams or n-grams, depending on which leads to the optimal solution.
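To make this concrete, here is a minimal Python sketch of the underlying idea (the actual implementation is in Rust and uses a real FST; the candidate variants and probabilities below are invented for illustration): all candidate segmentations form a lattice over token boundaries, and we take the least-cost path, where each transition costs abs(logprob).

```python
import math

# Hypothetical pruned variant lattice for "het is een huys": each (start, end)
# token span maps to candidate corrections with emission probabilities from
# the variant model. All values are invented for illustration.
variants = {
    (0, 1): [("het", 0.95)],
    (1, 2): [("is", 0.97)],
    (2, 3): [("een", 0.96)],
    (3, 4): [("huis", 0.90), ("huls", 0.40), ("hups", 0.35)],
    (2, 4): [("een huis", 0.92)],  # a bigram variant spanning two tokens
}

def best_path(variants, n_tokens):
    """Shortest path over the segmentation lattice, minimising the summed
    abs(logprob) costs (equivalent to maximising the product of the
    emission probabilities)."""
    INF = float("inf")
    cost = [0.0] + [INF] * n_tokens   # best cost to reach each token boundary
    back = [None] * (n_tokens + 1)    # backpointer: (start, output)
    for end in range(1, n_tokens + 1):
        for (start, span_end), candidates in variants.items():
            if span_end != end or cost[start] == INF:
                continue
            for output, p in candidates:
                c = cost[start] + abs(math.log(p))  # logprob -> positive cost
                if c < cost[end]:
                    cost[end], back[end] = c, (start, output)
    path, i = [], n_tokens               # reconstruct the winning segmentation
    while i > 0:
        start, output = back[i]
        path.append(output)
        i = start
    return list(reversed(path)), cost[n_tokens]

print(best_path(variants, 4))  # -> (['het', 'is', 'een huis'], ...)
```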
The top n best routes are extracted from the FST, and then the language model comes in: each complete sequence is scored according to a simple bigram language model (with +1 smoothing). For each sequence we then obtain a logprob from the LM, along with a variant logprob (the sum of the variant scores along the path that was followed). The heart of the matter is that these two scores 'compete' with one another and their combination needs to be optimized. Their units are not directly comparable, however, so to remedy this I attempt a kind of normalisation in which the best LM score is cast to 1 and the best variant sum is also cast to 1. The worst scores are cast to 0, so everything ends up in the range 1.0 (best) to 0.0 (worst). This may still be too arbitrary, but it's a start.
The final score is a simple weighted linear combination (LM weight and variant model weight are parameters that by default are both set to 1).
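A minimal Python sketch of this normalisation and combination (the actual implementation is in Rust; the function names here are mine). Given the norm values shown in the ranking below, the combination formula reproduces the listed scores, e.g. (1 + 0.6010463…) / 2 = 0.8005231… for candidate #2:

```python
def normalise(logprobs):
    """Min-max normalise so the best (highest) logprob maps to 1.0 and the
    worst maps to 0.0 (over all extracted candidates)."""
    best, worst = max(logprobs), min(logprobs)
    if best == worst:
        return [1.0] * len(logprobs)
    return [(lp - worst) / (best - worst) for lp in logprobs]

def rank(sequences, lm_weight=1.0, variant_weight=1.0):
    """sequences: (text, lm_logprob, variant_logprob) tuples for the top-n
    paths extracted from the FST; returns them ranked by the weighted
    linear combination of the two normalised scores."""
    lm_norm = normalise([lm for _, lm, _ in sequences])
    var_norm = normalise([v for _, _, v in sequences])
    scored = [(text, (lm_weight * l + variant_weight * v) / (lm_weight + variant_weight))
              for (text, _, _), l, v in zip(sequences, lm_norm, var_norm)]
    return sorted(scored, key=lambda s: s[1], reverse=True)
```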
Ranked sequences for this example:
(#1, score=1, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-0.56381285 (norm=1, * 1)
(text=het | is | een huis | )
(#2, score=0.8005231697097598, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-0.79779273 (norm=0.6010463394195196, * 1)
(text=het | is | een | huis | )
(#3, score=0.7036817883037577, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-0.9347367 (norm=0.40736357660751543, * 1)
(text=het is | een huis | )
(#4, score=0.6042895088728826, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-1.097997 (norm=0.20857901774576515, * 1)
(text=het | is een | huis | )
(#5, score=0.5660235346245384, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-1.1687167 (norm=0.13204706924907683, * 1)
(text=het is | een | huis | )
(#6, score=0.5547448040105464, lm_logprob=-45.22972 (norm=0.5475822439937876, * 1), variant_logprob=-0.8239951 (norm=0.5619073640273051, * 1)
(text=het | is | een | huls | )
(#7, score=0.5088897933617237, lm_logprob=-40.976673 (norm=0.8625842909642008, * 1), variant_logprob=-1.146794 (norm=0.15519529575924662, * 1)
(text=het | IS een | huis | )
(#8, score=0.494419030242586, lm_logprob=-46.7178 (norm=0.43736764973044695, * 1), variant_logprob=-0.8310999 (norm=0.551470410754725, * 1)
(text=het | is | een | hups | )
(#9, score=0.49353749309818384, lm_logprob=-44.827477 (norm=0.5773744927426494, * 1), variant_logprob=-0.9329675 (norm=0.4097004934537183, * 1)
(text=het | is | een | Huys . | )
(#10, score=0.3635861978357696, lm_logprob=-45.22972 (norm=0.5475822439937876, * 1), variant_logprob=-1.1241993 (norm=0.1795901516777516, * 1)
(text=het | is een | huls | )
Issues I've encountered thus far with this approach are best illustrated by a second example, where the input "hij antwoord wat onzeker" should be corrected to "hij antwoordt wat onzeker":
(#1, score=0.9489898049907576, lm_logprob=-13.911423 (norm=1, * 1), variant_logprob=-0.74414515 (norm=0.8979796099815152, * 1)
(text=hij | antwoord wat | onzeker | )
(#2, score=0.5897099496489611, lm_logprob=-13.911423 (norm=1, * 1), variant_logprob=-0.90142846 (norm=0.17941989929792213, * 1)
(text=hij antwoord | wat onzeker | )
(#3, score=0.5, lm_logprob=-50.058037 (norm=0, * 1), variant_logprob=-0.7236924 (norm=1, * 1)
(text=hij | antwoordt | wat onzeker | )
(#4, score=0.5, lm_logprob=-13.911423 (norm=1, * 1), variant_logprob=-0.94488895 (norm=0, * 1)
(text=hij antwoord | wat | onzeker | )
(#5, score=0.39284045326154515, lm_logprob=-50.058037 (norm=0, * 1), variant_logprob=-0.7671529 (norm=0.7856809065230903, * 1)
(text=hij | antwoordt | wat | onzeker | )
The actual issue here seems to be that the LM is steering us away from the right answer in this case, because it really likes the bigram "antwoord wat". A stronger LM (trigrams?) may be an idea here. But the point I wanted to raise is about how to make these comparisons in each of the triangles, and whether it's fair to compare the variant model scores of unigrams and n-grams as we do here.
I've been experimenting with a revised scoring mechanism today to establish a better common ground for comparing the variant scores of unigrams and higher-order n-grams (point 3 from my previous comment).
The following FST demonstrates this experiment:
The variant scores in this FST consist of a base component proportional to the number of tokens spanned (e.g. 1 for unigrams, 2 for bigrams, etc.). The values in the FST can be interpreted as costs to be minimised (shortest path). The fractional part after the decimal point encodes the variant score (in the range (0, 1.0)) as before, but inverted (1.0 - variant score). This means that if we have two unigrams with a perfect variant score (0 + 0), we get base cost 1 + 1 for the combination; if these same two tokens can be covered by a bigram, and this bigram also has a perfect variant score (0) on top of its base cost of 2, then the costs are identical and we have established some common ground for comparison. In that ideal case we'd have a tie, and only the LM could break it. In reality this rarely happens and one route or the other tends to be favoured.
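As a tiny sketch of the revised transition cost (my formulation of the scheme described above):

```python
def fst_cost(tokens_spanned, variant_score):
    """Cost of an FST transition under the revised scheme: an integer base
    equal to the number of input tokens spanned, plus the inverted variant
    score (variant scores lie in (0, 1.0], higher is better) as the
    fractional part. Lower cost is better."""
    return tokens_spanned + (1.0 - variant_score)

# Two perfect unigrams (1.0 + 1.0) cost exactly as much as one perfect
# bigram covering the same two tokens (2.0): a tie that only the LM breaks.
assert fst_cost(1, 1.0) + fst_cost(1, 1.0) == fst_cost(2, 1.0)
```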
I do seem to get somewhat better results with this approach, but it feels fairly ad hoc/contrived. There's also still the question of how to strike a balance between the language model and the variant model (point 1). I made some improvements in that regard too: I now use perplexity instead of the raw LM logprob, and map this perplexity to the range 1.0 (best) ~ 0.0, without forcing an actual lower bound like I did before (the worst perplexity found no longer gets mapped to 0, nor is the worst variant score mapped to 0 any longer). The actual score computation that combines both the LM and the variant model is a weighted geometric mean.
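In sketch form (the perplexity-to-score mapping below is my assumption; the text only states the target range, not the exact formula), the revised combination could look like this:

```python
import math

def perplexity(lm_logprob, n_tokens):
    """Perplexity from the LM logprob (assuming natural logs here)."""
    return math.exp(-lm_logprob / n_tokens)

def lm_score(ppl, best_ppl):
    """Map perplexity into (0, 1]: the lowest perplexity among the candidates
    maps to 1.0, worse candidates decay towards (but never reach) 0, i.e. no
    forced lower bound. This particular mapping is an assumption."""
    return best_ppl / ppl

def combine(lm_norm, variant_norm, lm_weight=1.0, variant_weight=1.0):
    """Weighted geometric mean of the two component scores."""
    total = lm_weight + variant_weight
    return (lm_norm ** lm_weight * variant_norm ** variant_weight) ** (1.0 / total)
```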
At the end of July, I also implemented an alternative approach (enabled using --context-weight) to consider local context information: rather than applying the language model over the top-x full candidate solutions, I consider context at a much earlier stage and pull on the existing variant scores in a rescoring stage that takes input context into account. In this situation, unlike in the earlier implementation, the context is non-normalised input context (i.e. prior to any further handling by analiticcl), which is a disadvantage. The advantage of this approach is that the context score is computed over a local window, which is more manageable. The remaining task is then just to look up the shortest path in the FST again, but with the context scores already integrated.
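A rough, hypothetical Python sketch of this local rescoring idea (the function name, signature and interpolation formula are my assumptions; only the general idea and the --context-weight parameter come from the above):

```python
import math

def rescore_with_context(variant_score, variant, left_token, right_token,
                         bigram_logprob, context_weight=1.0):
    """Pull an existing variant score towards how well the candidate fits
    between its immediate *input* neighbours (un-normalised context),
    before the shortest-path search over the FST."""
    tokens = variant.split()
    context_logprob = (bigram_logprob(left_token, tokens[0]) +
                       bigram_logprob(tokens[-1], right_token))
    context_score = math.exp(context_logprob)  # probability-like value in (0, 1]
    # Weighted interpolation between the variant score and the context score.
    return (variant_score + context_weight * context_score) / (1.0 + context_weight)
```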
Both approaches are currently available in the implementation, for testing.
Search mode is the error detection stage of analiticcl: running text is passed as input, and analiticcl attempts to identify and correct the parts of the text that require correction, regardless of where in the text they occur or whether they concern single words or larger n-grams (i.e. run-on and split errors should be supported). This is the stage where context plays an important role (unlike in analiticcl's simpler query mode).
A first version of a search mode has been implemented as follows:
However, this still requires extensive testing and tweaking, as the current results remain sub-optimal.