This proves to be quite a difficult problem and the implementation I now have still delivers poor quality. Let me describe my current implementation for future reference (i.e. so I don't forget it myself). I had hoped for a solution inspired by phrase-based statistical machine translation, where I combine a variant model and a language model and attempt to optimize their joint probability. All possible segmentations of an input sentence are represented in a Finite State Transducer such as the following for the simple input sentence "het is een huys", which should be corrected to "het is een huis":
The labels are of the form input:output (id)/score, and the scores here are logprobs representing the emission scores from the variant model. The logprobs are > 0 because they are passed through abs() to reformulate the maximisation problem as a minimisation problem (needed for the underlying implementation). This simple example works nicely; heavy pruning has already taken place to discard many variants. The main aim of the FST solution is to handle multiple possible (overlapping) segmentations and find the best one, which is an inherent part of the problem as soon as we move from simple words to n-grams. The system dynamically selects either unigrams or n-grams, depending on which leads to the optimal solution.
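To make this concrete, here is a minimal Python sketch of the underlying idea (the actual implementation is in Rust and uses a real FST; the candidate variants and probabilities below are invented for illustration): all candidate segmentations form a lattice over token boundaries, and we take the least-cost path, where each transition costs abs(logprob).

```python
import math

# Hypothetical pruned variant lattice for "het is een huys": each (start, end)
# token span maps to candidate corrections with emission probabilities from
# the variant model. All values are invented for illustration.
variants = {
    (0, 1): [("het", 0.95)],
    (1, 2): [("is", 0.97)],
    (2, 3): [("een", 0.96)],
    (3, 4): [("huis", 0.90), ("huls", 0.40), ("hups", 0.35)],
    (2, 4): [("een huis", 0.92)],  # a bigram variant spanning two tokens
}

def best_path(variants, n_tokens):
    """Shortest path over the segmentation lattice, minimising the summed
    abs(logprob) costs (equivalent to maximising the product of the
    emission probabilities)."""
    INF = float("inf")
    cost = [0.0] + [INF] * n_tokens   # best cost to reach each token boundary
    back = [None] * (n_tokens + 1)    # backpointer: (start, output)
    for end in range(1, n_tokens + 1):
        for (start, span_end), candidates in variants.items():
            if span_end != end or cost[start] == INF:
                continue
            for output, p in candidates:
                c = cost[start] + abs(math.log(p))  # logprob -> positive cost
                if c < cost[end]:
                    cost[end], back[end] = c, (start, output)
    path, i = [], n_tokens               # reconstruct the winning segmentation
    while i > 0:
        start, output = back[i]
        path.append(output)
        i = start
    return list(reversed(path)), cost[n_tokens]

print(best_path(variants, 4))  # -> (['het', 'is', 'een huis'], ...)
```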
The top n best routes are extracted from the FST, and then the language model comes in: each complete sequence is scored according to a simple bigram language model (with +1 smoothing). For each sequence we then obtain a logprob from the LM, along with a variant logprob (the sum of the variant scores along the path that was followed). The heart of the matter is that these two scores 'compete' with one another and their combination needs to be optimized. Their units are not directly comparable, however, so to remedy this I attempt a kind of normalisation in which the best LM score is cast to 1 and the best variant sum is also cast to 1. The worst scores are cast to 0, so everything ends up in the range 1.0 (best) to 0.0 (worst). This may still be too arbitrary, but it's a start.
The final score is a simple weighted linear combination (LM weight and variant model weight are parameters that by default are both set to 1).
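A minimal Python sketch of this normalisation and combination (the actual implementation is in Rust; the function names here are mine). Given the norm values shown in the ranking below, the combination formula reproduces the listed scores, e.g. (1 + 0.6010463…) / 2 = 0.8005231… for candidate #2:

```python
def normalise(logprobs):
    """Min-max normalise so the best (highest) logprob maps to 1.0 and the
    worst maps to 0.0 (over all extracted candidates)."""
    best, worst = max(logprobs), min(logprobs)
    if best == worst:
        return [1.0] * len(logprobs)
    return [(lp - worst) / (best - worst) for lp in logprobs]

def rank(sequences, lm_weight=1.0, variant_weight=1.0):
    """sequences: (text, lm_logprob, variant_logprob) tuples for the top-n
    paths extracted from the FST; returns them ranked by the weighted
    linear combination of the two normalised scores."""
    lm_norm = normalise([lm for _, lm, _ in sequences])
    var_norm = normalise([v for _, _, v in sequences])
    scored = [(text, (lm_weight * l + variant_weight * v) / (lm_weight + variant_weight))
              for (text, _, _), l, v in zip(sequences, lm_norm, var_norm)]
    return sorted(scored, key=lambda s: s[1], reverse=True)
```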
Ranked sequences for this example:
(#1, score=1, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-0.56381285 (norm=1, * 1)
(text=het | is | een huis | )
(#2, score=0.8005231697097598, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-0.79779273 (norm=0.6010463394195196, * 1)
(text=het | is | een | huis | )
(#3, score=0.7036817883037577, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-0.9347367 (norm=0.40736357660751543, * 1)
(text=het is | een huis | )
(#4, score=0.6042895088728826, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-1.097997 (norm=0.20857901774576515, * 1)
(text=het | is een | huis | )
(#5, score=0.5660235346245384, lm_logprob=-39.121334 (norm=1, * 1), variant_logprob=-1.1687167 (norm=0.13204706924907683, * 1)
(text=het is | een | huis | )
(#6, score=0.5547448040105464, lm_logprob=-45.22972 (norm=0.5475822439937876, * 1), variant_logprob=-0.8239951 (norm=0.5619073640273051, * 1)
(text=het | is | een | huls | )
(#7, score=0.5088897933617237, lm_logprob=-40.976673 (norm=0.8625842909642008, * 1), variant_logprob=-1.146794 (norm=0.15519529575924662, * 1)
(text=het | IS een | huis | )
(#8, score=0.494419030242586, lm_logprob=-46.7178 (norm=0.43736764973044695, * 1), variant_logprob=-0.8310999 (norm=0.551470410754725, * 1)
(text=het | is | een | hups | )
(#9, score=0.49353749309818384, lm_logprob=-44.827477 (norm=0.5773744927426494, * 1), variant_logprob=-0.9329675 (norm=0.4097004934537183, * 1)
(text=het | is | een | Huys . | )
(#10, score=0.3635861978357696, lm_logprob=-45.22972 (norm=0.5475822439937876, * 1), variant_logprob=-1.1241993 (norm=0.1795901516777516, * 1)
(text=het | is een | huls | )
Issues I've encountered thus far with this approach are best illustrated by a second example, where the input "hij antwoord wat onzeker" should be corrected to "hij antwoordt wat onzeker":
(#1, score=0.9489898049907576, lm_logprob=-13.911423 (norm=1, * 1), variant_logprob=-0.74414515 (norm=0.8979796099815152, * 1)
(text=hij | antwoord wat | onzeker | )
(#2, score=0.5897099496489611, lm_logprob=-13.911423 (norm=1, * 1), variant_logprob=-0.90142846 (norm=0.17941989929792213, * 1)
(text=hij antwoord | wat onzeker | )
(#3, score=0.5, lm_logprob=-50.058037 (norm=0, * 1), variant_logprob=-0.7236924 (norm=1, * 1)
(text=hij | antwoordt | wat onzeker | )
(#4, score=0.5, lm_logprob=-13.911423 (norm=1, * 1), variant_logprob=-0.94488895 (norm=0, * 1)
(text=hij antwoord | wat | onzeker | )
(#5, score=0.39284045326154515, lm_logprob=-50.058037 (norm=0, * 1), variant_logprob=-0.7671529 (norm=0.7856809065230903, * 1)
(text=hij | antwoordt | wat | onzeker | )
The actual issue here seems to be that the LM is steering us away from the right answer in this case, because it really likes the bigram "antwoord wat". A stronger LM (trigrams?) may be an idea here. But the point I wanted to raise is about how to make these comparisons in each of the triangles, and whether it's fair to compare the variant model scores of unigrams and n-grams as we do here.
I've been experimenting with a revised scoring mechanism today to establish a better common ground for comparing the variant scores of unigrams and higher-order n-grams (point 3 from my previous comment).
The following FST demonstrates this experiment:
The variant scores in this FST consist of a base component proportional to the number of tokens spanned (e.g. 1 for unigrams, 2 for bigrams, etc.). The values in the FST can be interpreted as costs to be minimised (shortest path). The fractional part after the decimal point encodes the variant score (in the range (0, 1.0)) as before, but inverted (1.0 - variant score). This means that if we have two unigrams with a perfect variant score (0 + 0), we get base cost 1 + 1 for the combination; if these same two tokens can be covered by a bigram, and this bigram also has a perfect variant score (0) on top of its base cost of 2, then the costs are identical and we have established some common ground for comparison. In that ideal case we'd have a tie, and only the LM could break it. In reality this rarely happens and one route or the other tends to be favoured.
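As a tiny sketch of the revised transition cost (my formulation of the scheme described above):

```python
def fst_cost(tokens_spanned, variant_score):
    """Cost of an FST transition under the revised scheme: an integer base
    equal to the number of input tokens spanned, plus the inverted variant
    score (variant scores lie in (0, 1.0], higher is better) as the
    fractional part. Lower cost is better."""
    return tokens_spanned + (1.0 - variant_score)

# Two perfect unigrams (1.0 + 1.0) cost exactly as much as one perfect
# bigram covering the same two tokens (2.0): a tie that only the LM breaks.
assert fst_cost(1, 1.0) + fst_cost(1, 1.0) == fst_cost(2, 1.0)
```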
I do seem to get somewhat better results with this approach, but it feels fairly ad hoc/contrived. There's also still the question of how to strike a balance between the language model and the variant model (point 1). I made some improvements in that regard too: I now use perplexity instead of the raw LM logprob, and map this perplexity to the range 1.0 (best) ~ 0.0, without forcing an actual lower bound like I did before (the worst perplexity found no longer gets mapped to 0, nor is the worst variant score mapped to 0 any longer). The actual score computation that combines both the LM and the variant model is a weighted geometric mean.
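In sketch form (the perplexity-to-score mapping below is my assumption; the text only states the target range, not the exact formula), the revised combination could look like this:

```python
import math

def perplexity(lm_logprob, n_tokens):
    """Perplexity from the LM logprob (assuming natural logs here)."""
    return math.exp(-lm_logprob / n_tokens)

def lm_score(ppl, best_ppl):
    """Map perplexity into (0, 1]: the lowest perplexity among the candidates
    maps to 1.0, worse candidates decay towards (but never reach) 0, i.e. no
    forced lower bound. This particular mapping is an assumption."""
    return best_ppl / ppl

def combine(lm_norm, variant_norm, lm_weight=1.0, variant_weight=1.0):
    """Weighted geometric mean of the two component scores."""
    total = lm_weight + variant_weight
    return (lm_norm ** lm_weight * variant_norm ** variant_weight) ** (1.0 / total)
```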
At the end of July, I also implemented an alternative approach (enabled using --context-weight) to consider local context information: rather than applying the language model over the top-x full candidate solutions, I consider context at a much earlier stage and pull on the existing variant scores in a rescoring stage that takes input context into account. In this situation, unlike in the earlier implementation, the context is non-normalised input context (i.e. prior to any further handling by analiticcl), which is a disadvantage. The advantage of this approach is that the context score is computed over a local window, which is more manageable. The remaining task is then just to look up the shortest path in the FST again, but with the context scores already integrated.
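A rough, hypothetical Python sketch of this local rescoring idea (the function name, signature and interpolation formula are my assumptions; only the general idea and the --context-weight parameter come from the above):

```python
import math

def rescore_with_context(variant_score, variant, left_token, right_token,
                         bigram_logprob, context_weight=1.0):
    """Pull an existing variant score towards how well the candidate fits
    between its immediate *input* neighbours (un-normalised context),
    before the shortest-path search over the FST."""
    tokens = variant.split()
    context_logprob = (bigram_logprob(left_token, tokens[0]) +
                       bigram_logprob(tokens[-1], right_token))
    context_score = math.exp(context_logprob)  # probability-like value in (0, 1]
    # Weighted interpolation between the variant score and the context score.
    return (variant_score + context_weight * context_score) / (1.0 + context_weight)
```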
Both approaches are currently available in the implementation, for testing.
Search mode is the error detection stage of analiticcl: running text is passed as input, and analiticcl attempts to identify and correct the parts of the text that require correction, regardless of where in the text they occur or whether they concern single words or larger n-grams (i.e. run-on and split errors should be supported). This is the stage where context plays an important role (unlike in analiticcl's simpler query mode).
A first version of a search mode has been implemented as follows:
However, this still requires extensive testing and tweaking, as the current results remain sub-optimal.