proycon / analiticcl

an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
GNU General Public License v3.0

Loading the confusables file #19

Open pirolen opened 1 year ago

pirolen commented 1 year ago

I wonder if this is the right way to load the confusables file:

```python
m = build_variant_model(alphabet_file, weightsconfig=ws1)
m.read_confusablelist(confusables_file)
```

It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, and of how the confusables penalty or promotion works (i.e. what we gain by these).

Especially: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!

proycon commented 1 year ago

> I wonder if this is the right way to load the confusables file

Yes, it is.

> It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, and of how the confusables penalty or promotion works (i.e. what we gain by these).

Good and valid questions indeed. First, I should perhaps say that I don't think this confusable weighting functionality has really been used in practice yet, so there is no proper evaluation of it. Though I implemented it, we never used it in the Golden Agents projects for which analiticcl was developed. I can, of course, explain how it is implemented:

After all variants are scored in the regular way using the distance metrics and possibly frequency information (a log-linear combination of various components), an extra rescoring pass is performed if a confusable list is provided. This rescoring is meant to give slight bonuses or penalties to the scores whenever certain confusable patterns occur (each with a certain confusable weight). The documentation says the following about this:

> Weights greater than 1.0 are given preference in the score weighting; weights smaller than 1.0 imply a penalty. When multiple confusable patterns match, the product of their weights is taken. The final weight is applied to the whole candidate score, so weights should be values fairly close to 1.0 in order not to introduce too large bonuses/penalties.
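To make the arithmetic concrete, here is a tiny illustrative sketch in Python (analiticcl itself is Rust; `rescore` and its arguments are hypothetical names, and the pattern matching itself is left out):

```python
from math import prod

def rescore(candidate_score: float, matched_weights: list[float]) -> float:
    """Scale an already-computed candidate score by the product of the
    weights of all matching confusable patterns (1.0 if none match)."""
    return candidate_score * prod(matched_weights)

print(rescore(0.8, [1.1]))       # ~0.88: a slight promotion
print(rescore(0.8, [1.1, 0.9]))  # two patterns matched: 0.8 * 1.1 * 0.9 = 0.792
```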

It is a bit hard to predict how this plays out in actual use cases; the challenge is always in tweaking the weights so there is a balance between the confusable weights and the weights in the main score function (of which they are not a part: they are applied after the fact to that score as a whole). The only way to find out is to experiment with it.

There is one relevant option which is not properly documented yet: an --early-confusables parameter which, when set, causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and maximum candidate counts. The default is to first prune the variant list and only then apply the confusable weighting, as that is more performant (far fewer candidates to consider), but the other way round would of course be better for accuracy!
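To illustrate why the ordering matters, a self-contained toy sketch (all names, scores and thresholds below are made up for illustration, not analiticcl internals):

```python
from math import prod

def prune(variants, threshold=0.5, max_candidates=2):
    # discard low-scoring candidates, keep the best few
    kept = [v for v in variants if v["score"] >= threshold]
    return sorted(kept, key=lambda v: -v["score"])[:max_candidates]

def weighted(v):
    # apply the product of matched confusable weights to the score
    return {**v, "score": v["score"] * prod(v["weights"])}

variants = [
    {"text": "тажде", "score": 0.48, "weights": [1.2]},  # bonus-eligible, but below threshold
    {"text": "таже", "score": 0.60, "weights": []},
]

# default order: prune first, then weight -- "тажде" is discarded before its bonus applies
print([weighted(v)["text"] for v in prune(variants)])               # ['таже']

# early confusables: weight first, then prune -- 0.48 * 1.2 = 0.576 survives the threshold
print([v["text"] for v in prune([weighted(v) for v in variants])])  # ['таже', 'тажде']
```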

> Especially: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!

The confusable lists and weights are a more refined mechanism and can express various things that the alphabet can't (like context information and variable weights), but they do introduce an extra level of complexity. The alphabet file is much cruder, but if your confusables are unambiguous enough to fit in there, then that might indeed be the preferred option. If it only causes more ambiguity, though, then it's probably not a good idea.
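For contrast, a hedged sketch of what each file expresses (the rows below are made-up illustrations, assuming the TSV formats described in the documentation):

```
# alphabet.tsv: one symbol per line; tab-separated variants on the
# same line are treated as one and the same character, unconditionally
# and without any penalty
s	ſ

# confusables.tsv: an edit-script pattern plus a weight; ſ→s remains
# a real edit, its candidate score is merely scaled by 1.1 whenever
# the pattern matches
-[ſ]+[s]	1.1
```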

pirolen commented 6 months ago

> There is one relevant option which is not properly documented yet: an --early-confusables parameter which, when set, causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and maximum candidate counts.

I was trying to find this parameter on the Python objects, but so far without success. Is it available?

proycon commented 6 months ago

Good point, I think it's not propagated to the Python binding yet. I'll add it.

proycon commented 6 months ago

This should now be fixed in v0.4.6; call model.set_confusables_before_pruning() to enable the parameter.
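For reference, a minimal usage sketch, assuming the call otherwise follows the README's usual load-then-build pattern (the file paths and the query word are placeholders):

```python
from analiticcl import VariantModel, Weights, SearchParameters

model = VariantModel("alphabet.tsv", Weights(), debug=False)
model.read_lexicon("lexicon.tsv")
model.read_confusablelist("confusables.tsv")
model.set_confusables_before_pruning()  # new in v0.4.6: rescore before pruning
model.build()
results = model.find_variants("тажде", SearchParameters(max_edit_distance=3))
```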

pirolen commented 6 months ago

Thanks!

I'd like to use this parameter to achieve, e.g., the following. Suppose we know a number of historical sound change patterns, e.g. жд → ж.

So then, if using this method, how should this be represented in the confusables file? Something like the below?

```
=[aж]-[д]=[е]	1.1
```

Or likely without the preceding (and trailing) context, which is not generic enough? I am not sure about the score in the 2nd column either.

(sorry for the multiple edits)

pirolen commented 6 months ago

... and is there a way to make patterns behave symmetrically, so that they apply to the counterpart cases as well? I.e. to cover the 'vice versa' above, e.g. to get the pattern edit таже into тажде, or do I need to specify that separately, as an addition instead of the deletion?

pirolen commented 6 months ago

I guess I have figured it out; e.g. this works well:

```
=[ж]-[д]	1.5
=[ж]+[д]	1.5
```

The score depends on how the other scores are set, I guess, but 1.5 seems to return the desired lexemes well for my use case.

Thanks a lot for the implementation!

proycon commented 6 months ago

Great, I see you already figured it out! That is indeed the proper syntax; you do need both directions explicitly. It will give a higher score to variants that had жд and lost the д, and to variants that have ж and gained д, relative to the weighting of other variants that do not exhibit such a pattern. Finding the proper score is always a bit of trial and error; 1.5 might even be a bit on the large side, as the weights are best kept fairly close to 1.0 in order not to have too big an influence.