Open pirolen opened 1 year ago
I wonder if this is the right way to load the confusables file.
Yes, it is.
It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, and of how the confusable penalty or promotion works (i.e. what do we gain by these).
Good and valid questions indeed. First, I should perhaps say that I don't think this confusable weighting functionality has really been used in practice yet, so there's no proper evaluation of it. Though I implemented it, we never used it in the Golden Agents projects for which analiticcl was developed. I can, of course, explain how it is implemented:
After all variants are scored in the regular way using the distance metrics and possibly frequency information (a log-linear combination of various components), an extra rescoring is performed if a confusable list is provided. This rescoring is meant to give slight bonuses or penalties to the scores whenever certain confusables occur (with a certain confusable weight). In the documentation I write about this:
Weights greater than 1.0 are given preference in the score weighting, weights smaller than 1.0 imply a penalty. When multiple confusable patterns match, the product of their weights is taken. The final weight is applied to the whole candidate score, so weights should be values fairly close to 1.0 in order not to introduce too large bonuses/penalties.
It is a bit hard to predict how this plays out in actual use-cases, the challenge is always in tweaking the weights so there is a balance between the confusable weights and the weights in the main score function (of which these are not a part but applied after-the-fact to that score as a whole). The only way to find out is to experiment with it.
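To make the rescoring step concrete, here is a minimal self-contained sketch of the idea described above (this is not analiticcl's actual code; the function and variable names are made up for illustration). Each candidate's base score from the main score function is multiplied by the product of the weights of whatever confusable patterns matched it:

```python
from math import prod

def rescore(candidates, matched_weights):
    """Apply confusable weights after the regular scoring pass.

    candidates: dict mapping variant -> base score from the main
                (log-linear) score function
    matched_weights: dict mapping variant -> list of weights of the
                     confusable patterns that matched that variant
                     (>1.0 = bonus, <1.0 = penalty)
    """
    rescored = {}
    for variant, score in candidates.items():
        # When multiple confusable patterns match, the product of
        # their weights is taken and applied to the whole score.
        weight = prod(matched_weights.get(variant, []), start=1.0)
        rescored[variant] = score * weight
    return rescored

# Hypothetical example: one candidate matches a confusable
# pattern with weight 1.1, the other matches nothing.
scores = {"тажде": 0.80, "такие": 0.82}
matches = {"тажде": [1.1]}
final = rescore(scores, matches)  # 'тажде' now ranks above 'такие'
```

Because the weight multiplies the whole score, even a modest weight like 1.1 can reorder candidates whose base scores are close, which is why the documentation recommends keeping weights near 1.0.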
There is one relevant option which is not properly documented yet: the --early-confusables parameter, which, when set, causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and maximum candidate counts. The default is to first prune the variant list and only then apply the confusable weighting, as that is more performant (far fewer candidates to consider), but the other way round would of course be better for accuracy!
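The accuracy/performance trade-off between the two orderings can be sketched as follows (a toy illustration, not analiticcl's implementation): with the default late rescoring, a candidate that would only win thanks to a confusable bonus can be pruned before that bonus is ever applied.

```python
def prune(cands, k):
    """Keep only the k best-scoring candidates."""
    return dict(sorted(cands.items(), key=lambda kv: -kv[1])[:k])

def rescore(cands, bonus):
    """Multiply each score by its confusable weight (1.0 if no pattern matched)."""
    return {v: s * bonus.get(v, 1.0) for v, s in cands.items()}

cands = {"a": 0.90, "b": 0.85, "c": 0.80}
bonus = {"c": 1.2}  # a confusable pattern promotes candidate 'c'

# Default: prune first (cheaper), then rescore.
# 'c' is cut before its bonus can apply.
late = rescore(prune(cands, 2), bonus)

# --early-confusables: rescore everything first, then prune.
# 'c' is promoted to 0.96 and survives the cut.
early = prune(rescore(cands, bonus), 2)
```

The early variant has to rescore every candidate against the confusable list, which is why it is slower and not the default.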
Especially: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!
The confusable lists and weights are a more refined mechanism and can express various things that the alphabet can't (like context information, and variable weights), but it does introduce an extra level of complexity. The alphabet file is much more crude, but if your confusables are unambiguous enough to fit in there, then that might indeed be the preferred option. If it causes only more ambiguity though, then it's probably not a good idea.
> There is one relevant option which is not properly documented yet: the --early-confusables parameter, which, when set, causes analiticcl to rescore variants using the confusable list before pruning variants on things like score thresholds and max candidates.
I was trying to find this parameter for the Python objects but so far without success. Is it available?
Good point, I think it's not propagated to the Python binding yet. I'll add it.
This should now be fixed in v0.4.6; call model.set_confusables_before_pruning() to enable the parameter.
Thanks!
I'd like to use this parameter to achieve e.g. the following.
Suppose we know a number of historical sound change patterns, e.g. жд --> ж.
How should this be represented in the confusables file, e.g. similar as below?
=[aж]-[д]=[е] 1.1
or likely without the preceding (and trailing) context, which are not generic enough? I am not sure about the score in the second column either.
(sorry for the multiple edits)
... and is there a way to make patterns behave symmetrically, so that they apply to the counterpart cases as well?
I.e. to cover the 'vice versa' above, e.g. to get the pattern edit таже into тажде, or do I need to specify that separately, as an addition instead of the deletion?
I guess I have found it out, so e.g. this works well:
=[ж]-[д] 1.5
=[ж]+[д] 1.5
and the score depends on how the other scores are set, I guess. But 1.5 seems to return the desired lexemes well for my use case.
Thanks a lot for the implementation!
Great, I see you already figured it out! That indeed seems like the proper syntax, and you indeed need both explicitly. It will give a higher score to variants that had жд and lost the д, and to variants that have ж and add д, relative to the weighting of the other variants that do not exhibit such a pattern. Finding the proper score is always a bit of trial and error; 1.5 might even be a bit on the large side, as the weights are best kept fairly close to 1.0 in order not to have too big an influence.