da1nerd opened 5 years ago
The token occurrences are out of order when suggested for two unrelated source tokens. Because the source tokens are different, this cannot be solved by Alignment Relative Occurrence.
This is tricky because we can't know for certain whether the suggestion is completely invalid or just needs to be used elsewhere. Therefore, we need to increase/decrease the confidence by some arbitrary value, but if that adjustment is too strong it could have negative propagating effects.
A large distance exists between the target tokens within the target sentence. Because wordMAP by definition operates under the assumption of contiguous n-grams, we can accurately calculate the n-gram relative token distance.
Given a sentence of length T that is greater than 0, and two token positions x and y indexed from 0, we can measure the distance between x and y. Sample data:
T = 7
x = 2
y = 4
The distance between the tokens is abs(x-y) - 1. We subtract one because two tokens next to each other have a distance of 0.

d = abs(2-4) - 1 = 1
To normalize the above value we need the maximum distance within the sentence. This is easily calculated by performing the above calculation on the first and last positions: abs(0-(T-1)) - 1, or just T-2.
D = 7 - 2 = 5
Finally, we are able to calculate the distance ratio d/D.
r = 1 / 5 = 0.2
- 0 indicates the tokens are right next to each other.
- 1 indicates the tokens are on opposite sides of the sentence.

The above algorithm was implemented; however, it didn't solve the problem, because wordMAP only supports contiguous tokens. So the algorithm is redundant (for now).
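For reference, the distance-ratio calculation described above can be sketched in TypeScript (the function name and signature are illustrative, not part of wordMAP's API):

```typescript
// Sketch of the normalized token-distance calculation described above.
// x and y are 0-indexed token positions; T is the sentence length (> 2).
function distanceRatio(x: number, y: number, T: number): number {
  const d = Math.abs(x - y) - 1; // adjacent tokens have a distance of 0
  const D = T - 2;               // maximum distance: abs(0 - (T - 1)) - 1
  return d / D;
}

console.log(distanceRatio(2, 4, 7)); // → 0.2 for the sample data above
```

With the sample data (T = 7, x = 2, y = 4) this reproduces r = 1/5 = 0.2.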
We still need to address the out of order word occurrences. Here's a better representation of the problem.
See how the word "God" is not suggested in order of occurrence.
We could enforce the order of occurrence when we do the final sorting of predictions. This will basically give order of occurrence a trump card. It would not, however, affect the overall score of the suggestion (a suggestion is composed of individual alignment predictions), so this shouldn't cause valid suggestions to be lost.
My one concern with this approach: do we want to enforce the order of occurrence in the predictions, rather than finding some way to give it a weighted score, so that we are simply influencing the results instead of hitting them with a hammer?
Perhaps we could add a switch that allows turning on enforcing order of occurrence instead of hardcoding it.
After some tinkering, I've determined WordMAP is actually working as expected.
The example problem above occurs with the alignment memory Θεὸς=the God. Because alignment memory automatically gets the highest prediction score, we are forcing the out-of-order use of God. But if, for example, the memory were simply Θεὸς=God, we would see everything in order.
Alignment memory has a compounding effect, so if we had a lot of alignment memory but the overall weight was bent towards Θεὸς=God, we'll get the expected results. If, however, the overall weight was bent towards Θεὸς=the God, we get the "bug" above.
This isn't a bug at all, but the nature of wordMAP: the results are influenced by the input alignment memory. The only way to fix this would be to take away the trump card given to alignment memory. Perhaps a user-configurable weight could be introduced to dampen the power of alignment memory and allow the machine predictions to have an effect.
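As a rough sketch of that idea, the memory score could be blended with the machine score instead of trumping it. The names and the linear blend below are assumptions for illustration, not wordMAP's actual scoring:

```typescript
// Hypothetical blending of alignment-memory and machine prediction scores.
// memoryWeight is a user-configurable dampening factor in [0, 1]:
//   1 = memory keeps its trump card, 0 = memory is ignored entirely.
function blendedScore(
  machineScore: number, // score from the prediction algorithms
  memoryScore: number,  // score contributed by alignment memory (0 if none)
  memoryWeight: number
): number {
  return (1 - memoryWeight) * machineScore + memoryWeight * memoryScore;
}
```

With a weight below 1, a strong machine prediction (e.g. the in-order use of God) could outrank a memory-backed but out-of-order alternative.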
@PhotoNomad0 :point_up:
@neutrinog - maybe if I posted the old algorithm suggestions for comparison, it would be more obvious that there is a problem. I don't think there is alignment memory where Θεὸς=the God. The old algorithm is doing a much better job on this verse:
Maybe there is a case where Θεὸς=the God; I will check the ~csv export~ alignments. Still, it seems that it would map to the most common usage.
OK, maybe this is the issue - I found three cases where ὁ and Θεὸς are combined (all the alignments made for Θεὸς). So shouldn't wordMAP suggest they be combined?

```json
{
  "topWords": [
    { "word": "ὁ", "strong": "G35880", "lemma": "ὁ", "morph": "Gr,EA,,,,NMS,", "occurrence": 1, "occurrences": 1 },
    { "word": "Θεὸς", "strong": "G23160", "lemma": "θεός", "morph": "Gr,N,,,,,NMS,", "occurrence": 1, "occurrences": 1 }
  ],
  "bottomWords": [
    { "word": "God", "occurrence": 1, "occurrences": 1, "type": "bottomWord" }
  ]
}
```
So the alignment memory should be ὁ Θεὸς=the God. There is also a case where ὁ Θεὸς=God in 12:26 (the current verse).
Summary: @neutrinog found an instance where Θεὸς is aligned to the God, so it is a valid suggestion, but the old algorithm did better.
I think we could deal with the out of order occurrences after the predictions are generated and when the engine moves into building the suggestion. At that point we could insert certain rules like keeping the word occurrences in order.
To illustrate here's some handy ASCII art:
[input]->[generate index]->[run prediction algorithms]->[generate suggestion]
The last step above is where we'd enforce the order of occurrence. Previously I had been trying to do so in the algorithms which wasn't working.
This is the issue I'm running into with the current AlignmentPosition algorithm. This isn't really meant to make sense to anyone but me, but basically, because of how the numbers are distributed, the closest pair of numbers is out of order.
I ended up solving this with what may not be the most elegant solution, but it works for now and the performance hit isn't noticeable at the moment. After scoring all of the predictions the engine will selectively build out a suggestion. During this process it will monitor word occurrences, and discard any suggestions that produce anything out of order.
In most situations this should complete within a reasonable amount of time. However, it's theoretically possible this could add a lot of time to prediction.
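A minimal sketch of that occurrence-order guard (the types and function below are hypothetical illustrations, not wordMAP's actual internals):

```typescript
// Hypothetical shape of a target token within a candidate suggestion.
interface AlignedToken {
  text: string;       // the target word, e.g. "God"
  position: number;   // index within the target sentence
  occurrence: number; // 1-based occurrence of this word in the sentence
}

// Returns true when, for every repeated target word, the occurrences are
// used in the same order as they appear in the sentence. A suggestion
// failing this check would be discarded during suggestion building.
function occurrencesInOrder(targets: AlignedToken[]): boolean {
  const byPosition = [...targets].sort((a, b) => a.position - b.position);
  const lastSeen = new Map<string, number>();
  for (const t of byPosition) {
    const prev = lastSeen.get(t.text) ?? 0;
    if (t.occurrence <= prev) return false; // out of order or duplicated
    lastSeen.set(t.text, t.occurrence);
  }
  return true;
}
```

In the "God" example above, a suggestion that places occurrence 2 before occurrence 1 would fail this check.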
Word occurrence order has received a lot of attention in https://github.com/unfoldingword/translationcore/issues/6237. This issue is redundant/irrelevant now.
This is different from the issue fixed here https://github.com/unfoldingWord/wordMAP/pull/49.
In this case the source tokens are not similar.
EDIT: see better screenshot here.