unfoldingWord / wordMAP

Multilingual Word Alignment Prediction
https://wordmap.netlify.com

Word occurrences are suggested in the wrong order when aligned to dissimilar n-grams. #50

Open da1nerd opened 5 years ago

da1nerd commented 5 years ago

This is different from the issue fixed here https://github.com/unfoldingWord/wordMAP/pull/49.

In this case the source tokens are not similar.

EDIT: see better screenshot here.

Screen Shot 2019-10-21 at 8 49 14 AM

da1nerd commented 5 years ago

Out of order scenario

The token occurrences are out of order when suggested for two unrelated source tokens. Because the source tokens are different, this cannot be solved by Alignment Relative Occurrence.

This is tricky because we can't know for certain whether the suggestion is completely invalid or just needs to be used elsewhere. Therefore, we would need to increase/decrease the confidence by some arbitrary value, but if that value is too strong it could have negative propagating effects.

Distance scenario

A large distance exists between the target tokens within the target sentence. Because wordMAP by definition operates under the assumption of contiguous n-grams, we can accurately calculate the n-gram relative token distance.

da1nerd commented 5 years ago

n-gram Relative Token Distance

Sample data:

T = 7
x = 2
y = 4

The distance between the tokens is abs(x-y) - 1. We subtract one because two tokens next to each other have a distance of 0.

d = abs(2-4) - 1 = 1

To normalize the above value we need the maximum distance within the sentence. This is easily calculated by performing the above calculation on the first and last positions: abs(0-(T-1)) - 1, or just T-2.

D = 7 - 2 = 5

Finally, we are able to calculate the distance ratio d/D.

r = 1 / 5 = 0.2
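The calculation above can be sketched in TypeScript (the function name is illustrative, not wordMAP's actual API):

```typescript
// Sketch of the n-gram relative token distance described above.
// T is the sentence length; x and y are token positions (0-based).
function relativeTokenDistance(x: number, y: number, T: number): number {
  // adjacent tokens have a distance of 0, hence the subtraction
  const d = Math.abs(x - y) - 1;
  // maximum possible distance in the sentence: abs(0 - (T - 1)) - 1 = T - 2
  const D = T - 2;
  return d / D;
}

console.log(relativeTokenDistance(2, 4, 7)); // 0.2
```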

Interpretation

da1nerd commented 5 years ago

The above algorithm was implemented; however, it didn't solve the problem, because wordMAP only supports contiguous tokens. So the algorithm is redundant (for now).

We still need to address the out of order word occurrences. Here's a better representation of the problem.

See how the word "God" is not suggested in order of occurrence. image

da1nerd commented 5 years ago

We could enforce the order of occurrence when we do the final sorting of predictions. This would basically give order of occurrence a trump card. It would not, however, affect the overall score of the suggestion (a suggestion is composed of individual alignment predictions), so this shouldn't cause valid suggestions to be lost.

My one concern with this approach: do we want to enforce the order of occurrence in the predictions, rather than finding some way to give it a weighted score, so that we are simply influencing the results instead of hitting them with a hammer?
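To make the "trump card" idea concrete, here's a rough sketch of an order-of-occurrence check that could run during the final sorting. The interface is hypothetical and does not match wordMAP's real Prediction API:

```typescript
// Hypothetical shape; wordMAP's actual prediction type differs.
interface AlignmentPrediction {
  targetWord: string;
  occurrence: number;     // 1-based occurrence index of targetWord
  targetPosition: number; // position in the target sentence
  confidence: number;
}

// A prediction set is valid only if, for each target word, its
// occurrences appear in increasing order by target position.
function occurrencesInOrder(predictions: AlignmentPrediction[]): boolean {
  const byPosition = [...predictions].sort(
    (a, b) => a.targetPosition - b.targetPosition
  );
  const lastSeen = new Map<string, number>();
  for (const p of byPosition) {
    const prev = lastSeen.get(p.targetWord);
    if (prev !== undefined && p.occurrence < prev) return false;
    lastSeen.set(p.targetWord, p.occurrence);
  }
  return true;
}
```

A check like this could either hard-filter suggestions or feed into a weighted score, depending on which of the two approaches above we pick.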

da1nerd commented 5 years ago

Perhaps we could add a switch that allows turning on enforcement of the order of occurrence instead of hardcoding it.

da1nerd commented 4 years ago

After some tinkering, I've determined WordMAP is actually working as expected. The example problem above occurs with the alignment memory Θεὸς=the God. Because alignment memory automatically gets the highest prediction score, we are forcing the out-of-order use of "God". But if, for example, the memory was simply Θεὸς=God, we see everything in order.

Alignment memory has a compounding effect, so if we had a lot of alignment memory but the overall weight was bent towards Θεὸς=God, we'll get the expected results. If, however, the overall weight was bent towards Θεὸς=the God, we get the "bug" above.

Conclusion

This isn't a bug at all, but the nature of WordMAP: the results are influenced by the supplied alignment memory. The only way to fix this would be to take away the trump card given to alignment memory. Perhaps a user-configurable weight could be introduced to dampen the power of alignment memory and allow the machine predictions to have an effect.
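As a sketch of the dampening idea, a linear blend of the machine score and the memory score would look like the following. The weight parameter is hypothetical; wordMAP does not currently expose such a knob:

```typescript
// Hypothetical blend: memoryWeight = 1 reproduces the current
// trump-card behavior, memoryWeight = 0 ignores alignment memory.
function combinedScore(
  machineScore: number, // score from the prediction algorithms
  memoryScore: number,  // score contributed by alignment memory
  memoryWeight: number  // user-configurable dampening factor in [0, 1]
): number {
  return machineScore * (1 - memoryWeight) + memoryScore * memoryWeight;
}
```

With a weight below 1, strong machine predictions could overrule weak or ambiguous alignment memory instead of always losing to it.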

da1nerd commented 4 years ago

@PhotoNomad0 :point_up:

PhotoNomad0 commented 4 years ago

@neutrinog - maybe if I posted the old algorithm suggestions for comparison, it would be more obvious that there is a problem. I don't think there is alignment memory where Θεὸς=the God. The old algorithm is doing a much better job on this verse:

Screen Shot 2019-10-28 at 9 42 27 AM

PhotoNomad0 commented 4 years ago

Maybe there is a case where Θεὸς=the God, will check the ~~csv export~~ alignments. It still seems that it would map to the most common usage.

PhotoNomad0 commented 4 years ago

OK, maybe this is the issue - I found three cases where ὁ and Θεὸς are combined (all the alignments made for Θεὸς). So shouldn't wordMAP suggest they be combined?

```json
{
  "topWords": [
    { "word": "ὁ", "strong": "G35880", "lemma": "ὁ", "morph": "Gr,EA,,,,NMS,", "occurrence": 1, "occurrences": 1 },
    { "word": "Θεὸς", "strong": "G23160", "lemma": "θεός", "morph": "Gr,N,,,,,NMS,", "occurrence": 1, "occurrences": 1 }
  ],
  "bottomWords": [
    { "word": "God", "occurrence": 1, "occurrences": 1, "type": "bottomWord" }
  ]
}
```

PhotoNomad0 commented 4 years ago

So the alignment memory should be ὁ Θεὸς=the God

PhotoNomad0 commented 4 years ago

There is also a case where ὁ Θεὸς=God in 12:26 (the current verse).

PhotoNomad0 commented 4 years ago

Summary: @neutrinog found an instance where Θεὸς is aligned to "the God", so it is a valid suggestion, but the old algorithm did better.

da1nerd commented 4 years ago

I think we could deal with the out of order occurrences after the predictions are generated and when the engine moves into building the suggestion. At that point we could insert certain rules like keeping the word occurrences in order.

To illustrate here's some handy ASCII art:

[input]->[generate index]->[run prediction algorithms]->[generate suggestion]

The last step above is where we'd enforce the order of occurrence. Previously I had been trying to do so in the algorithms, which wasn't working.

da1nerd commented 4 years ago

This is the issue I'm running into with the current AlignmentPosition algorithm. This isn't really meant to make sense to anyone but me, but basically, because of how the numbers are distributed, the closest pair of numbers is out of order.

image

da1nerd commented 4 years ago

I ended up solving this with what may not be the most elegant solution, but it works for now and the performance hit isn't noticeable at the moment. After scoring all of the predictions, the engine will selectively build out a suggestion. During this process it will monitor word occurrences and discard any suggestions that produce anything out of order.

In most situations this should complete within a reasonable amount of time. However, it's theoretically possible this could add a lot of time to prediction.
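The pass described above might look roughly like this (assumed types and a greedy strategy; not wordMAP's actual implementation):

```typescript
// Hypothetical shape; wordMAP's actual prediction type differs.
interface ScoredPrediction {
  targetWord: string;
  occurrence: number;     // 1-based occurrence index of targetWord
  targetPosition: number; // position in the target sentence
  score: number;
}

// Greedily build a suggestion from the best-scoring predictions,
// discarding any prediction that would place a later occurrence of
// a word before an earlier one.
function buildSuggestion(predictions: ScoredPrediction[]): ScoredPrediction[] {
  const chosen: ScoredPrediction[] = [];
  const byScore = [...predictions].sort((a, b) => b.score - a.score);
  for (const candidate of byScore) {
    const next = [...chosen, candidate].sort(
      (a, b) => a.targetPosition - b.targetPosition
    );
    // verify occurrences of each word remain in increasing order
    const last = new Map<string, number>();
    let valid = true;
    for (const p of next) {
      const prev = last.get(p.targetWord);
      if (prev !== undefined && p.occurrence <= prev) {
        valid = false;
        break;
      }
      last.set(p.targetWord, p.occurrence);
    }
    if (valid) chosen.push(candidate);
  }
  return chosen.sort((a, b) => a.targetPosition - b.targetPosition);
}
```

Checking the whole partial suggestion on each insertion is what makes this potentially slow in theory, as noted above, even though it's fine in practice.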

da1nerd commented 4 years ago

Word occurrence order has received a lot of attention in https://github.com/unfoldingword/translationcore/issues/6237. This issue is now redundant/irrelevant.