Improvements for aligning Greek to Greek.

da1nerd commented 5 years ago

Some resources are aligned to different Greek texts. Because translationCore is tied to a specific Greek text, those other resources cannot be used in it. Aligning the different Greek sources to one another could provide a "key" that would enable resources based on a different Greek text to be used inside translationCore.

Resources could be converted using these "keys" in a separate CLI (not wordMAP).

translationCore uses the UGNT Greek text. We can start by testing an alignment of the Westcott-Hort (WH) Greek text to the UGNT.

The first step will be to collect a sample of the WH text. Potential resources:

da1nerd commented 5 years ago

Quoted from @klappy:

So there is a growing problem of different people anchoring to different Greek sources. We use the UGNT, others have used WH (Westcott-Hort), others NA (Nestle-Aland), and some the TR (Textus Receptus). In theory we could use WordMAP to align all of these to the UGNT.

The most helpful feature would be matching on attributes not only within the same language in the memory, but across languages. In other words, if most of the words in a verse are the same between the two languages, but a few differ in spelling or word order, we can take language A and compare its attributes to those of language B. For example, lemma fallback would apply not only within the same language but also against the other language: different spellings, same lemma. This could potentially grow into more features for parts of speech or morphology as well.
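A minimal sketch of the cross-text lemma fallback described above, assuming each token carries a surface form and a lemma. The `Token` class and `align_verse` function are hypothetical names for illustration, not part of wordMAP's API, and the greedy first-match strategy stands in for wordMAP's actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str   # surface form as it appears in the edition
    lemma: str  # dictionary form (assumed to be present on every token)


def align_verse(source, target):
    """Pair tokens of one verse across two editions.

    Pass 1 pairs tokens whose surface forms are identical.
    Pass 2 pairs the leftover tokens whose lemmas are identical.
    Returns sorted (source_index, target_index, rule) triples.
    """
    pairs = []
    used_targets = set()
    passes = (("text", lambda t: t.text), ("lemma", lambda t: t.lemma))
    for rule, key in passes:
        matched_sources = {s for s, _, _ in pairs}
        for i, s_tok in enumerate(source):
            if i in matched_sources:
                continue
            for j, t_tok in enumerate(target):
                if j not in used_targets and key(s_tok) == key(t_tok):
                    pairs.append((i, j, rule))
                    used_targets.add(j)
                    break
    return sorted(pairs)
```

Tokens that match on neither surface form nor lemma stay unaligned, which leaves them for a human aligner to resolve.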

Then we could allow Greek scholars to use tC’s WordAlignment tool to import their Greek text of choice and align it to the UGNT; WordMAP would make suggestions while they align verse by verse, and they would confirm those suggestions. Once we have the exported explicit alignments between each of these texts and the UGNT, we can map, by proxy and with a high degree of confidence, a Gateway Language (GL) that was previously aligned to something other than the UGNT. Once we map GL => (WH/NA/TR) => UGNT, we can create an aligned USFM file importable by tC.
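The GL => (WH/NA/TR) => UGNT mapping amounts to composing two alignments. The sketch below assumes each alignment for a verse is a dictionary from token index to a list of token indices in the other text; that representation and the `compose_alignments` name are illustrative assumptions, not the export format tC actually uses.

```python
def compose_alignments(gl_to_other, other_to_ugnt):
    """Map GL token indices to UGNT token indices via an intermediate Greek text.

    gl_to_other:   {gl_index: [other_index, ...]}
    other_to_ugnt: {other_index: [ugnt_index, ...]}
    Returns {gl_index: [ugnt_index, ...]} with duplicates removed and order kept.
    """
    gl_to_ugnt = {}
    for gl_index, other_indices in gl_to_other.items():
        ugnt_indices = []
        for other_index in other_indices:
            # An intermediate token with no UGNT counterpart contributes nothing.
            for ugnt_index in other_to_ugnt.get(other_index, []):
                if ugnt_index not in ugnt_indices:
                    ugnt_indices.append(ugnt_index)
        gl_to_ugnt[gl_index] = ugnt_indices
    return gl_to_ugnt
```

A GL token that ends up with an empty list has lost its anchor in the conversion and would need manual alignment against the UGNT.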

da1nerd commented 5 years ago

Additional considerations: If the source and target texts both contain lemmas in their tokens, use these directly for matching. We would not normally do this for alignment, since different languages do not normally share lemmas. So wordMAP will need a switch that indicates the languages are the same, or else some fine-grained options that apply regardless of language.
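One possible shape for that switch, as a rough sketch: an explicit `same_language` setting that wins when given, with auto-detection from the tokens otherwise. The function name, the parameter, and the dictionary token shape are all assumptions made for illustration; none of them exist in wordMAP.

```python
from typing import Mapping, Optional, Sequence


def should_match_on_lemma(
    source: Sequence[Mapping[str, str]],
    target: Sequence[Mapping[str, str]],
    same_language: Optional[bool] = None,
) -> bool:
    """Decide whether source and target lemmas can be compared directly.

    same_language=True/False is an explicit override.
    same_language=None falls back to detection: both texts must be
    non-empty and every token must carry a non-empty "lemma".
    """
    if same_language is not None:
        return same_language
    if not source or not target:
        return False
    return all(tok.get("lemma") for tok in list(source) + list(target))
```

Detection only confirms that lemmas are present, not that the two texts share a lemma inventory, so the explicit override remains the safer choice when the language pair is known.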

unfoldingWord / wordMAP

Improvements for aligning Greek to Greek. #47