This may require running many, many alignments, so two things should probably be done to improve SMT performance:
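Whichever optimizations are chosen, the dominant cost is the alignment fan-out itself. Below is a minimal sketch of parallelizing and caching those runs; `align_and_score` is a hypothetical stand-in for a real word aligner (e.g. fast_align or eflomal invoked as a subprocess), not an existing API in this project.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def align_and_score(candidate: Path, target: Path) -> float:
    """Hypothetical: align the (candidate, target) verse pairs and return
    an aggregate alignment-quality score (higher is better)."""
    raise NotImplementedError("swap in fast_align / eflomal / etc. here")


def score_candidates(candidates: list[Path], target: Path, cache: Path) -> dict[str, float]:
    # Reuse previously computed scores so reruns only align new candidates.
    scores: dict[str, float] = json.loads(cache.read_text()) if cache.exists() else {}
    todo = [c for c in candidates if c.name not in scores]
    # Fan the remaining alignment jobs out across CPU cores.
    with ProcessPoolExecutor() as pool:
        for cand, score in zip(todo, pool.map(align_and_score, todo, [target] * len(todo))):
            scores[cand.name] = score
    cache.write_text(json.dumps(scores, indent=2))
    return scores
```

With the scores cached, picking the 3-5 best-aligned translations becomes a cheap sort rather than a repeated alignment job.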
This issue covers only the research into whether this is valuable. If it is, we can walk through the implementation steps.
As this paper proposes, we may be able to set up multiple languages in one model. Instead of going through a lot of work to find the best single source and training 5-10 models, we may be able to choose the 3-5 best-aligned translations and train on all of them together. Then we can try each one as the source when checking which source works best. That should dramatically reduce GPU training time.
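A minimal sketch of how the pooled training corpus might be built, assuming the texts are verse-aligned with one verse per line; the `<2tag>` source-token scheme and the file names are illustrative assumptions, not the project's actual format:

```python
from pathlib import Path


def build_multisource_corpus(
    sources: dict[str, Path], target: Path, out_src: Path, out_trg: Path
) -> None:
    """Pool several source translations against one target into a single
    tagged bitext, so one model learns from all of them."""
    target_verses = target.read_text(encoding="utf-8").splitlines()
    with out_src.open("w", encoding="utf-8") as fs, out_trg.open("w", encoding="utf-8") as ft:
        for tag, path in sources.items():
            for src_verse, trg_verse in zip(
                path.read_text(encoding="utf-8").splitlines(), target_verses
            ):
                if src_verse and trg_verse:  # skip verses missing in either text
                    fs.write(f"<2{tag}> {src_verse}\n")  # e.g. "<2en_ULB> In the beginning ..."
                    ft.write(trg_verse + "\n")


# Example (file names illustrative):
# build_multisource_corpus(
#     {"en_ULB": Path("en_ulb.txt"), "es_RVR": Path("es_rvr.txt"), "fr_LSG": Path("fr_lsg.txt")},
#     Path("target.txt"), Path("train.src"), Path("train.trg"))
```

At inference time the same tags let us try each text as the source and keep whichever draft reads best, without training a separate model per source.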
These ideas have been moved here: https://docs.google.com/document/d/1SXWLj6FY89cowQJVO-XY6q5BpDNaHmiesYPQ_Wxo5q4/edit. They will be considered as the onboarding flow is worked out.
A key decision for translating is the selection of the source Bible text. The best source scripture may not be the one primarily being used as the basis for the manual translation. The "best" may be related to:
To choose the best source text, to optimize the translator's experience, and to minimize the computational and decision-making burden on the translator, I propose the following:
Now, let the user determine the best of the 3-5 by:
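However that user-facing comparison is surfaced, the backend step is the same: draft the same sample passage from each of the 3-5 candidates with the one pooled model and present the drafts side by side. A rough sketch, where `translate` is a hypothetical stand-in for the project's actual inference call:

```python
from pathlib import Path


def translate(model, tag: str, verses: list[str]) -> list[str]:
    """Hypothetical: run the pooled model on source verses carrying `tag`."""
    raise NotImplementedError


def draft_for_comparison(model, candidates: dict[str, list[str]], out_dir: Path) -> None:
    # One draft file per candidate source, over the same sample passage,
    # so the user can compare them verse by verse.
    out_dir.mkdir(parents=True, exist_ok=True)
    for tag, verses in candidates.items():
        draft = translate(model, tag, verses)
        (out_dir / f"draft_{tag}.txt").write_text("\n".join(draft), encoding="utf-8")
```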