johnml1135 opened 1 month ago
@johnml1135 It seems to me like "only train on key terms, then generate a draft for my first book" is a valid but very unusual use-case. It would have to be a new project in an NLLB language.
We could filter the key terms by book/chapter. Each key term has a list of the verses it occurs in.
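A minimal sketch of that kind of filter, assuming key terms are keyed by a list of (book, chapter) references. The function and data shapes here are illustrative, not the actual Serval/Machine API:

```python
def filter_key_terms(key_terms, selected_chapters):
    """Keep only key terms that occur in the chapters selected for training.

    key_terms: dict mapping a term to a list of (book, chapter) references.
    selected_chapters: set of (book, chapter) pairs included in the build.
    """
    return {
        term: refs
        for term, refs in key_terms.items()
        if any(ref in selected_chapters for ref in refs)
    }

# Illustrative data only
terms = {
    "grace": [("GEN", 6), ("JHN", 1)],
    "altar": [("EXO", 27)],
}
selected = {("JHN", 1), ("JHN", 2)}
print(filter_key_terms(terms, selected))  # only "grace" survives
```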
@ddaspit, so, filter on any data that is trained on or pretranslated? That would leave us with the same issue: if you just want to translate from English to Spanish using NLLB200, that is now prevented. If we want to implement that filter, I would consider it a separate enhancement.
You are correct. It would still train the model. This issue made me realize that we should filter the key terms.
We already have the `use_key_terms` build option, which excludes the key terms from the training data. That might be sufficient.
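For reference, a hedged sketch of what disabling key terms in a build request might look like. Only the `use_key_terms` option name comes from this thread; the surrounding request structure is assumed for illustration:

```python
import json

# Assumed shape of a translation build request; only "use_key_terms"
# is confirmed in this discussion, the rest is illustrative.
build_request = {
    "name": "pretranslate-only build",
    "options": {
        # Exclude key terms from the training data
        "use_key_terms": False,
    },
}
print(json.dumps(build_request, indent=2))
```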
If it is, we should test it out (at least manually) and then document it.
`use_key_terms` would be sufficient to not train any segments and still allow NLLB pretranslations without training. The filtering of key terms is also implemented.
Actually, the Serval changes need to be merged before this can be completed.
Should we add a separate flag for "only pretranslate"? Or should it work automatically: if there are no matching corpora, we don't include the key terms?