sillsdev / serval

A REST API for natural language processing services
MIT License

Keyterm data always gets added - and then we always train #476

Open johnml1135 opened 1 month ago

johnml1135 commented 1 month ago

Should we add a separate flag for "only pretranslate"? Or should it just work automatically: if there are no matching corpora, we don't include the key terms?

Nateowami commented 1 month ago

@johnml1135 It seems to me like "only train on key terms, then generate a draft for my first book" is a valid but very unusual use-case. It would have to be a new project in an NLLB language.

ddaspit commented 1 month ago

We could filter the key terms by book/chapter. Each key term has a list of verses that it occurs in.
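
As a rough illustration of the filtering idea above, here is a minimal sketch. The `KeyTerm` shape, the verse-reference format, and the book codes are assumptions for illustration, not Serval's actual data model:

```python
from dataclasses import dataclass

@dataclass
class KeyTerm:
    source: str              # source-language rendering (hypothetical field)
    target: str              # target-language rendering (hypothetical field)
    verse_refs: list[str]    # e.g. ["MRK 14:36"] (assumed reference format)

def filter_key_terms(terms, selected_books):
    # Keep a term only if at least one of its verses is in a selected book.
    return [
        t for t in terms
        if any(ref.split(" ", 1)[0] in selected_books for ref in t.verse_refs)
    ]

terms = [
    KeyTerm("Abba", "Padre", ["MRK 14:36", "ROM 8:15"]),
    KeyTerm("ephah", "efa", ["EXO 16:36"]),
]
# A project that only covers Mark: the "ephah" term is dropped.
print([t.source for t in filter_key_terms(terms, {"MRK"})])  # ['Abba']
```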

johnml1135 commented 1 month ago

@ddaspit, so we would filter on any data that is trained on or pretranslated? That would leave us with the same issue: if you just want to translate from English to Spanish using NLLB200 with no corpora, the key terms would still trigger training. If we want to implement that filter, I would consider it a separate enhancement.

ddaspit commented 1 month ago

You are correct. It would still train the model. This issue made me realize that we should filter the key terms.

We already have the use_key_terms build option, which can be used to exclude the key terms from the training data. That might be sufficient.
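
For reference, a minimal sketch of what a build request with key terms disabled might look like. The endpoint path, auth header, and body shape here are assumptions in the style of Serval's REST API, not verified against the current API surface:

```python
import requests

SERVAL_URL = "https://serval.example.com/api/v1"  # hypothetical host
ENGINE_ID = "engine-id"                            # hypothetical engine id

# Start a build that pretranslates a corpus but excludes key terms,
# so no key-term training data is injected (assumed body shape).
resp = requests.post(
    f"{SERVAL_URL}/translation/engines/{ENGINE_ID}/builds",
    headers={"Authorization": "Bearer <token>"},
    json={
        "pretranslate": [{"corpusId": "corpus-id"}],
        "options": {"use_key_terms": False},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```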

johnml1135 commented 1 month ago

If it is, we should test it out (at least manually) and then document it.

johnml1135 commented 1 day ago

use_key_terms would be sufficient to avoid training on any segments, allowing NLLB pretranslation without training. The filtering of key terms is also implemented.

johnml1135 commented 1 day ago

Actually, the Serval changes need to be merged before this can be completed.