Closed johnml1135 closed 1 year ago
Just give the error unless there is a really easy way to do otherwise. "doing it better" is handled in https://github.com/sillsdev/silnlp/issues/182.
Another option for these really long strings is to truncate them to the max number of tokens (or 70% of the max?). Then no error needs to be given and the user at least gets something, even if it is incomplete. Also if
NLLB only allows 200 tokens max -> https://huggingface.co/facebook/nllb-200-3.3B/blob/main/config.json. Therefore, we should just truncate to 200 tokens and be done with it.
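A minimal sketch of that truncation, assuming the segment has already been tokenized into a list. The name `truncate_tokens` and the constant `MAX_TOKENS` are hypothetical and not taken from the silnlp code; the value 200 is the limit quoted above.

```python
from typing import List, Sequence

# Assumed limit: the 200-token maximum quoted from the linked config.json.
MAX_TOKENS = 200


def truncate_tokens(tokens: Sequence[str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Keep only the first max_tokens tokens of a segment (hypothetical helper)."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    # Slicing never fails on short input, so short segments pass through unchanged.
    return list(tokens[:max_tokens])
```

If the Hugging Face tokenizer is what produces the tokens, passing `truncation=True` and `max_length=200` to the tokenizer call should have the same effect without a separate helper.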
This should be done...
If there are more than ~1000 tokens, the model will give an error. We need to either handle this with a graceful error or somehow break up the segment into multiple segments and then recombine.
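For reference, the "break up and recombine" option could look roughly like the sketch below. It is only an illustration under loud assumptions: whitespace-separated words stand in for real subword tokens, `translate` is a hypothetical callable that translates one short string, and the default limit of 200 is the figure mentioned elsewhere in this thread. Splitting at arbitrary positions will likely hurt translation quality at the chunk boundaries.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], max_tokens: int = 200) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most max_tokens tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def translate_long_segment(
    text: str,
    translate: Callable[[str], str],  # hypothetical: translates one short string
    max_tokens: int = 200,
) -> str:
    """Translate an over-long segment chunk by chunk and rejoin the results."""
    # Assumption: whitespace words approximate tokens; a real tokenizer would differ.
    tokens = text.split()
    chunks = split_into_chunks(tokens, max_tokens)
    return " ".join(translate(" ".join(chunk)) for chunk in chunks)
```

Example: `translate_long_segment(long_text, my_translate_fn)` calls `my_translate_fn` once per chunk, so no single call sees more than `max_tokens` words.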