sillsdev / machine.py

Machine is a natural language processing library for Python that is focused on providing tools for processing resource-poor languages.
MIT License

Proper error handling for very long segments #21

Closed by johnml1135 1 year ago

johnml1135 commented 1 year ago

If a segment has more than ~1000 tokens, the model will give an error. We need to either handle this with a graceful error or somehow break up the segment into multiple segments and then recombine them.

johnml1135 commented 1 year ago

Just give the error unless there is a really easy way to do otherwise. "Doing it better" is handled in https://github.com/sillsdev/silnlp/issues/182.
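
For illustration only, a graceful error could look something like the sketch below. The names `SegmentTooLongError`, `check_segment_length`, and `MAX_TOKENS` are hypothetical and not part of machine.py, and the limit of 1000 is just the rough estimate from the first comment, not a measured value.

```python
from typing import Sequence

# Assumed limit, taken from the "~1000 tokens" estimate in this thread.
MAX_TOKENS = 1000


class SegmentTooLongError(ValueError):
    """Hypothetical exception for a segment that exceeds the token limit."""


def check_segment_length(tokens: Sequence[str], max_tokens: int = MAX_TOKENS) -> None:
    """Raise a descriptive error instead of letting the model fail obscurely."""
    if len(tokens) > max_tokens:
        raise SegmentTooLongError(
            f"Segment has {len(tokens)} tokens, "
            f"but the model accepts at most {max_tokens}."
        )
```

Calling this before the segment reaches the model would let the caller catch one well-defined exception type and report the actual and allowed token counts to the user.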

johnml1135 commented 1 year ago

One other option for these really long strings is just to truncate them to the max number of tokens (or 70% of the max?). That way, no error needs to be given and the user at least gets something, even if it is not complete. Also, if some marker of the truncation could be put at the end (or similar; bonus if it is in the source language), that would be even better. This is only a stop-gap; the ideal is to break the segment up, so don't spend too long on it.

johnml1135 commented 1 year ago

NLLB only allows 200 tokens max (see https://huggingface.co/facebook/nllb-200-3.3B/blob/main/config.json). Therefore, we should just truncate to 200 tokens and be done with it.
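
A minimal sketch of the truncation idea, assuming the segment is already a list of token strings. `truncate_tokens` and its `marker` parameter are hypothetical names used for illustration, not the machine.py implementation. A real tokenizer may also add special tokens (for example a language code or end-of-sequence token), so the usable budget could be smaller than 200.

```python
from typing import List, Optional, Sequence


def truncate_tokens(
    tokens: Sequence[str],
    max_tokens: int = 200,  # limit quoted in this thread for NLLB
    marker: Optional[str] = None,  # optional token signalling the cut
) -> List[str]:
    """Return at most max_tokens tokens, optionally ending with a marker."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if len(tokens) <= max_tokens:
        # Short enough: return an unchanged copy.
        return list(tokens)
    if marker is None:
        return list(tokens[:max_tokens])
    # Reserve the last slot for the marker so the total stays within the limit.
    return list(tokens[: max_tokens - 1]) + [marker]


# Usage: a 300-token segment is cut to 200 tokens, the last one being the marker.
long_segment = [f"tok{i}" for i in range(300)]
shortened = truncate_tokens(long_segment, marker="...")
```

Reserving one slot for the marker keeps the output within the limit, at the cost of dropping one more source token.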

johnml1135 commented 1 year ago

This should be done...