The NLLB 600M, 1.3B, and 3.3B models each have a max length of 200 tokens - https://huggingface.co/facebook/nllb-200-3.3B/blob/main/config.json. We should chop segments to that length - that is, read the maximum length from the model and split up any segment that exceeds it.
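As a rough sketch (assuming the `transformers` library and that the checkpoint's config.json exposes `max_length`, as the linked NLLB config does), the limit could be read from the model rather than hard-coded:

```python
from transformers import AutoConfig

# Read the generation length limit from the model's config.json rather than
# hard-coding 200; the linked NLLB config sets "max_length": 200.
config = AutoConfig.from_pretrained("facebook/nllb-200-3.3B")
max_len = getattr(config, "max_length", 200)  # fall back to 200 if the field is absent
print(max_len)
```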
First analysis: a histogram of the number of tokens per segment (plus the number of segments with over 200 tokens) for the following languages:
This analysis will help prioritize this issue: how many segments in a Bible are affected, how many languages are affected, and are the languages we are working on in the immediate future among them? A sketch of this kind of analysis follows.
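A minimal sketch of the analysis described above, assuming a plain-text file with one segment per line and a stock NLLB tokenizer checkpoint (the actual script added later under silnlp/nmt may differ):

```python
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Tokenize each segment and record its length.
lengths = []
with open("segments.txt", encoding="utf-8") as f:
    for line in f:
        lengths.append(len(tokenizer(line.strip())["input_ids"]))

# Build a histogram in 10-token bins and count segments over the 200-token limit.
histogram = Counter(length // 10 * 10 for length in lengths)
over_limit = sum(1 for length in lengths if length > 200)

print(f"{over_limit} of {len(lengths)} segments over 200 tokens")
for bin_start in sorted(histogram):
    print(f"{bin_start:>4}-{bin_start + 9:<4} {histogram[bin_start]}")
```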
Here's a link to a spreadsheet containing segment length analysis for a variety of languages: https://docs.google.com/spreadsheets/d/1qOekM7VbYhhbNgM0s8x6JuGOx9DjN4r9hKcEQO9AhS4/edit?usp=sharing
There seem to be relatively few segments with lengths over 200 for most languages, even without the trained tokenizer. With the trained tokenizer, this number is reduced even further. Limbu with the Devanagari script is the notable exception, with a very large number of segments over 200 tokens when using the base tokenizer. However, using the Limbu script for Limbu along with the updated tokenizer reduces this to a much lower number.
Also, all segments written to the output files are capped at 200 tokens, so no segment exceeds 200 tokens; this means the maximum length calculated for a language is only accurate when its true maximum segment length is under 200 tokens. It also meant I had to count segments with length >= 200 rather than > 200, but for the purposes of this analysis that should be fine.
Can you put a few numbers on "lower" or "much lower"? 1%? 0.1%? 127 of 31,102?
The updated tokenizer makes this almost a non-issue, though without a retrained tokenizer (even if all the characters are recognized), up to 1% of segments can exceed the 200-token limit.
Other than refining when and how best to use the tokenizer, I see no more work that needs to be done here.
I agree. I just added the script I used for the segment length analysis under silnlp/nmt in the latest commit 554708ef3fe99ec0ded6627a3fd013638ef4fbb0.
At around 200 tokens, translation accuracy drops off. Research: should we split long segments up, and how would we do that?
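One possible approach (a sketch only, not an agreed-upon design): tokenize each segment, and if it exceeds the model's limit, split it at sentence boundaries into chunks that each stay under the limit, then translate the chunks separately and rejoin them. The checkpoint name and the regex-based sentence splitting below are assumptions for illustration.

```python
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
MAX_TOKENS = 200  # in practice this would be read from the model config


def split_segment(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split a segment into chunks that each tokenize to at most max_tokens.

    A naive sketch: split on sentence-final punctuation, then greedily pack
    sentences into chunks. Real sentence segmentation would need to be
    language-aware, and a single over-long sentence is left as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(tokenizer(candidate)["input_ids"]) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```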