The NLLB 600M, 1.3B, and 3.3B models each have a max length of 200 tokens - https://huggingface.co/facebook/nllb-200-3.3B/blob/main/config.json. We should chop segments to that length - that is, read the maximum length from the model and split up any segment that exceeds it.
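As a rough sketch (assuming the `transformers` library and that the checkpoint's config.json exposes `max_length`, as the linked NLLB config does), the limit could be read from the model rather than hard-coded:

```python
from transformers import AutoConfig

# Read the generation length limit from the model's config.json rather than
# hard-coding 200; the linked NLLB config sets "max_length": 200.
config = AutoConfig.from_pretrained("facebook/nllb-200-3.3B")
max_len = getattr(config, "max_length", 200)  # fall back to 200 if the field is absent
print(max_len)
```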
First analysis: a histogram of the number of tokens per segment (plus the number of segments with over 200 tokens) for the following languages:
This analysis will help prioritize this issue: how many segments in a Bible are affected, how many languages are affected, and are the languages we are working on in the immediate future among them? A sketch of this kind of analysis follows.
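A minimal sketch of the analysis described above, assuming a plain-text file with one segment per line and a stock NLLB tokenizer checkpoint (the actual script added later under silnlp/nmt may differ):

```python
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Tokenize each segment and record its length.
lengths = []
with open("segments.txt", encoding="utf-8") as f:
    for line in f:
        lengths.append(len(tokenizer(line.strip())["input_ids"]))

# Build a histogram in 10-token bins and count segments over the 200-token limit.
histogram = Counter(length // 10 * 10 for length in lengths)
over_limit = sum(1 for length in lengths if length > 200)

print(f"{over_limit} of {len(lengths)} segments over 200 tokens")
for bin_start in sorted(histogram):
    print(f"{bin_start:>4}-{bin_start + 9:<4} {histogram[bin_start]}")
```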
Here's a link to a spreadsheet containing segment length analysis for a variety of languages: https://docs.google.com/spreadsheets/d/1qOekM7VbYhhbNgM0s8x6JuGOx9DjN4r9hKcEQO9AhS4/edit?usp=sharing
There seem to be relatively few segments with lengths over 200 for most languages, even without the trained tokenizer. With the trained tokenizer, this number is reduced even further. Limbu with the Devanagari script is the notable exception, with a very large number of segments over 200 tokens when using the base tokenizer. However, using the Limbu script for Limbu along with the updated tokenizer reduces this to a much lower number.
Also, all segments written to the output files are capped at 200 tokens, so no segment exceeds 200 tokens; this means the maximum length calculated for a language is only accurate when its true maximum segment length is under 200 tokens. It also meant I had to count segments with length >= 200 rather than > 200, but for the purposes of this analysis that should be fine.
Can you put a few numbers on "lower" or "much lower"? 1%? 0.1%? 127 of 31,102?
The updated tokenizer makes this almost a non-issue, though without a retrained tokenizer (even if all the characters are recognized), up to 1% of segments can exceed the 200-token limit.
Other than refining when and how best to use the tokenizer, I see no more work that needs to be done here.
I agree. I just added the script I used for the segment length analysis under silnlp/nmt in the latest commit 554708ef3fe99ec0ded6627a3fd013638ef4fbb0.
At around 200 tokens, translation accuracy drops off. Research: should we split long segments up, and how would we do that?
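One possible approach (a sketch only, not an agreed-upon design): tokenize each segment, and if it exceeds the model's limit, split it at sentence boundaries into chunks that each stay under the limit, then translate the chunks separately and rejoin them. The checkpoint name and the regex-based sentence splitting below are assumptions for illustration.

```python
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
MAX_TOKENS = 200  # in practice this would be read from the model config


def split_segment(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split a segment into chunks that each tokenize to at most max_tokens.

    A naive sketch: split on sentence-final punctuation, then greedily pack
    sentences into chunks. Real sentence segmentation would need to be
    language-aware, and a single over-long sentence is left as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(tokenizer(candidate)["input_ids"]) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```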