sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Distill or prune model to save training time #288

Open johnml1135 opened 10 months ago

johnml1135 commented 10 months ago

If we can distill or prune NLLB-200 shortly after starting fine-tuning, we may be able to dramatically reduce (by 50% or more) the training and inference time needed. It could even do something like this:

johnml1135 commented 10 months ago

A few resources on pruning:

ddaspit commented 9 months ago

We can prune a model during training using Optimum.
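For the general idea, here is a minimal magnitude-pruning sketch in plain PyTorch. This is not the Optimum API, just an illustration of the kind of sparsification a pruning-aware trainer applies; the checkpoint and the 30% sparsity level are arbitrary placeholders.

```python
# Generic magnitude-pruning sketch (plain PyTorch, not the Optimum API):
# zero out the smallest 30% of weights in every linear layer, then make the
# sparsification permanent. Note that unstructured zeros only save compute
# when sparse-aware kernels are used.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # mask smallest 30%
        prune.remove(module, "weight")  # bake the zeros into the weight tensor
```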

ddaspit commented 2 months ago

Initially, we should try stripping out unused token embeddings when fine tuning for a particular language pair.
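A minimal sketch of the idea with Hugging Face Transformers, under the assumption that the kept token ids come straight from the fine-tuning data; the checkpoint, sentences, and variable names are placeholders, and a real implementation also has to remap the tokenizer (and any other vocab-sized bookkeeping) to the new ids, which is the sort of detail a dedicated trimming library handles.

```python
# Illustrative sketch only (not the actual implementation): shrink the shared
# embedding matrix of an NLLB checkpoint to the token ids seen in a
# fine-tuning corpus.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

corpus = ["Hello world.", "Bonjou tout moun."]   # placeholder training sentences
keep_ids = set(tokenizer.all_special_ids)
for line in corpus:
    keep_ids.update(tokenizer(line)["input_ids"])
keep_ids = sorted(keep_ids)

# Slice the embedding rows in place and re-tie the LM head to the smaller matrix.
embeddings = model.get_input_embeddings()
embeddings.weight = torch.nn.Parameter(embeddings.weight.data[keep_ids].clone())
embeddings.num_embeddings = len(keep_ids)
model.tie_weights()
model.config.vocab_size = len(keep_ids)
```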

isaac091 commented 1 month ago

Starting to work on pruning embeddings again using the hf-trim library. I need to test my implementation for M2M100 and then I will evaluate on several language pairs.

isaac091 commented 4 days ago

Update on pruning embeddings

The full results are in the "LayerReductionMethods" spreadsheet in the shared drive, under the "Vocab Pruning" tab. I ran some experiments using an early stopping metric instead of training for 5k steps, but because the scores of those models (both baseline and pruned) did not reach those of the baseline models trained for only 5k steps, I am ignoring those results.

I used two methods for deciding what to prune (see the sketch after this list):

  1. Only keeping the tokens found in the training data (all except the target test set)
  2. Keeping all tokens written in the same script as the training data (the source and target data used the same script for all the language pairs tested here). I also included the tokens from (1) so that necessary punctuation and other miscellaneous tokens were preserved.
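A rough sketch of these two selection strategies, assuming a SentencePiece-style vocabulary and the `regex` module's Unicode script properties; the helper names and checkpoint are illustrative, not the actual experiment script.

```python
# Illustrative only: two ways to choose which token ids to keep.
import regex
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Method 1: token ids that actually occur in the training/validation data.
def corpus_token_ids(lines):
    ids = set(tokenizer.all_special_ids)
    for line in lines:
        ids.update(tokenizer(line)["input_ids"])
    return ids

# Method 2: every vocabulary token written in the corpus script (plus Common/
# Inherited characters and the SentencePiece marker), unioned with the
# method-1 ids so punctuation and other miscellaneous tokens survive.
def script_token_ids(script, method1_ids):
    pattern = regex.compile(
        rf"^[\p{{Script={script}}}\p{{Script=Common}}\p{{Script=Inherited}}▁]+$"
    )
    ids = set(method1_ids)
    for token, idx in tokenizer.get_vocab().items():
        if pattern.match(token):
            ids.add(idx)
    return ids
```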

As of now, the only conclusion I can draw is that, in certain situations, these pruning methods can be used to reduce a model's memory usage without meaningfully impacting the scores. The main question that remains is whether these conditions can be well defined and, if so, whether they encompass a large enough portion of our use cases to justify adding this functionality as a training option.

Additional observations:

ddaspit commented 3 days ago

Did you ever try keeping the tokens that were found in the training and test data?

isaac091 commented 3 days ago

For method 1 above, I kept all the tokens from the source/target train and val splits, as well as the source test split. I didn't include the target test split.

For 4 of the 7 language pairs I tested, I looked at the token coverage of the test data (just the target test split) by the training data (everything else). The average raw coverage was 99.5% (of ~14,500 average tokens) and the average coverage of unique tokens was 98.6% (of 963 average tokens).
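For reference, the two coverage figures were computed along these lines (a sketch; the helper name is illustrative):

```python
# Illustrative helper: fraction of target-test tokens (raw and unique) that
# already occur in the rest of the data used for pruning.
def token_coverage(test_tokens, train_tokens):
    train_set = set(train_tokens)
    covered = [tok for tok in test_tokens if tok in train_set]
    raw_coverage = len(covered) / len(test_tokens)
    unique_coverage = len(set(covered)) / len(set(test_tokens))
    return raw_coverage, unique_coverage
```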