sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Distill or prune model to save training time #288

Open johnml1135 opened 10 months ago

johnml1135 commented 10 months ago

If we can distill or prune NLLB-200 shortly after starting fine-tuning, we may be able to dramatically reduce (by 50% or more) the training and inference time needed. It could even do something like this:

johnml1135 commented 10 months ago

A few resources on pruning:

ddaspit commented 9 months ago

We can prune a model during training using Optimum.
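For the general idea, here is a minimal magnitude-pruning sketch in plain PyTorch. This is not the Optimum API, just an illustration of the kind of sparsification a pruning-aware trainer applies; the checkpoint and the 30% sparsity level are arbitrary placeholders.

```python
# Generic magnitude-pruning sketch (plain PyTorch, not the Optimum API):
# zero out the smallest 30% of weights in every linear layer, then make the
# sparsification permanent. Note that unstructured zeros only save compute
# when sparse-aware kernels are used.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # mask smallest 30%
        prune.remove(module, "weight")  # bake the zeros into the weight tensor
```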

ddaspit commented 2 months ago

Initially, we should try stripping out unused token embeddings when fine tuning for a particular language pair.
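A minimal sketch of the idea with Hugging Face Transformers, under the assumption that the kept token ids come straight from the fine-tuning data; the checkpoint, sentences, and variable names are placeholders, and a real implementation also has to remap the tokenizer (and any other vocab-sized bookkeeping) to the new ids, which is the sort of detail a dedicated trimming library handles.

```python
# Illustrative sketch only (not the actual implementation): shrink the shared
# embedding matrix of an NLLB checkpoint to the token ids seen in a
# fine-tuning corpus.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

corpus = ["Hello world.", "Bonjou tout moun."]   # placeholder training sentences
keep_ids = set(tokenizer.all_special_ids)
for line in corpus:
    keep_ids.update(tokenizer(line)["input_ids"])
keep_ids = sorted(keep_ids)

# Slice the embedding rows in place and re-tie the LM head to the smaller matrix.
embeddings = model.get_input_embeddings()
embeddings.weight = torch.nn.Parameter(embeddings.weight.data[keep_ids].clone())
embeddings.num_embeddings = len(keep_ids)
model.tie_weights()
model.config.vocab_size = len(keep_ids)
```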

isaac091 commented 1 month ago

Starting to work on pruning embeddings again using the hf-trim library. I need to test my implementation for M2M100 and then I will evaluate on several language pairs.

isaac091 commented 4 days ago

Update on pruning embeddings

The full results are in the "LayerReductionMethods" spreadsheet in the shared drive, under the "Vocab Pruning" tab. I ran some experiments using an early stopping metric instead of training for 5k steps, but because the scores of those models (both baseline and pruned) did not reach those of the baseline models trained for only 5k steps, I am ignoring those results.

I used two methods for deciding what to prune (see the sketch after this list):

  1. Only keeping the tokens found in the training data (all except the target test set)
  2. Keeping all tokens written in the same script as the training data (the source and target data used the same script for all the language pairs tested here). I also included the tokens from (1) so that necessary punctuation and other miscellaneous tokens were preserved.
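A rough sketch of these two selection strategies, assuming a SentencePiece-style vocabulary and the `regex` module's Unicode script properties; the helper names and checkpoint are illustrative, not the actual experiment script.

```python
# Illustrative only: two ways to choose which token ids to keep.
import regex
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Method 1: token ids that actually occur in the training/validation data.
def corpus_token_ids(lines):
    ids = set(tokenizer.all_special_ids)
    for line in lines:
        ids.update(tokenizer(line)["input_ids"])
    return ids

# Method 2: every vocabulary token written in the corpus script (plus Common/
# Inherited characters and the SentencePiece marker), unioned with the
# method-1 ids so punctuation and other miscellaneous tokens survive.
def script_token_ids(script, method1_ids):
    pattern = regex.compile(
        rf"^[\p{{Script={script}}}\p{{Script=Common}}\p{{Script=Inherited}}▁]+$"
    )
    ids = set(method1_ids)
    for token, idx in tokenizer.get_vocab().items():
        if pattern.match(token):
            ids.add(idx)
    return ids
```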

As of now, the only conclusion I can draw is that, in certain situations, these pruning methods can be used to reduce a model's memory usage without meaningfully impacting the scores. The main question that remains is whether these conditions can be well defined and, if so, whether they encompass a large enough portion of our use cases to justify adding this functionality as a training option.

Additional observations:

ddaspit commented 3 days ago

Did you ever try keeping the tokens that were found in the training and test data?

isaac091 commented 3 days ago

For method 1 above, I kept all the tokens from the source/target train and val splits, as well as the source test split. I didn't include the target test split.

For 4 of the 7 language pairs I tested, I looked at the token coverage of the test data (just the target test split) by the training data (everything else). The average raw coverage was 99.5% (of ~14,500 average tokens) and the average coverage of unique tokens was 98.6% (of 963 average tokens).
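For reference, the two coverage figures were computed along these lines (a sketch; the helper name is illustrative):

```python
# Illustrative helper: fraction of target-test tokens (raw and unique) that
# already occur in the rest of the data used for pruning.
def token_coverage(test_tokens, train_tokens):
    train_set = set(train_tokens)
    covered = [tok for tok in test_tokens if tok in train_set]
    raw_coverage = len(covered) / len(test_tokens)
    unique_coverage = len(set(covered)) / len(set(test_tokens))
    return raw_coverage, unique_coverage
```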