johnml1135 opened 10 months ago
Initially, we should try stripping out unused token embeddings when fine-tuning for a particular language pair.
Starting to work on pruning embeddings again using the hf-trim library. I need to test my implementation for M2M100 and then I will evaluate on several language pairs.
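As a rough illustration of the idea (not the hf-trim API itself, whose exact interface I would need to double-check), here is a minimal sketch of vocabulary pruning for M2M100 using plain `transformers`, assuming a hypothetical in-memory `corpus` that stands in for the language pair's bitext:

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Hypothetical corpus; in practice this would be the train/val bitext
# for the language pair being fine-tuned.
corpus = ["Hello, world.", "Bonjour le monde."]

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# Collect every token id seen in the data, plus the tokenizer's special tokens
# (for M2M100 the language-code tokens are added as special tokens and must be kept).
keep_ids = set(tokenizer.all_special_ids)
for text in corpus:
    keep_ids.update(tokenizer(text).input_ids)
keep_ids = sorted(keep_ids)

# Replace the shared embedding table with only the kept rows.
old_emb = model.get_input_embeddings()
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[keep_ids].clone()
model.set_input_embeddings(new_emb)

# Assuming the LM head shares weights with the embeddings (the default for
# M2M100), re-tie it and update the config to the new vocabulary size.
model.tie_weights()
model.config.vocab_size = len(keep_ids)

# Note: token ids in the data must also be remapped (old id -> new row index);
# this is the bookkeeping a library like hf-trim handles for you.
id_map = {old: new for new, old in enumerate(keep_ids)}
```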
The full results are in the "LayerReductionMethods" spreadsheet in the shared drive, under the "Vocab Pruning" tab. I ran some experiments using an early-stopping criterion instead of training for a fixed 5k steps, but because those models (both baseline and pruned) did not reach the scores of the baseline models trained for only 5k steps, I am ignoring those results.
I used two methods for deciding what to prune:
As of now, the only conclusion I can draw is that these pruning methods can be used in certain situations to reduce the memory usage of models without meaningfully impacting the scores. The main question that remains is whether those conditions can be well defined and, if so, whether they cover a large enough portion of our use cases to justify adding this functionality as a training option.
Additional observations:
Did you ever try keeping the tokens that were found in the training and test data?
For method 1 above, I kept all the tokens from the source/target train and val splits, as well as the source test split. I didn't include the target test split.
For 4 of the 7 language pairs I tested, I looked at the token coverage of the test data (just the target test split) by the training data (everything else). The average raw coverage was 99.5% (of ~14,500 average tokens) and the average coverage of unique tokens was 98.6% (of 963 average tokens).
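For reference, a minimal sketch of how that coverage could be computed, assuming hypothetical `train_tokens` (token ids from all splits kept for pruning) and `target_test_tokens` (token ids from the target test split) lists:

```python
from collections import Counter

def token_coverage(train_tokens, target_test_tokens):
    """Return (raw coverage, unique-token coverage) of the target test split
    by the tokens kept from the training data."""
    kept = set(train_tokens)

    # Raw coverage: fraction of all test token occurrences that are in the kept set.
    test_counts = Counter(target_test_tokens)
    total = sum(test_counts.values())
    covered = sum(count for tok, count in test_counts.items() if tok in kept)
    raw_coverage = covered / total

    # Unique coverage: fraction of distinct test tokens that are in the kept set.
    unique_test = set(target_test_tokens)
    unique_coverage = len(unique_test & kept) / len(unique_test)

    return raw_coverage, unique_coverage
```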
If we can distill or prune NLLB-200 shortly after starting fine-tuning, we may be able to dramatically reduce (by 50% or more) the training and inference time needed. It could even do something like this: