sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Improving accuracy of model when adding new language #189

Closed · mshannon-sil closed 1 year ago

mshannon-sil commented 1 year ago

Currently, the tokenizer_updates branch allows the tokenizer to support new languages by either adding characters or trained tokens. However, the model then needs to be trained/fine-tuned with these tokens to produce more accurate output.
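
For reference, here's a minimal sketch of what that looks like with the Hugging Face `transformers` API (the checkpoint name and tokens are just placeholders, and the tokenizer_updates branch may differ in the details):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Add new characters or trained sentencepiece tokens for the new language.
new_tokens = ["ŋ", "ɓ", "▁ŋa"]  # placeholder examples
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the new tokens. The new rows start out
# randomly initialized, which is why fine-tuning is still needed.
model.resize_token_embeddings(len(tokenizer))
```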

Meta AI has put out a paper that covers pretty much our exact use case of adding new languages to NLLB: https://aclanthology.org/2023.eacl-main.111.pdf. It looks like they have three main techniques: custom initialization of new embeddings and new layers in the model, mixing old and new training data together but with the new data upsampled, and reducing the learning rate for old parameters not related to the new embeddings or layers. I'm going to spend some time adapting these techniques to our use case and see how well they perform.

Also, there is a separate GitHub issue open for implementing a transfer learning approach where an extra layer is added and all other layers are frozen. That may help improve model accuracy with new languages as well.

johnml1135 commented 1 year ago

Things to try based off of reading https://aclanthology.org/2023.eacl-main.111.pdf:

  1. Use the <unk> token embedding to initialize new token embeddings (see the sketch at the end of this comment)
  2. Double the width according to the paper "For wider models, we double the hidden dimension size to 8192"
    • Initialize with noisy version of itself: "We inject zero-mean Gaussian noise with std = 0.01. We also tried not adding noise to the new parameters, which has almost identical performance." + normalization
    • Use learning rates as specified in the paper
  3. Ignore the + 6 depth approach
  4. Punt on adding more corpora, either FLORES 200 or eBibles - https://github.com/sillsdev/silnlp/issues/187
  5. Use the learning rate specified by the paper, but instead of adding more width, apply the slower rate to layers 3 to n-2.
  6. Modified depth - add 2 new layers for both the encoder and decoder, randomly initialized, and use a learning rate for the model starting at 0 for the first 1000 steps and then linearly increasing for the next 7000

Note: Approaches 5 and 6 are modifications of the paper's methods, made with the knowledge that we don't care about forgetting the other languages - we just want to use the model's general knowledge to get the best possible performance on the new language.
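
A rough sketch of what items 1 and 2 could look like in code (assuming a Hugging Face NLLB model where `num_added` is the count returned by `tokenizer.add_tokens`; only the zero-mean noise with std = 0.01 comes from the paper, the rest is illustrative):

```python
import torch

def init_new_token_embeddings(model, tokenizer, num_added, noise_std=0.01):
    """Initialize the newly added rows from the <unk> embedding plus Gaussian noise."""
    with torch.no_grad():
        emb = model.get_input_embeddings().weight  # (vocab_size, hidden_size)
        unk = emb[tokenizer.unk_token_id].clone()
        noise = torch.randn(num_added, emb.size(1), device=emb.device, dtype=emb.dtype) * noise_std
        # The last num_added rows are the ones created by resize_token_embeddings.
        emb[-num_added:] = unk + noise
```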

ddaspit commented 1 year ago

This paper is trying to achieve a different goal. Their primary goal is to add new languages to an existing MMT model without fully retraining. Our goal is to train a model for a particular language pair by fine-tuning. They want to maintain the performance of the existing language pairs while also maximizing the performance of the new language pairs. We are only interested in maximizing the performance of a single language pair in a specific domain. In other words, they want to train a multilingual model and we want to train a bilingual model.

The data up-sampling and learning rate scaling techniques are used to avoid catastrophic forgetting of existing language pairs while maximizing learning of new language pairs. These techniques are not really applicable to our use case.

There are two parts to the parameter initialization technique: wider/deeper architecture and token embeddings. The main purpose of using a larger architecture is to increase the capacity of the model to support the additional languages. This is not relevant for our use case. Token embedding initialization is relevant to us, since we will need to update the vocab for some models. Initializing new token embeddings to the <unk> embedding seems like a good practice for us.

ddaspit commented 1 year ago

The main takeaway from this paper for us is to use the <unk> embedding to initialize new token embeddings.

mshannon-sil commented 1 year ago

Thanks for the feedback! When I read the paper, I figured their use case of expanding the number of languages in the model was similar enough to our use case of adding a single language that at least some of the lessons would transfer over, e.g. initializing with <unk>. But you're right, their desire to keep supporting all the original languages is a significant difference.

You mentioned that the main purpose of expanding the architecture is to support more languages and that it's not relevant to our use case. But it's still true that we don't want catastrophic forgetting in the model, since that could affect performance even on the languages we're training on, right? So I would think that adding new parameters by increasing width/depth would be a helpful way of training the model on the new language while avoiding that problem. Or does the default state of the model not contribute significantly to translations in new language directions, so it's okay to forget it?

ddaspit commented 1 year ago

Increasing the size of the model is not really to avoid catastrophic forgetting. It is used to increase the capacity of the model so that it can learn the new languages. Here is the relevant quote from the paper:

During the continual learning stage, we may also want to increase the model size overall to have extra capacity to learn the new languages and improve old languages at the same time.

We don't need to increase the capacity of the model, since we are only interested in learning a single language pair. In fact, a model that has too much capacity for the desired task can have a negative effect, since it can result in overfitting. The key difference is that they are interested in continual/incremental learning, and we are interested in transfer learning.

Of course, sequence-to-sequence models can be unpredictable, so I could be wrong and increasing the size might give a small improvement. I'm just trying to direct our limited resources efficiently.

ddaspit commented 1 year ago

I would say that if anything (other than using the <unk> embedding) would be helpful for our use case, it would be messing with the learning rate. I could see how it might be helpful to have a higher learning rate for the new token embeddings than the existing token embeddings. Of course, this seemed to have the smallest impact of all of the techniques (just 0.1 improvement in spBLEU for new languages), so probably not worth it.
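
For reference, splitting learning rates by parameter group in PyTorch looks roughly like this (a sketch only; the rates are placeholders, and because the embedding matrix is a single parameter, new and existing token rows would share whichever rate it gets):

```python
import torch

def build_optimizer(model, emb_lr=3e-4, base_lr=5e-5):
    """One learning rate for the embedding matrix, another for everything else."""
    embedding_params = list(model.get_input_embeddings().parameters())
    embedding_ids = {id(p) for p in embedding_params}
    other_params = [p for p in model.parameters() if id(p) not in embedding_ids]
    return torch.optim.AdamW(
        [
            {"params": embedding_params, "lr": emb_lr},  # placeholder values
            {"params": other_params, "lr": base_lr},
        ]
    )
```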

johnml1135 commented 1 year ago

I agree with your recommendation, Damien - initializing with the <unk> embedding and trying a few different learning rates may be the best scope for finishing this up. We can take a look at model sizes when we do #187.

mshannon-sil commented 1 year ago

Sounds good. I've already written and tested some code to initialize the embedding to <unk>. I got some unexpected behavior, though: when I added only 500 tokens, the BLEU score increased, but when I added more tokens, such as 2000 or 4000, the BLEU score dropped by a couple of points compared to not using the <unk> embeddings with the same number of tokens. Any thoughts as to why this might be happening?

ddaspit commented 1 year ago

I agree that #187 is the kind of scenario where increasing the model size might be helpful. Overall, I think we would be better off starting from a larger pretrained model (NLLB 3.3B) rather than increasing the size of the pretrained model when fine-tuning.

ddaspit commented 1 year ago

@mshannon-sil I am guessing that using the <unk> embedding does not result in 100% consistent improvement across all languages. The paper noted only a 0.2 average spBLEU improvement when using the <unk> embedding. Some languages might see an improvement while others do not, but if on average it is better, then we should use it. We should make it a configuration option, so that we can run multiple experiments to see if we see an average improvement.

It is also important to note that their approach for adding new tokens is different from ours. They kind of cheat. For the M20 model, they train a sentencepiece model on all 20 languages with a vocab of 64K tokens. For the Mt25 model, they train a sentencepiece model on all 25 languages with a vocab of 64K tokens. They copy over the overlapping token embeddings and initialize the new token embeddings to the <unk> token embedding. They do not add tokens to the M20 sentencepiece model like we are doing. In addition, we are adding characters and not sentencepiece tokens. These differences might account for the discrepancy. In order to replicate their approach, we would need to train a new sentencepiece model using the original 200 languages + the new language, which is something we wanted to try (probably using the FLORES-200 dataset).
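
To make the difference concrete, here is a rough sketch of that paper-style remapping (assuming both tokenizers expose a Hugging Face-style `get_vocab()`; again, this is not what the tokenizer_updates branch does):

```python
import torch

def remap_embeddings(old_embedding, old_tokenizer, new_tokenizer):
    """Build an embedding matrix for a freshly trained vocab: copy rows for
    tokens that overlap with the old vocab, initialize the rest to <unk>."""
    with torch.no_grad():
        unk = old_embedding[old_tokenizer.unk_token_id]
        # Start with every row set to the old <unk> embedding.
        new_embedding = unk.repeat(len(new_tokenizer), 1)
        old_vocab = old_tokenizer.get_vocab()  # token string -> old id
        for token, new_id in new_tokenizer.get_vocab().items():
            old_id = old_vocab.get(token)
            if old_id is not None:
                new_embedding[new_id] = old_embedding[old_id]
        return new_embedding
```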

mshannon-sil commented 1 year ago

Sounds good, I'll make the <unk> embedding an option in the config file.

As for experimenting with the learning rate, from what I've seen in the documentation, it doesn't look like we can set different parts of the embedding to have different learning rates, since the entire embedding is considered one parameter and the learning rate is set on a parameter basis. I have tried changing the learning rate according to method 5 outlined above, using a couple of different learning rates inspired by the paper, but it dropped the BLEU score by multiple points, so that's likely not a viable solution.

With this, it sounds like we've looked into all the options from the paper that we might benefit from. Unless there's something else either of you would like me to try, I'll close out this issue once the tokenizer is merged into master.

johnml1135 commented 1 year ago

I agree with your path forward. Let's just integrate the changes into machine.py and move on to the next thing.

johnml1135 commented 1 year ago

A proposal for proceeding: