huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Adding support for training chat models #187

Closed TJ-Solergibert closed 1 month ago

TJ-Solergibert commented 4 months ago

[!CAUTION] 🚨 This is a draft, still in development, and further testing needs to be done. Feel free to leave any comments!

This PR includes everything necessary to train chat models with:

  1. Sample packing
  2. No cross-attention contamination between packed samples
  3. Training on completions only (i.e. the loss is computed only on the assistant's answers)

This image from @sz128 is very helpful for understanding points 1 and 2: [image]
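To make points 1 and 2 more concrete, here is a minimal, illustrative sketch (not the PR's actual code) of what a packed sequence looks like when every sample keeps its own position ids and the sample boundaries are tracked so attention can be restricted per sample:

```python
# Minimal sketch of packing two conversations into one training sequence.
# Token ids below are made up; the point is the metadata that prevents
# cross-attention contamination between the packed samples.
import torch

sample_a = [101, 7592, 2088, 102]        # hypothetical token ids, conversation A
sample_b = [101, 2129, 2024, 2017, 102]  # hypothetical token ids, conversation B

input_ids = torch.tensor(sample_a + sample_b)

# Position ids restart at 0 for every packed sample.
position_ids = torch.tensor(list(range(len(sample_a))) + list(range(len(sample_b))))

# Cumulative sequence lengths mark the sample boundaries; varlen attention
# kernels (e.g. flash-attn's flash_attn_varlen_func) use them to block
# attention from one packed sample to another.
cu_seqlens = torch.tensor([0, len(sample_a), len(sample_a) + len(sample_b)], dtype=torch.int32)

print(position_ids)  # tensor([0, 1, 2, 3, 0, 1, 2, 3, 4])
print(cu_seqlens)    # tensor([0, 4, 9], dtype=torch.int32)
```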

I am developing this feature with axolotl's implementation as a reference. The current status is as follows:

Dataset

IterableDatasets

This time, I have opted for an IterableDataset instead of a map-style one. The obvious benefits are that we tokenize on the fly, which allows us to easily experiment with different models/tokenizers/chat templates and saves disk space by not storing the tokens. However, the drawbacks are:

Of all these inconveniences, the one that worries me the most is the third one, but I trust that they will develop an optimal solution soon. We can easily develop solutions for the first and second issues, and the fourth one does not seem too problematic, although we could also address it.
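As a rough sketch of the on-the-fly tokenization idea (the dataset, split, and column names below are placeholders for illustration, not what this PR ships):

```python
# Rough sketch of on-the-fly tokenization with a streaming (iterable) dataset.
# The dataset, split and column names are placeholders chosen for illustration.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)

def tokenize(example):
    # Runs lazily, only when the DataLoader actually pulls the sample,
    # so no tokens are ever written to disk.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"input_ids": tokenizer(text, add_special_tokens=False)["input_ids"]}

tokenized = dataset.map(tokenize)
print(next(iter(tokenized))["input_ids"][:10])
```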

How Samples Are Produced

In short, we extract samples from the dataset and apply the chat template until we can no longer fit a full Question-Answer pair into the sequence length of the sample we are constructing. We save this last Question-Answer pair for the next sample and pad the sample we are constructing (in the case of the Llama3 tokenizer, since there is no pad token, we use the <|eot_id|> token). We do this so that each sample contains several complete Question-Answer pairs and no truncated ones. This packing is greedy, although there are more complex strategies to minimize the number of pad tokens.
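Roughly, the packing logic looks like this (illustrative only, simplified from the actual implementation; the sequence length is an example value, and it assumes every Question-Answer pair fits within the sequence length):

```python
# Greedy packing sketch: keep appending whole tokenized Question-Answer pairs
# until the next one no longer fits, then pad the sample and carry the pair
# over to the next sample. 128009 is <|eot_id|> in the Llama3 tokenizer,
# used here as the pad token since Llama3 has no dedicated pad token.
SEQUENCE_LENGTH = 2048
PAD_TOKEN_ID = 128009

def pack_qa_pairs(qa_pairs):
    """qa_pairs yields lists of token ids, each a complete Question-Answer pair."""
    buffer = []
    for pair in qa_pairs:
        if len(buffer) + len(pair) <= SEQUENCE_LENGTH:
            buffer.extend(pair)
        else:
            yield buffer + [PAD_TOKEN_ID] * (SEQUENCE_LENGTH - len(buffer))
            buffer = list(pair)
    if buffer:
        yield buffer + [PAD_TOKEN_ID] * (SEQUENCE_LENGTH - len(buffer))
```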

The important thing here is that we have developed the ChatTokenizer class to apply the chat template manually instead of relying on solutions like the tokenizers' apply_chat_template method. We do this so that we know, at the token level, whether each token belongs to an assistant response, which is what the "train only on the assistant's tokens" feature needs. I have added an assert to verify that the result of applying the chat template manually is exactly the same as the output of the tokenizers' apply_chat_template method.
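The core idea can be sketched like this (the template strings below are hand-written to mimic the Llama3 format and are not taken from the PR's ChatTokenizer):

```python
# Sketch of tokenizing a conversation role by role, so every token carries a
# flag saying whether it belongs to an assistant turn. Template strings mimic
# the Llama3 chat format but are written by hand for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def tokenize_conversation(messages):
    token_ids, is_assistant = [tokenizer.bos_token_id], [False]
    for message in messages:
        chunk = (f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
                 f"{message['content']}<|eot_id|>")
        ids = tokenizer.encode(chunk, add_special_tokens=False)
        token_ids.extend(ids)
        is_assistant.extend([message["role"] == "assistant"] * len(ids))
    return token_ids, is_assistant

messages = [{"role": "user", "content": "Hi"},
            {"role": "assistant", "content": "Hello!"}]
ids, assistant_mask = tokenize_conversation(messages)

# Same spirit as the PR's assert: the manual template should reproduce
# apply_chat_template exactly (this should hold for the Llama3 template;
# note that the official template trims whitespace around message contents).
assert ids == tokenizer.apply_chat_template(messages)
```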

Dataset Samples

I have developed this notebook so you can check the batches produced by the DataLoader. In summary, the most relevant features are:

[!NOTE] The label id shown as '-' is actually -100. We substitute it for display because tokenizer.convert_ids_to_tokens can't convert the -100 id.

  • When training just on the assistant's answers, the first token we predict is ĊĊ, which corresponds to the "\n\n" from the Llama3 chat template. When interacting with a model, we prompt a question, apply the chat template, and the model starts generating from this "\n\n" token. [screenshot]
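For reference, a tiny helper in the spirit of the notebook's batch inspection (the batch layout in the usage comment is an assumption):

```python
# Sketch of how label ids can be rendered for inspection: tokens that are
# masked out of the loss carry the label -100, which convert_ids_to_tokens
# cannot handle, so they are displayed as '-' instead.
def render_labels(tokenizer, label_ids):
    return ["-" if label_id == -100 else tokenizer.convert_ids_to_tokens(label_id)
            for label_id in label_ids]

# e.g. render_labels(tokenizer, batch["label_ids"][0].tolist())
# -> ['-', '-', ..., 'ĊĊ', 'Hello', '!', ...]  (only assistant tokens keep labels)
```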

Other Considerations

As I already mentioned, the final two configurations are only there to evaluate the effect of these two functionalities. I would remove them for the final release, since I do not see the benefit of not activating them.
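For illustration, the two toggles could look something like this (the field names below are hypothetical and do not reflect the PR's actual configuration schema):

```python
# Hypothetical dataclass sketch of the two toggles discussed above; the names
# are illustrative only, not the PR's actual configuration fields.
from dataclasses import dataclass

@dataclass
class ChatDatasetArgs:
    train_on_completions_only: bool = True  # compute the loss only on assistant tokens
    remove_cross_attention: bool = True     # block attention across packed samples
```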

What Is Still Missing:

TODOs:

xrsrke commented 1 month ago

Closed in favor of https://github.com/swiss-ai/nanotron/pull/14