mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

How did you train the MPT-7b-chat model? #343

Closed eldarkurtic closed 1 year ago

eldarkurtic commented 1 year ago

Hi everyone, thanks a lot for the great library!

Could someone please explain a bit more about how the chat model was trained? More specifically, I am interested in how the input/output data was preprocessed.

For example, the instruct model is using:

prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is kangen water? ### Response: "
response = "Kangen water is alkaline ionized water ...".

which follows the expected format of the llm-foundry SFT dataloader, where the model is trained to predict only the response part and inputs are formatted as: "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: <some_text>. ### Response: <some_text>"
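
(For illustration only: one such pair written as the kind of prompt/response record an SFT dataloader could consume. The field names below are an assumption on my part; the exact schema depends on the dataset's preprocessing.)

```python
# Illustrative sketch only: a single instruct-formatted SFT record as a
# prompt/response pair. Field names are assumed, not taken from the repo's schema.
example = {
    "prompt": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request. "
        "### Instruction: what is kangen water? ### Response: "
    ),
    "response": "Kangen water is alkaline ionized water ...",
}
print(example["prompt"] + example["response"])  # full training-time text; only the response is loss-generating
```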

  1. Is the same format used for finetuning the chat model? From the blog post, I see that the chat model is trained on a mixture of instruction datasets (e.g. Alpaca) and chat datasets (e.g. HH_RLHF). This confuses me a bit, because I am not sure how they are formatted and used together. Is HH_RLHF formatted in the instruct format (Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: <some_text>. ### Response: <some_text>), with the Human part used as the Instruction and the Assistant part as the Response, or is Alpaca reformatted to the chat format used in the hf_chat.py script (You are a helpful assistant. User: <some_text>. Assistant: <model_write_here>)?
  2. How do you handle multi-turn conversations from the HH_RLHF dataset during the finetuning stage? Do you just train on the last answer given the entire conversation history as the prompt, or something else?

Thanks for your time and answers!

sasaadi commented 1 year ago

What I learned is that Alpaca and other datasets are converted to ChatML syntax. But the details of finetuning are also my question.

samhavens commented 1 year ago
  1. We actually use neither of the formats you mention. We use ChatML syntax, described here and implemented in code in this file (an illustrative sketch follows below).
  2. Each turn of a multi-turn conversation results in 1 training sample. Only the response tokens are loss-generating, so it only learns from a given response once.
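
For illustration (the linked file is the source of truth), ChatML wraps each message in <|im_start|> / <|im_end|> markers with a role tag. A minimal sketch of that rendering, using a made-up system prompt:

```python
# Sketch only, not the llm-foundry implementation: render a conversation into
# ChatML-style text, wrapping each turn in <|im_start|>/<|im_end|> with its role.
def to_chatml(system_prompt: str, turns: list) -> str:
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>"]
    for turn in turns:  # each turn: {"role": "user" | "assistant", "content": "..."}
        parts.append(f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>")
    return "\n".join(parts)

print(to_chatml(
    "A conversation between a user and a helpful assistant.",  # made-up prompt
    [
        {"role": "user", "content": "what is kangen water?"},
        {"role": "assistant", "content": "Kangen water is alkaline ionized water ..."},
    ],
))
```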
tylerweitzman commented 1 year ago

So there's no reinforcement learning?

samhavens commented 1 year ago

There was no RL involved in training mpt-7b-chat or mpt-30b-chat.

TamirHCL commented 1 year ago

@samhavens, could you please elaborate? You wrote:

> Each turn of a multi-turn conversation results in 1 training sample. Only the response tokens are loss-generating, so it only learns from a given response once.

So I am wondering what exactly you mean.

Given the following conversation:

User: My name is John. What is 1+1?
Assistant: Hi John, 1+1=2
User: What is my name?
Assistant: Your name is John

Would we get the following two training samples:

1-1) User: My name is John. What is 1+1? Assistant: Hi John, 1+1=2

1-2) User: What is my name? Assistant: Your name is John

Or would we get the following two training samples:

2-1) User: My name is John. What is 1+1? Assistant: Hi John, 1+1=2

2-2) User: My name is John. What is 1+1? Assistant: Hi John, 1+1=2 User: What is my name? Assistant: Your name is John

Because from what you wrote it seems like it would be the first scenario, but in that case the assistant wouldn't know that the user's name is John. Can you please clarify or explain to me what I am missing?

samhavens commented 1 year ago

@TamirHCL The second is what I meant, but I see how what I said was misleading. I just meant there was a 1-to-1 relationship between assistant turns and training samples, not that each turn ended up in only one sample.
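
To make that mapping concrete, here is a rough sketch (an illustration, not the actual llm-foundry code) of expanding a conversation into samples: each assistant turn yields one sample whose prompt is the full history up to that point, and only that turn's response is loss-generating.

```python
# Sketch only: expand a conversation so that every assistant turn becomes one
# training sample. The prompt carries the full history; only that sample's
# response tokens would be loss-generating.
def expand_conversation(turns):
    samples, history = [], ""
    for turn in turns:  # each turn: {"role": "user" | "assistant", "content": "..."}
        text = f"{turn['role'].capitalize()}: {turn['content']}\n"
        if turn["role"] == "assistant":
            samples.append({"prompt": history, "response": text})
        history += text
    return samples

conversation = [
    {"role": "user", "content": "My name is John. What is 1+1?"},
    {"role": "assistant", "content": "Hi John, 1+1=2"},
    {"role": "user", "content": "What is my name?"},
    {"role": "assistant", "content": "Your name is John"},
]
# Yields two samples; the second sample's prompt contains the whole earlier
# exchange, matching scenario 2 above.
for sample in expand_conversation(conversation):
    print(sample)
```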

TamirHCL commented 1 year ago

@samhavens thank you so much for clarifying! On a separate topic, can I ask why the storywriter model was trained with max_seq_len set to 65K? Was it because of memory limitations with 80GB GPUs, or merely because you needed some limit and this seemed like a decent one?

samhavens commented 1 year ago

@TamirHCL We needed some max_seq_len, and at that point the longest LLM context anyone had access to was 8k, so we wanted to go well beyond that.

TamirHCL commented 1 year ago

@samhavens Follow-up questions regarding long prompts.

1) Is there a rule of thumb for how much VRAM I need to train MPT-7B with different sequence lengths? How many 80GB GPUs are needed to train with max_seq_len set to 80K or 100K?

2) Is there some recommended value to set "PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb" to when training with max_seq_len set to 60K-100K?
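
(For context on how that option is applied, not on what value to use: PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it must be set before any CUDA allocations. A minimal sketch, with a placeholder value rather than a recommendation:)

```python
# Minimal sketch: set the allocator option before any CUDA memory is allocated.
# "512" is only a placeholder, not a recommended value for long-sequence training.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch
x = torch.zeros(1, device="cuda")  # allocator is configured by the time this runs
```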