What I learned is that Alpaca and other datasets are converted to ChatML syntax. But the details of the finetuning are also part of my question.
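For reference, a minimal sketch of what such a ChatML conversion could look like. The system prompt and the exact delimiter tokens below are assumptions based on the ChatML convention, not code taken from llm-foundry:

```python
# Minimal sketch of converting one instruction/response pair into ChatML syntax.
# The system prompt and <|im_start|>/<|im_end|> delimiters follow the ChatML
# convention; the exact strings used for mpt-7b-chat are assumptions here.

def to_chatml(instruction: str, response: str,
              system: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>\n"
    )

print(to_chatml("What is 1+1?", "1+1=2"))
```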
So there's no reinforcement learning?
There was no RL involved in training mpt-7b-chat or mpt-30b-chat.
@samhavens, could you please elaborate? You wrote:
"Each turn of a multi-turn conversation results in 1 training sample. Only the response tokens are loss-generating, so it only learns from a given response once."
So I am wondering what exactly you mean.
Given the following conversation:
User: My name is John. What is 1+1? Assistant: Hi John, 1+1=2 User: What is my name? Assistant: Your name is John
Would we get the following two training samples:
1-1) User: My name is John. What is 1+1? Assistant: Hi John, 1+1=2
1-2) User: What is my name? Assistant: Your name is John
Or would we get the following two training samples:
2-1) User: My name is John. What is 1+1? Assistant: Hi John, 1+1=2
2-2) User: My name is John. What is 1+1? Assistant: Hi John, 1+1=2 User: What is my name? Assistant: Your name is John
From what you wrote, it seems like it would be the first scenario, but in that case the assistant wouldn't know that the user's name is John. Can you please clarify or explain what I am missing?
@TamirHCL The second is what I meant, but I see how what I said was misleading. I just meant there was a 1-to-1 relationship, not that each turn ended up in only one sample.
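To make the second scenario concrete, here is a rough sketch of how each assistant turn could yield one sample containing the full conversation prefix, with the loss computed only on the response tokens. This is my own illustration, not the llm-foundry dataloader code, and the helper names are made up:

```python
# Rough sketch: each assistant turn becomes one training sample whose input is
# the full conversation up to (and including) that response, and only the
# response tokens contribute to the loss (the prompt part would be loss-masked).
# This mirrors the second scenario above; it is not the actual llm-foundry code.

conversation = [
    ("User", "My name is John. What is 1+1?"),
    ("Assistant", "Hi John, 1+1=2"),
    ("User", "What is my name?"),
    ("Assistant", "Your name is John"),
]

def build_samples(turns):
    samples = []
    prefix = ""
    for role, text in turns:
        turn = f"{role}: {text} "
        if role == "Assistant":
            # One sample per assistant turn: prompt = everything before the
            # response (loss-masked), target = the response itself.
            samples.append({"prompt": prefix, "response": turn})
        prefix += turn
    return samples

for i, s in enumerate(build_samples(conversation), 1):
    print(f"sample {i}: prompt={s['prompt']!r} response={s['response']!r}")
```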
@samhavens thank you so much for clarifying! On a separate topic, can I ask how come the storywriter model was trained with max_seq_len set to 65K? Was it because of memory limitations with 80GB GPUs? Or merely because you needed some limit and this seemed decent?
@TamirHCL We needed some max_seq_len, and at that point, the longest LLM context that anyone had access to was 8k, so we wanted to go well beyond that.
@samhavens Follow-up questions regarding long prompts.
1) Is there a rule of thumb for how much VRAM I need to train MPT-7B with different sequence lengths? How many 80GB GPUs are needed to train with max_seq_len set to 80K and 100K?
2) Is there some recommended value to set "PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb" to when training with max_seq_len set to 60K-100K?
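For question 2, this is how the allocator option is typically set: as an environment variable before the first CUDA allocation. The 512 MB value below is only a placeholder to show the syntax, not a recommendation from this thread:

```python
# Set the PyTorch CUDA caching allocator config before any CUDA allocation.
# The 512 value is a placeholder, not a tuned recommendation for 60K-100K
# sequence lengths.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # import (and first CUDA use) only after the variable is set
```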
Hi everyone, thanks a lot for the great library!
Could someone please explain a bit more about how the chat model has been trained? More specifically, I am interested in how the input/output data has been preprocessed.
For example, the instruct model is using a dataset which follows the expected format of the llm-foundry SFT dataloader, where the model is trained to predict only the response part and inputs are formatted as:

Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: <some_text>. ### Response: <some_text>

Is the chat model trained in the same way, where the Human part is used as the Instruction and the Assistant part as the Response, or is Alpaca reformatted to the chat format used in the hf_chat.py script: You are a helpful assistant. User: <some_text>. Assistant: <model_write_here>?

Thanks for your time and answers!
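For anyone following along, here is a small sketch contrasting the two prompt formats being asked about. Both template strings are paraphrased from this thread and the helper names are made up, so treat this as an illustration rather than the actual llm-foundry formatting code:

```python
# Illustration of the two formats discussed above (helper names are made up):
# (1) the instruct-style prompt where only the response is loss-generating, and
# (2) the chat-style prompt used by scripts like hf_chat.py.

INSTRUCT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: {instruction} ### Response: "
)

CHAT_TEMPLATE = "You are a helpful assistant. User: {user} Assistant: "

def instruct_prompt(instruction: str) -> str:
    return INSTRUCT_TEMPLATE.format(instruction=instruction)

def chat_prompt(user: str) -> str:
    return CHAT_TEMPLATE.format(user=user)

print(instruct_prompt("What is 1+1?"))
print(chat_prompt("What is 1+1?"))
```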