huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Mistral loss instability #26498

Closed teknium1 closed 8 months ago

teknium1 commented 11 months ago

System Info

Hello, I've been working with dhokas, who finetuned Mistral's official instruct model. I have been trying to finetune Mistral with several datasets over dozens of ablations. There is severe loss instability when training this model with transformers that never appears in his training runs, which do not use the HF Trainer.

I am opening this so we can get to the bottom of this. Here are some of my runs using axolotl with some datasets.

With hermes 2.0 dataset (unpublished): https://wandb.ai/teknium1/hermes2.0-mistral-7b?workspace=user-teknium1

With Teknium/GPT4-LLM-CLEANED dataset https://wandb.ai/teknium1/gpt4llm-mistral-7b

With a 5-sequences run to ensure loss goes to 0 (that memorization is occurring): https://wandb.ai/teknium1/5seq-mistral-7b?workspace=user-teknium1

With OpenHermes dataset teknium1/openhermes: https://wandb.ai/teknium1/hermes-mistral-7b

As can be seen, the loss charts across all these ablations are erratic, and the runs generally produce bad results no matter what hyperparameters are changed.

The Mistral dev who worked with me trained Mistral on GPT4-LLM-Cleaned and got this result: image

@younesbelkada @muellerz

Who can help?

No response

Information

Tasks

Reproduction

Train Mistral on any of the above datasets with Mistral's own finetuning hyperparameters, as reported in Mistral's Discord, and watch the loss fail to converge.

Expected behavior

A smooth or downward trajectory for the loss.

teknium1 commented 11 months ago

I have tried:

- learning rates of 2e-5, 1e-5, 8e-6, 6e-6, and 4e-6
- flash attention, xformers, and neither
- with and without packing
- weight decay of 0.1 and 0.01
- long, medium, and short warmups (between 0.01% and 80% of total steps)
- the Hermes 2.0, Hermes 1.0 (which has trained fine on LLaMA on several occasions), and GPT4LLM datasets
- FSDP, and DeepSpeed ZeRO-2 & ZeRO-3
- with and without group_by_length
- updated Adam beta and epsilon (adam_beta2: 0.95, adam_epsilon: 0.00001)
- with and without max_grad_norm: 1.0

I've basically run out of hyperparameters to try tuning - several of these on fresh venvs.
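
For reference, here is a minimal sketch of how those knobs map onto transformers TrainingArguments; the values are illustrative picks from the list above, and output_dir, batch size, and epoch count are placeholders, not a recommended recipe.

from transformers import TrainingArguments

# Illustrative values drawn from the settings listed above; not a tuned recipe.
args = TrainingArguments(
    output_dir="mistral-7b-finetune",   # placeholder
    learning_rate=2e-5,                 # also tried 1e-5, 8e-6, 6e-6, 4e-6
    weight_decay=0.1,                   # also tried 0.01
    warmup_ratio=0.03,                  # short/medium/long warmups were all tried
    adam_beta2=0.95,
    adam_epsilon=1e-5,
    max_grad_norm=1.0,                  # also tried without clipping
    group_by_length=False,              # toggled in some runs
    bf16=True,
    per_device_train_batch_size=4,      # placeholder
    num_train_epochs=3,                 # placeholder
)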

Ki6an commented 11 months ago

I have also come across an issue involving an irregular loss curve when finetuning Mistral 7B: unusual_loss

teknium1 commented 11 months ago

For reference some of my loss charts: image image image

akjindal53244 commented 11 months ago

I am facing the same issue; the loss goes up while finetuning on the Dolly-15k dataset.

adarshxs commented 11 months ago

Same for me with the garage-bAInd/Open-Platypus dataset, though mine was extremely weird: image

adamlin120 commented 11 months ago

Continued pre-training on a Chinese/Mandarin corpus: IMG_7827

Optimizer: AdamW, lr: 2.5e-5, warmup: 4%, batch size: 2, sequence length: 1024. Used flash attention from the PR.

adarshxs commented 11 months ago

Continued pre-training on a Chinese/Mandarin corpus: IMG_7827

Optimizer: AdamW, lr: 2.5e-5, warmup: 4%, batch size: 2, sequence length: 1024. Used flash attention from the PR.

Any specific library you are using for the continued pre-training?

adamlin120 commented 11 months ago

Continued pre-training on a Chinese/Mandarin corpus: IMG_7827

Optimizer: AdamW, lr: 2.5e-5, warmup: 4%, batch size: 2, sequence length: 1024. Used flash attention from the PR.

Any specific library you are using for the continued pre-training?

I am using SFTTrainer from TRL. Note that both runs failed: the orange one cannot converge, and the green one dropped to loss = 0.0 but the model actually produced garbage.

adarshxs commented 11 months ago

I am using SFTTrainer from TRL. Note that both runs failed: the orange one cannot converge, and the green one dropped to loss = 0.0 but the model actually produced garbage.

image Same with fine-tuning. The output is pure garbage even with all the standard hyperparams I used for fine-tuning LLaMA.

sparverius commented 11 months ago

With Teknium/GPT4-LLM-CLEANED dataset https://wandb.ai/teknium1/gpt4llm-mistral-7b

With a 5-sequences run to ensure loss goes to 0 (that memorization is occurring): https://wandb.ai/teknium1/5seq-mistral-7b?workspace=user-teknium1

@teknium1 these both 404 😞

teknium1 commented 11 months ago

With Teknium/GPT4-LLM-CLEANED dataset https://wandb.ai/teknium1/gpt4llm-mistral-7b With a 5-sequences run to ensure loss goes to 0 (that memorization is occurring): https://wandb.ai/teknium1/5seq-mistral-7b?workspace=user-teknium1

@teknium1 these both 404 😞

Sorry, my projects default to private; I've made them public now.

bdytx5 commented 11 months ago

How did you load your model?

teknium1 commented 11 months ago

How did you load your model?

with transformers? or do you mean precision?

bdytx5 commented 11 months ago

How did you load your model?

with transformers? or do you mean precision?

I was just wondering if you used one of the HuggingFace AutoModel classes or if you loaded it using the Mistral reference implementation.

teknium1 commented 11 months ago

How did you load your model?

with transformers? or do you mean precision?

I was just wondering if you used one of the HuggingFace AutoModel classes or if you loaded it using the Mistral reference implementation.

MistralForCausalLM

bdytx5 commented 11 months ago

How did you load your model?

with transformers? or do you mean precision?

I was just wondering if you used one of the HuggingFace AutoModel classes or if you loaded it using the Mistral reference implementation.

MistralForCausalLM

I see. I guess one idea to sanity check could be to load the model using the reference implementation and ensure it behaves similarly to the HuggingFace version.

teknium1 commented 11 months ago

How did you load your model?

with transformers? or do you mean precision?

I was just wondering if you used one of the HuggingFace AutoModel classes or if you loaded it using the Mistral reference implementation.

MistralForCausalLM

I see. I guess one idea to sanity check could be to load the model using the reference implementation and ensure it behaves similarly to the HuggingFace version.

Do you mean outside of huggingface/hf trainer? The mistral dev did do this, we have totally different training results when he trains the same dataset, same hyperparams, without hf trainer.

bdytx5 commented 11 months ago

How did you load your model?

with transformers? or do you mean precision?

I was just wondering if you used one of the HuggingFace AutoModel classes or if you loaded it using the Mistral reference implementation.

MistralForCausalLM

I see. I guess one idea to sanity check could be to load the model using the reference implementation and ensure it behaves similarly to the HuggingFace version.

Do you mean outside of huggingface/hf trainer? The mistral dev did do this, we have totally different training results when he trains the same dataset, same hyperparams, without hf trainer.

Yeah, I mean just making sure both models behave similarly for a single forward/backward pass on the same data, without the trainer. If they are the same, then my guess is that probably narrows it down to the Trainer.
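
A minimal sketch of such a single-pass sanity check on the Hugging Face side, assuming the stock mistralai/Mistral-7B-v0.1 checkpoint; the reference-implementation side would need its own forward pass for comparison.

import torch
from transformers import AutoTokenizer, MistralForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = MistralForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

batch = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt").to(model.device)
labels = batch["input_ids"].clone()

# Single forward/backward pass: record the loss and the total gradient norm.
out = model(**batch, labels=labels)
out.loss.backward()
grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None]))
print(f"loss={out.loss.item():.4f} grad_norm={grad_norm.item():.4f}")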

teknium1 commented 11 months ago

Indeed, they are not the same. They are actually completely inverse lol

bdytx5 commented 11 months ago

Indeed, they are not the same. They are actually completely inverse lol

interesting.

Undi95 commented 11 months ago

image

Trying the Pippa-ShareGPT dataset from Hugging Face, the loss is large. https://wandb.ai/undis95/pippa-sharegpt-13b-qlora?workspace=user-undis95 I trained other datasets too, but I don't have screenshots of the loss or the wandb.ai data since I just learned all this. The data and datasets can be seen at the sources below; the original datasets are always linked:

https://huggingface.co/Undi95/Mistral-pippa-sharegpt-7b-qlora https://huggingface.co/Undi95/Mistral-7B-smoll_pippa-lora https://huggingface.co/Undi95/Mistral-7B-roleplay_alpaca-lora

The results are not what I expected, and I can't find a way to train properly.

bdytx5 commented 11 months ago

I made a script that compares the last hidden state embeddings of both

Sampled values from Mistral embedding:
[[-1.635   0.4966 -1.647 ]
 [ 0.1438  0.2181  0.0925]
 [ 0.2527  0.8457  0.8496]
 [ 0.1675  0.07324 1.037 ]
 [ 0.881  -0.614   0.1123]]
Sampled values from Hugging Face embedding:
[[-1.7     0.5347 -1.733 ]
 [ 1.075   1.69    0.7036]
 [ 1.983   6.86    6.73  ]
 [ 1.353   0.615   8.5   ]
 [ 9.23   -6.65    1.188 ]]
Embedding difference (L2 norm): inf

see comparison script at https://github.com/bdytx5/mistral7B_finetune/blob/main/train/dev/cmp_models.py

also, you will have to add

def get_last_hidden_state(
    self,
    input_ids: torch.Tensor,
    cache: RotatingBufferCache,
    seqlens: List[int],
) -> torch.Tensor:
    assert len(seqlens) <= self.args.max_batch_size, f"Max batch size is {self.args.max_batch_size}, got batch size of {len(seqlens)}"
    assert sum(seqlens) == input_ids.shape[0], (sum(seqlens), input_ids.shape[0])

    input_metadata = cache.get_input_metadata(seqlens)
    h = self.tok_embeddings(input_ids)
    freqs_cis = self.freqs_cis[input_metadata.positions]

    for layer_id, layer in enumerate(self.layers):
        h = layer(h, freqs_cis, cache.get_view(layer_id, input_metadata))

    cache.update_seqlens(seqlens)

    return h  # Return the embeddings before the output layer.        

into the 'transformer' class of the reference implementation

teknium1 commented 11 months ago

I made a script that compares the last hidden state embeddings of both

Sampled values from Mistral embedding: [[-1.635 0.4966 -1.647 ] [ 0.1438 0.2181 0.0925 ] [ 0.2527 0.8457 0.8496 ] [ 0.1675 0.07324 1.037 ] [ 0.881 -0.614 0.1123 ]] Sampled values from Hugging Face embedding: [[-1.7 0.5347 -1.733 ] [ 1.075 1.69 0.7036] [ 1.983 6.86 6.73 ] [ 1.353 0.615 8.5 ] [ 9.23 -6.65 1.188 ]] Embedding difference (L2 norm): inf

see comparison script at https://github.com/bdytx5/mistral7B_finetune/blob/main/train/dev/cmp_models.py

also, you will have to add

def get_last_hidden_state(
    self,
    input_ids: torch.Tensor,
    cache: RotatingBufferCache,
    seqlens: List[int],
) -> torch.Tensor:
    assert len(seqlens) <= self.args.max_batch_size, f"Max batch size is {self.args.max_batch_size}, got batch size of {len(seqlens)}"
    assert sum(seqlens) == input_ids.shape[0], (sum(seqlens), input_ids.shape[0])

    input_metadata = cache.get_input_metadata(seqlens)
    h = self.tok_embeddings(input_ids)
    freqs_cis = self.freqs_cis[input_metadata.positions]

    for layer_id, layer in enumerate(self.layers):
        h = layer(h, freqs_cis, cache.get_view(layer_id, input_metadata))

    cache.update_seqlens(seqlens)

    return h  # Return the embeddings before the output layer.        

into the 'transformer' class of the reference implementation

So is this the cause of the loss issues or just a cleaner more proper implementation?

bdytx5 commented 11 months ago

I made a script that compares the last hidden state embeddings of both Sampled values from Mistral embedding: [[-1.635 0.4966 -1.647 ] [ 0.1438 0.2181 0.0925 ] [ 0.2527 0.8457 0.8496 ] [ 0.1675 0.07324 1.037 ] [ 0.881 -0.614 0.1123 ]] Sampled values from Hugging Face embedding: [[-1.7 0.5347 -1.733 ] [ 1.075 1.69 0.7036] [ 1.983 6.86 6.73 ] [ 1.353 0.615 8.5 ] [ 9.23 -6.65 1.188 ]] Embedding difference (L2 norm): inf see comparison script at https://github.com/bdytx5/mistral7B_finetune/blob/main/train/dev/cmp_models.py also, you will have to add

def get_last_hidden_state(
    self,
    input_ids: torch.Tensor,
    cache: RotatingBufferCache,
    seqlens: List[int],
) -> torch.Tensor:
    assert len(seqlens) <= self.args.max_batch_size, f"Max batch size is {self.args.max_batch_size}, got batch size of {len(seqlens)}"
    assert sum(seqlens) == input_ids.shape[0], (sum(seqlens), input_ids.shape[0])

    input_metadata = cache.get_input_metadata(seqlens)
    h = self.tok_embeddings(input_ids)
    freqs_cis = self.freqs_cis[input_metadata.positions]

    for layer_id, layer in enumerate(self.layers):
        h = layer(h, freqs_cis, cache.get_view(layer_id, input_metadata))

    cache.update_seqlens(seqlens)

    return h  # Return the embeddings before the output layer.        

into the 'transformer' class of the reference implementation

So is this the cause of the loss issues or just a cleaner more proper implementation?

It's definitely possible that a difference in initial weights is causing the strange training behavior. I might try using the official weights and converting it with their script to make sure the weights on huggingface are the same as the official weights.

One thing I have noticed is that the config class for the model defaults to "rms_norm_eps": 1e-06, whereas the config used on the Hugging Face hub uses 1e-05. I'm not sure if this matters, but I might try converting the weights to make sure that they were originally converted using the right config. You can find the default config here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/configuration_mistral.py
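
If anyone wants to test the epsilon hypothesis directly, config values can be overridden at load time; a minimal sketch, assuming the hub checkpoint (the 1e-6 value is the one under suspicion, not a verified fix):

from transformers import MistralConfig, MistralForCausalLM

# Load the hub checkpoint but force the epsilon used by the reference implementation.
config = MistralConfig.from_pretrained("mistralai/Mistral-7B-v0.1", rms_norm_eps=1e-6)
model = MistralForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", config=config)
print(model.config.rms_norm_eps)  # 1e-06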

bdytx5 commented 11 months ago

To follow up, Tek, after looking a little closer at the final-layer embeddings:

Sampled values from Mistral embedding:
[[-1.635   0.4966  -1.647   2.324  -0.1011 ]
 [ 0.1438  0.2181   0.0925 -1.136   0.2788 ]
 [ 0.2527  0.8457   0.8496 -0.4353 -0.3838 ]
 [ 0.1675  0.07324  1.037  -1.225   0.158  ]
 [ 0.881  -0.614    0.1123 -1.201   0.2915 ]]
Sampled values from Hugging Face embedding:
[[-1.706   0.593   -2.016   2.396  -0.05334]
 [ 2.277   0.762    0.0974 -8.88    3.088  ]
 [ 2.75    5.703    6.695  -4.22   -2.928  ]
 [ 1.782  -0.5884   8.914  -9.2     1.583  ]
 [ 7.8    -5.42     1.145  -9.29    4.605  ]]
Embedding difference (L2 norm): inf

The Hugging Face outputs seem quite large compared to the official ones, which does seem suspicious...

younesbelkada commented 11 months ago

Hi @teknium1 @bdytx5

Reading through the thread and the options you have tried, I first suspected that the issue might come from the new sliding-window causal mask. On my end I have tried to finetune Mistral-7B using QLoRA with two different approaches:

1. Using the vanilla causal mask
2. Using the sliding-window attention mask

I have fine-tuned the 7B using QLoRA with this script, using a context length of 512 and a sliding window size of 256 to make sure the sliding-window mask behaves correctly: https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da (with model_id changed to Mistral 7B, and with packing). Here is the behaviour of the losses:

Screenshot 2023-10-03 at 13 52 24

Despite the model not converging as "nicely" as the ideal loss curve you shared, it manages to produce generations that are coherent with the Guanaco dataset:

# input: ### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant:

>>> '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: Monopsony is a market structure where there is only one buyer of a good or service. In the context of the labour market, a monopsony occurs when there is only one employer in a particular industry or region. This can happen for a variety of reasons, such as government regulation, natural monopolies, or the existence of a single large firm that dominates the market.\n\nThe concept of monopsony in the labour market has gained increasing attention in recent years'

Model weights here: https://huggingface.co/ybelkada/mistral-7b-guanaco

What @bdytx5 said makes sense; there might be some differences between the original model's logits and ours. Indeed, the HF version uses 1e-5: https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json#L16 whereas Mistral uses 1e-6: https://github.com/mistralai/mistral-src/blob/main/mistral/model.py#L129

@teknium1 can you try to run a training with this version of the model instead: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/35 just pass revision="refs/pr/35" when calling from_pretrained
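
For completeness, a minimal sketch of loading that revision (the revision string comes from the hub discussion linked above):

from transformers import AutoTokenizer, MistralForCausalLM

model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    revision="refs/pr/35",  # weights/config from the linked hub PR
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", revision="refs/pr/35")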

danieldk commented 11 months ago

Reading through the thread and the options you have tried I suspected that the issue might come from the new window causal mask

I haven't looked into much detail yet, but the mask seems to unconditionally attend to cached key/values. Shouldn't the sliding window apply to cached key/values as well?

https://github.com/huggingface/transformers/blob/ae9a344cce52ff244f721425f660b55ebc522b88/src/transformers/models/mistral/modeling_mistral.py#L92

(In the case of generating a batch of single tokens at a time, there is also https://github.com/huggingface/transformers/blob/ae9a344cce52ff244f721425f660b55ebc522b88/src/transformers/models/mistral/modeling_mistral.py#L795C30-L795C30, which skips applying the window to the k/v cache.)
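
For readers following along, a minimal sketch of what a sliding-window causal mask looks like for the in-sequence case (the question above is whether the same window should also exclude old cached key/value positions); the window semantics assumed here are "each position attends to itself and the previous window - 1 positions":

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True means query position i may attend to key position j.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(5, 3).int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [0, 1, 1, 1, 0],
#         [0, 0, 1, 1, 1]])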

teknium1 commented 11 months ago

Hi @teknium1 @bdytx5

Reading through the thread and the options you have tried, I first suspected that the issue might come from the new sliding-window causal mask. On my end I have tried to finetune Mistral-7B using QLoRA with two different approaches:

1. Using the vanilla causal mask
2. Using the sliding-window attention mask

I have fine-tuned the 7B using QLoRA with this script, using a context length of 512 and a sliding window size of 256 to make sure the sliding-window mask behaves correctly: https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da (with model_id changed to Mistral 7B, and with packing). Here is the behaviour of the losses:

Screenshot 2023-10-03 at 13 52 24

Despite the model not converging as "nicely" as the ideal loss curve you shared, it manages to produce generations that are coherent with the Guanaco dataset:

# input: ### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant:

>>> '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: Monopsony is a market structure where there is only one buyer of a good or service. In the context of the labour market, a monopsony occurs when there is only one employer in a particular industry or region. This can happen for a variety of reasons, such as government regulation, natural monopolies, or the existence of a single large firm that dominates the market.\n\nThe concept of monopsony in the labour market has gained increasing attention in recent years'

Model weights here: https://huggingface.co/ybelkada/mistral-7b-guanaco

What @bdytx5 said makes sense; there might be some differences between the original model's logits and ours. Indeed, the HF version uses 1e-5: https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json#L16 whereas Mistral uses 1e-6: https://github.com/mistralai/mistral-src/blob/main/mistral/model.py#L129

@teknium1 can you try to run a training with this version of the model instead: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/35 just pass revision="refs/pr/35" when calling from_pretrained

Next time I try a full finetune I will. I actually did succeed at training the airoboros dataset over Mistral 7B with a QLoRA, leading me to one of two conclusions:

Either one (or more) of the datasets for Hermes 2.0 is malformed, or QLoRA is the only way to get the reliable training/good loss curves that I want at the moment. Will try the revision on my next full finetune.

teknium1 commented 11 months ago

On a side note about Mistral, @younesbelkada,

When I run inference with Mistral 7B on a 4090, with just a 2k max sequence length, it uses >24 GB of VRAM. It hits 23.3 GB of VRAM used, then starts offloading to the CPU.

image

The code I run to make this happen:

import torch#, json, os, sys
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import LlamaTokenizer, LlamaForCausalLM, MistralForCausalLM
#import bitsandbytes

tokenizer = LlamaTokenizer.from_pretrained('./collectivecognition-run6', trust_remote_code=True)
model = MistralForCausalLM.from_pretrained(
    "./collectivecognition-run6",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    load_in_8bit=False
    #trust_remote_code=True
)
benchmarks = [
    "Hello, tell me about the history of the United States",
    "Roleplay as a scientist, who just discovered artificial general intelligence. What do you think about this discovery? What possibilities are there now?"]

index = 0
for obj in benchmarks:

    index += 1
    if index < 1:
        continue
    else:
        start_time = time.time()  # Start timing
        prompt = f"USER:\n{obj}\n\nASSISTANT:\n"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
        generated_ids = model.generate(input_ids, max_new_tokens=2048, temperature=None)#, do_sample=True, eos_token_id=tokenizer.eos_token_id)
        response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True, clean_up_tokenization_space=True)
        print(f"Response  {index}: {response}")

        end_time = time.time()  # End timing
        elapsed_time = end_time - start_time  # Calculate time taken for the iteration
        print(f"Time taken for Response {index}: {elapsed_time:.4f} seconds")
        print(f"tokens total: {len(tokenizer.encode(response))}")
younesbelkada commented 11 months ago

@teknium1 I believe that is because the vanilla implementation we currently have in transformers does not allow cache slicing as in the original repository. To benefit from a fixed-size cache and memory-efficient generation, you can use the Flash Attention 2 version of the model:

import torch#, json, os, sys
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import LlamaTokenizer, LlamaForCausalLM, MistralForCausalLM
#import bitsandbytes

tokenizer = LlamaTokenizer.from_pretrained('./collectivecognition-run6', trust_remote_code=True)
model = MistralForCausalLM.from_pretrained(
    "./collectivecognition-run6",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    use_flash_attention_2=True
)
benchmarks = [
    "Hello, tell me about the history of the United States",
    "Roleplay as a scientist, who just discovered artificial general intelligence. What do you think about this discovery? What possibilities are there now?"]

index = 0
for obj in benchmarks:

    index += 1
    if index < 1:
        continue
    else:
        start_time = time.time()  # Start timing
        prompt = f"USER:\n{obj}\n\nASSISTANT:\n"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
        generated_ids = model.generate(input_ids, max_new_tokens=2048, temperature=None)#, do_sample=True, eos_token_id=tokenizer.eos_token_id)
        response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True, clean_up_tokenization_space=True)
        print(f"Response  {index}: {response}")

        end_time = time.time()  # End timing
        elapsed_time = end_time - start_time  # Calculate time taken for the iteration
        print(f"Time taken for Response {index}: {elapsed_time:.4f} seconds")
        print(f"tokens total: {len(tokenizer.encode(response))}")

Check the results of my benchmark here: https://github.com/huggingface/transformers/pull/26464#issuecomment-1743273513

younesbelkada commented 11 months ago

@teknium1 for full fine-tuning with DeepSpeed, how do you create the packed dataset? Do you use the SFTTrainer with packing=True? See this PR from @lewtun for reference: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/26 and https://twitter.com/jon_durbin/status/1709147204915523929?s=20. If you use the SFTTrainer, the EOS token is correctly added at the end of each chunk: https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L577 but if you pre-tokenize your dataset manually, I think you will never get any EOS token properly encoded.
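
For anyone comparing setups, a minimal sketch of the SFTTrainer packing path being described, as of the TRL version current at the time of this thread; the dataset is just an example with a plain "text" column, and max_seq_length is a placeholder:

from datasets import load_dataset
from transformers import AutoTokenizer, MistralForCausalLM
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = MistralForCausalLM.from_pretrained(model_id)

# Example dataset with a plain "text" column; swap in your own data.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    packing=True,          # TRL packs chunks and appends the EOS token itself
    max_seq_length=2048,   # placeholder
)
trainer.train()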

teknium1 commented 11 months ago

@teknium1 for full fine-tuning with DeepSpeed, how do you create the packed dataset? Do you use the SFTTrainer with packing=True? See this PR from @lewtun for reference: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/26 and https://twitter.com/jon_durbin/status/1709147204915523929?s=20. If you use the SFTTrainer, the EOS token is correctly added at the end of each chunk: https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L577 but if you pre-tokenize your dataset manually, I think you will never get any EOS token properly encoded.

I use Axolotl, but @winglian would have to explain the implementation. However, when I print the tokenized dataset with axolotl, it looks fine. Also, I have done several QLoRAs with axolotl on Mistral now, with new datasets, and so far they are turning out perfect - like, astounding - so it's either dataset-specific (all 3 of the datasets I tried full finetunes on), or the issue only affects full finetunes.

lewtun commented 11 months ago

I took a look at tuning Mistral 7B with TRL's SFTTrainer and DeepSpeed ZeRO-3 on a subset of the UltraChat dataset and the loss seems to converge as expected:

Screenshot 2023-10-05 at 13 09 39

Here's a gist of the tweaks I made to the TRL example in case it's useful to others: https://gist.github.com/lewtun/b9d46e00292d9ecdd6fd9628d53c2814

Overall, I think the divergences some people are reporting could be due to dataset issues (e.g. how you format the chat template) and/or choice of hyperparameters. As far as I can tell, there is no issue in the SFTTrainer or Trainer from transformers.
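
If the chat-template angle is what you want to rule out, one quick check (with a recent enough transformers release) is to render the exact string the model sees; a minimal sketch with illustrative messages:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is a monopsony?"},
    {"role": "assistant", "content": "A market with a single buyer."},
]

# Print the formatted prompt so template/formatting bugs become visible.
print(tokenizer.apply_chat_template(messages, tokenize=False))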

vince62s commented 11 months ago

@younesbelkada

I believe that is because the vanilla implementation we currently have in transformers does not allow cache slicing as in the original repository. To benefit from a fixed-size cache and memory-efficient generation, you can use the Flash Attention 2 version of the model

Indeed, there are two mechanisms in the original Mistral repo: 1) sliding-window attention, and 2) the rolling buffer cache.

I have the impression that in HF you implemented only the sliding-window attention, by acting only on the attention mask and ONLY at training time, which means that at inference the full length is taken into account. Am I correct?

teknium1 commented 11 months ago

I trained several models successfully with QLoRA, all of them on datasets that make up Hermes 2.0 - all but one turned out excellent in terms of the loss graph. I removed that bad dataset and have cleaned up Hermes 2.0 dramatically since then, and today I return to full finetuning:

This is with lr 2e-5 and 300 warmup steps: image

teknium1 commented 11 months ago

Same dataset as a qlora, working perfectly fine: image

teknium1 commented 11 months ago

An additional data point: @winglian (caseus) said today: "I had posted this the other day. There is something slightly amiss with Mistral finetunes; my hunch is it's a transformers issue somewhere. Vastly different LRs (6.5x), but warmup steps set so they followed the same LR trajectory. One would expect that with the same LR at the same step, the loss and gradient should be identical?"

image

teknium1 commented 11 months ago

4e-6 LR full-finetune run, still a nope: image

bdytx5 commented 11 months ago

@teknium1 were you able to try the lower rms_norm_eps?

teknium1 commented 11 months ago

@teknium1 can you try to run a training with this version of the model instead: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/35 just pass revision="refs/pr/35" when calling from_pretrained

Will attempt to do so in about 30 mins

teknium1 commented 11 months ago

I am having what appears to be a potentially successful run... at 1e-6 LR... so I won't try the revision yet, to let this one play out for now. But I've never seen that low an LR for finetuning a model before. Will try the revision if this does end up spiking: image

teknium1 commented 11 months ago

welp.. image

will try the new revision .. lol

teknium1 commented 11 months ago

Okay, so the revision didn't help either. All the runs starting with higher loss are with the revision: one with lr 4e-6, several with 1e-6, one with gradient clipping 1.0, another with 0.3, and one with weight decay 20% (vs 0.3% in all the others). image

The next image shows a full finetune of LLaMA-2 13B going flawlessly: image

And here is the QLoRA on Mistral with the same dataset as well: image

trannhatquy commented 11 months ago

@teknium1 I don't use flash attention and I set tokenizer.padding_side = "right", and my loss is OK. But using flash attention with tokenizer.padding_side = "left" causes the loss instability. I think you should check this (maybe the loss instability comes from the Mistral model's flash attention code combined with tokenizer.padding_side = "left"); see the sketch after the plots below.

Here are some experiments I did with tokenizer.padding_side = "right" and no flash attention:

  1. Qlora (4 bit): image

  2. LoRA (no 4 bit, no 8 bit): image
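
A minimal sketch of the right-padding setup described above; the eos-as-pad choice is a common workaround for Mistral's missing pad token, not something this thread prescribes:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.padding_side = "right"           # right padding for training, as suggested above
tokenizer.pad_token = tokenizer.eos_token  # assumed workaround: Mistral ships no pad token

batch = tokenizer(
    ["short example", "a slightly longer example sentence"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # zeros appear on the right of the shorter sequence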

teknium1 commented 11 months ago

Well friends, it seems it is not mistral specific... :( image

My dataset truly looks pristine to me; I cannot find any systemic or widespread issues in it. I don't know why it fails. You would think that if it were the dataset, QLoRA would likely be unstable as well, but maybe not. Maybe it's a change made in axolotl, but many people are training models right now with the current main branch, and I've tried with their configs/hyperparams. I'm at a loss πŸ€·β€β™‚οΈ

arthurmensch commented 11 months ago

That's quite curious. Did you try shuffling the dataset? It looks like there may be some overfitting occurring at the beginning.

teknium1 commented 11 months ago

That's quite curious. Did you try shuffling the dataset? It looks like there may be some overfitting occurring at the beginning.

The dataset is shuffled automatically with axolotl. I have used the same shuffle seed in all runs though.

I'm willing to grant access to the dataset itself if anyone thinks they may find something that I and several others who've looked at it have not, if interested.

trannhatquy commented 11 months ago

@teknium1 here is my loss with full finetuning. I think the loss is decreasing as it should (although it increases at the beginning, after a few steps it decreases): image

teknium1 commented 11 months ago

Well, @winglian ended up running a finetune over Mistral with the Hermes 2 dataset (well, it's running at the moment), and this is the loss chart now: image

It looks... good. Why? I don't know. The only differences between what he is doing and what I have done are that he is using DeepSpeed and that he has set it up for ChatML format instead of the traditional ShareGPT/Vicuna/FastChat format. As far as I can tell, those are the only differences. Will focus future ablations on those two factors.

vince62s commented 11 months ago

What @bdytx5 said makes sense; there might be some differences between the original model's logits and ours. Indeed, the HF version uses 1e-5: https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json#L16 whereas Mistral uses 1e-6: https://github.com/mistralai/mistral-src/blob/main/mistral/model.py#L129

I think this is wrong; Mistral uses 1e-5 because it reads params.json, which has 1e-5.