Pang-dachu opened this issue 10 months ago
The code can be used to fine-tune the published intfloat model or the original Mistral model.
To merge LoRA: https://discuss.huggingface.co/t/help-with-merging-lora-weights-back-into-base-model/40968/4
Thank you very much for providing the fine-tuning code.
However, I have been trying to fine-tune using your code for a few days now, and the generated LoRA adapter is not having any effect on the model.
I have tried fine-tuning both 1) intfloat's model and 2) the base model of intfloat's model, but applying the LoRA adapter made no difference.
The data used for training is a Korean dataset. The benchmark I use for evaluation is MTEB STS17 (ko-ko).
Am I doing something wrong in my approach or thinking?
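For reference, here is a rough manual STS-style check, not the MTEB harness itself, just a quick way to see whether two checkpoints score differently on sentence pairs. It is a minimal sketch assuming the model returns last_hidden_state (AutoModel-style outputs) and uses last-token pooling like e5-mistral; the pooling and max length are assumptions you may need to adjust.

import torch
import torch.nn.functional as F
from scipy.stats import spearmanr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
tokenizer.padding_side = "right"                      # keep the last-token indexing below simple
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

@torch.no_grad()
def embed(model, texts):
    device = next(model.parameters()).device
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt").to(device)
    hidden = model(**batch).last_hidden_state
    last = batch["attention_mask"].sum(dim=1) - 1     # index of the last non-pad token
    emb = hidden[torch.arange(hidden.size(0), device=device), last]
    return F.normalize(emb, p=2, dim=-1)

def sts_spearman(model, sentence_pairs, gold_scores):
    # Spearman correlation between cosine similarities and gold STS scores.
    a = embed(model, [s1 for s1, _ in sentence_pairs])
    b = embed(model, [s2 for _, s2 in sentence_pairs])
    sims = (a * b).sum(dim=-1).float().cpu().tolist()
    return spearmanr(sims, gold_scores).correlation

If the fine-tuned checkpoint produces exactly the same correlation as the base model, the adapter is very likely not being applied at all, which is the problem discussed below.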
Thank you for providing this code; however, I get the same results as described by Pang-dachu above. The trained LoRA adapters do not seem to have any effect on the model's output. I have tried merge_and_unload before saving the model and loading it again, but the result is always exactly the same as with the e5-mistral-7b-instruct base model.
@Pang-dachu @bjelkenhed I ran into the same problem with my custom dataset.
The reason was that the accelerate config provided by @kamalkraj is not suitable for my machine. I have only just started working with accelerate, so I don't know which fields in the config are wrong, but I have a universal solution. Just run
accelerate launch --mixed_precision="fp16" peft_lora_embedding_semantic_search.py ...
It will use the default accelerate parameters for your machine.
Also, here is a self-check:
lora_params = {n: p for n, p in model.named_parameters() if "lora" in n}
for n, p in lora_params.items():
    accelerator.print(n, p.sum())
Paste this in your training loop to check that your lora_B parameters are not zero. (PEFT initializes lora_B to zeros, so the sums start at exactly 0 and should move away from 0 once training actually updates them.)
I hope it helps you.
@Rinatum
There are quite a few layers where LoRA_B is all zeros. How would you recommend approaching and solving this problem?
@Pang-dachu
THIS ONE DOESN'T WORK:
model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")
THIS ONE WORKS:
model = MistralForSequenceEmbedding.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")
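A quick way to confirm the difference is to compute one embedding before and after attaching the LoRA weights and check that they actually differ. This is a sketch, assuming the MistralForSequenceEmbedding class can be imported from peft_lora_embedding_semantic_search.py, that its forward() returns the pooled embedding tensor directly, and that "path-to-lora" is a placeholder for your adapter directory.

import torch
from peft import PeftModel
from transformers import AutoTokenizer
from peft_lora_embedding_semantic_search import MistralForSequenceEmbedding

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
base = MistralForSequenceEmbedding.from_pretrained(
    "intfloat/e5-mistral-7b-instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tokenizer("query: a quick sanity-check sentence", return_tensors="pt").to(base.device)

with torch.no_grad():
    emb_base = base(**inputs)                  # embedding from the frozen base weights
    lora = PeftModel.from_pretrained(base, "path-to-lora")
    emb_lora = lora(**inputs)                  # embedding with the adapter attached

# If the adapter loaded correctly, this distance should be clearly non-zero.
print(torch.dist(emb_base.float(), emb_lora.float()))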
@Rinatum
I currently launch accelerate like this (DeepSpeed is not used):
NCCL_P2P_LEVEL=NVL CUDA_VISIBLE_DEVICES="0" accelerate launch \
--mixed_precision="bf16" \
peft_lora_embedding_semantic_search.py \
--dataset_name custom_data_path \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--model_name_or_path local_model_path \
--output_dir output_dir \
--use_peft
I load the model with MistralForSequenceEmbedding.from_pretrained, the class defined in the peft_lora_embedding_semantic_search code. (The only difference is that I load it in bf16.)
I tried training as you suggested, but I still get 0 for LoRA_B:
Total number of LoRA layers (A+B): 448
LoRA_A count: 224
LoRA_B count: 224
LoRA_A zero-weight count: 0
LoRA_B zero-weight count: 224
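For reference, a small helper in the spirit of @Rinatum's self-check that produces counts like the ones above; it assumes the PEFT-wrapped model is in scope as model.

# Count LoRA A/B matrices and how many of them are entirely zero. PEFT initializes
# lora_B to zeros, so a non-zero "all-zero" count *after* training means the B
# matrices were never updated (or were not loaded at all).
def count_zero_lora(model):
    stats = {"lora_A": [0, 0], "lora_B": [0, 0]}   # key -> [total, all-zero]
    for name, param in model.named_parameters():
        for key in stats:
            if key in name:
                stats[key][0] += 1
                if param.detach().float().abs().sum().item() == 0.0:
                    stats[key][1] += 1
    for key, (total, zero) in stats.items():
        print(f"{key}: total={total}, all-zero={zero}")

count_zero_lora(model)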
@Rinatum
I think I saw a glimpse of hope, but I need to verify it.
I'll try again and talk about the results.
@Pang-dachu could you check lora_B weights during training?
@Rinatum
I've been playing around with this for about two weeks now, so I don't remember exactly what the conditions were.
I probably just used the code provided in this GitHub repo. I think I printed out the tensors of the LoRA layers during training.
There were cases where all the tensor values in the LoRA_B layers were 0, and I think the learning rate also suddenly dropped to 0. (Rather than looking at all the tensors, it would be better to check sum() as you suggested.)
For now, as you suggested, I am training with the default accelerate settings.
However, it takes a long time on a single GPU even though the training data is small, so my goal is to get multi-GPU or DeepSpeed working. (It's not easy, but...)
P.S.: I've been struggling with this for about two weeks now, so I'm very grateful for this glimmer of hope.
Thank you @Rinatum for all your suggestions. I am now trying something similar to @Pang-dachu, without DeepSpeed and using MistralForSequenceEmbedding when loading the model, and it looks promising so far. For the first time, the results differ from those of the base model e5-mistral-7b-instruct. I am now using bitsandbytes QLoRA instead, and that seems to work fine.
I will have confirmed results tomorrow.
Current progress:
I load the trained model with the MistralForSequenceEmbedding class, run merge_and_unload, and save the merged model. I confirmed that the merged model can then be loaded with AutoModel.
Applying ZeRO-3 still fails: the LoRA_B layers become 0 when it is used.
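For anyone following along, the merge path described above looks roughly like this. It is a sketch: the paths are placeholders, and it assumes MistralForSequenceEmbedding can be imported from peft_lora_embedding_semantic_search.py.

import torch
from peft import PeftModel
from transformers import AutoTokenizer
from peft_lora_embedding_semantic_search import MistralForSequenceEmbedding

# Load the base weights with the same class used for training, attach the
# trained adapter, then fold the LoRA deltas into the base weights.
base = MistralForSequenceEmbedding.from_pretrained(
    "intfloat/e5-mistral-7b-instruct", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "path-to-lora").merge_and_unload()

# Save a standalone checkpoint; afterwards it can be loaded without PEFT,
# e.g. with AutoModel.from_pretrained("merged-e5-mistral").
merged.save_pretrained("merged-e5-mistral")
AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct").save_pretrained("merged-e5-mistral")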
@Rinatum I've been struggling for almost 3 weeks and this has solved a huge problem for me, thank you so much.
@bjelkenhed I will need to test it out in my environment in a few different situations this week. Can you share any successes or anything else unusual?
@Pang-dachu @bjelkenhed
So nice! I also recommend deleting the standard model-saving hooks and accelerator.save_state.
Use this instead:
# accelerator.save_state(output_dir)
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)
This saves only the LoRA weights, and those weights will not be zero.
I also figured out that multi-GPU training with DeepSpeed depends on your GPU card.
I have an A100, so I can use bf16, but you can try different options for your case.
By the way, I can conclude that the main cause of the zero LoRA weights is using AutoModel.from_pretrained.
In that case, the only correct option is to use the original model class exactly (MistralForSequenceEmbedding).
Hi, here are some updates from me.
Without DeepSpeed ZeRO-3 it works much better and no LoRA layers end up as zeros. Training works as expected and the results differ from the base model, as expected. I have H100s with 80 GB and have used bitsandbytes with 4-bit quantization so far, but I will try ZeRO-2 without bitsandbytes as well. If you would like to share your ZeRO-2 config, @Pang-dachu, it would be appreciated.
So far I only have approximately 10,000 examples in my training set, and the evaluation results are not yet better than the base model e5-mistral-7b-instruct in a hit-rate evaluation on an evaluation set resembling the MS MARCO format. What batch size do you use, and how large are your datasets? I am currently using a smaller batch size, but I don't know what would be best given the size of the dataset.
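In case it is useful for comparing checkpoints, here is a minimal sketch of a hit@k computation over precomputed, L2-normalized embeddings; the tensor names and the gold mapping are assumptions and it is not tied to any particular evaluation framework.

import torch

def hit_at_k(query_emb: torch.Tensor, passage_emb: torch.Tensor, gold, k: int = 10) -> float:
    # query_emb: (num_queries, dim); passage_emb: (num_passages, dim);
    # gold[i] is the index of the relevant passage for query i.
    scores = query_emb @ passage_emb.T              # cosine similarity (embeddings are normalized)
    topk = scores.topk(k, dim=-1).indices           # (num_queries, k)
    hits = sum(int(gold[i] in topk[i]) for i in range(len(gold)))
    return hits / len(gold)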
I used this code and trained with the Korean ko-snli data.
The following files were saved: adapter_config.json, adapter_model.safetensors, special_tokens_map.json, tokenizer_config.json, tokenizer.json, and tokenizer.model.
I configured accelerate as shown below and applied lora.json as it was published.
I also saw the training loss change, but when I evaluated the model with my code against the STS benchmark, the scores for the model published by intfloat and the model I fine-tuned were identical to every decimal place.
I would like to ask whether your code cannot fine-tune the model published by intfloat, or whether I am missing something and need to apply an additional step after training for the results to be reflected.
(e.g., is there a process to merge the generated adapter files into the model?)