kamalkraj / e5-mistral-7b-instruct

Finetune mistral-7b-instruct for sentence embeddings
Apache License 2.0

Am I using the code incorrectly? Help me #6

Open Pang-dachu opened 5 months ago

Pang-dachu commented 5 months ago

I used this code and trained on Korean ko-snil data.

The following files were saved: adapter_config.json, adapter_model.safetensors, special_tokens_map.json, tokenizer_config.json, tokenizer.json, tokenizer.model.

I configured accelerate as shown below and applied the LoRA config as it was published.

CUDA_VISIBLE_DEVICES="1" accelerate launch \
    --config_file ds_zero2_0125.yaml \
    peft_lora_embedding_semantic_search.py \
    --dataset_name similarity_Kodataset \
    --max_length 512 \
    --model_name_or_path "/home/embedding_kim/[Model]/e5-mistral-7b-instruct" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 0.00005 \
    --weight_decay 0.01 \
    --num_train_epochs 4 \
    --max_train_steps 2048 \
    --gradient_accumulation_steps 512 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 128 \
    --output_dir trained_ko_model_0125 \
    --with_tracking \
    --report_to "wandb" \
    --use_peft

I also saw the training loss change. However, when I evaluated the model with my own code on the STS benchmark, the scores for the model published by [intfloat] and for my fine-tuned model were identical, down to the same number of decimal places.

I would like to ask whether your code cannot fine-tune the model published by [intfloat], or whether I am missing something and need to apply an additional step after training for the results to take effect.

(e.g., is there a process to further merge the generated adapter into the model?)

kamalkraj commented 5 months ago

The code can be used to finetune the published intfloat model or the original Mistral model.

To merge the LoRA weights: https://discuss.huggingface.co/t/help-with-merging-lora-weights-back-into-base-model/40968/4
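Roughly following that link, the merge step might look like the sketch below. The paths are placeholders (the adapter directory is assumed to be the --output_dir from the launch command, and the "_merged" directory is made up for illustration); note the later discussion in this thread about loading the base model with the repo's MistralForSequenceEmbedding class rather than AutoModel.

    import torch
    from peft import PeftModel
    from transformers import AutoModel, AutoTokenizer

    # load the base model, then attach the trained adapter
    base = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct", torch_dtype=torch.float16)
    peft_model = PeftModel.from_pretrained(base, "trained_ko_model_0125")

    # fold the LoRA weights into the base weights and drop the adapter modules
    merged = peft_model.merge_and_unload()
    merged.save_pretrained("trained_ko_model_0125_merged")

    # keep the tokenizer next to the merged weights so the directory is self-contained
    tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
    tokenizer.save_pretrained("trained_ko_model_0125_merged")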

Pang-dachu commented 4 months ago

Thank you very much for providing the fine-tuning code.

However, I have been trying to fine-tune using your code for a few days now, and the generated LoRA adapter is not having any effect on the model.

I have tried fine-tuning both 1) intfloat's model and 2) the base model of intfloat's model, but applying the LoRA adapter made no difference.

=== The data used for training is a Korean dataset. The benchmark I use for evaluation is MTEB STS17 (ko-ko).

Am I doing something wrong in my approach or thinking?

bjelkenhed commented 4 months ago

Thank you for providing this code; however, I get the same results as described by Pang-dachu above. The trained LoRA adapters do not seem to have any effect on the output of the model. I have tried merge_and_unload before saving the model and loading it again, but the result is always exactly the same as with the e5-mistral-7b-instruct base model.

Rinatum commented 4 months ago

@Pang-dachu @bjelkenhed I had the same problem with my custom dataset.

The reason was that the accelerate config provided by @kamalkraj is not suitable for my machine. I have only just started working with accelerate, so I don't know which fields in the config are wrong, but I have a universal workaround. Just run:

accelerate launch --mixed_precision="fp16" peft_lora_embedding_semantic_search.py ...

It will use the default accelerate parameters for your machine.

Also, here is a self-check:

                    # print the sum of each LoRA parameter; the lora_B sums should move away from zero as training progresses
                    lora_params = {n: p for n, p in model.named_parameters() if "lora" in n}
                    for n, p in lora_params.items():
                        accelerator.print(n, p.sum())

Paste this into your training loop to check that your lora_B parameters are not zero.

I hope this helps.

Pang-dachu commented 4 months ago

@Rinatum

There are quite a few layers where the LoRA_B values are all zero. How would you recommend approaching and solving this problem?

Rinatum commented 4 months ago

@Pang-dachu

THIS ONE DOESN'T WORK:

from transformers import AutoModel
from peft import PeftModel

model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")

THIS ONE WORKS:

# MistralForSequenceEmbedding is the custom embedding class used by the training script
model = MistralForSequenceEmbedding.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")
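To confirm that an adapter actually changes the model rather than silently doing nothing, here is a rough sanity check. It assumes MistralForSequenceEmbedding is importable from the repo's peft_lora_embedding_semantic_search.py and that the adapter targets the attention projections (e.g. q_proj); "path-to-lora" is a placeholder as above.

    import torch
    from peft import PeftModel
    # assumption: the custom embedding class is defined in the repo's training script
    from peft_lora_embedding_semantic_search import MistralForSequenceEmbedding

    def first_q_proj(model):
        # grab the first q_proj base weight, whatever the wrapper structure looks like
        for name, param in model.named_parameters():
            if "q_proj.weight" in name and "lora" not in name:
                return name, param.detach().clone()
        raise RuntimeError("no q_proj weight found")

    base = MistralForSequenceEmbedding.from_pretrained(
        "intfloat/e5-mistral-7b-instruct", torch_dtype=torch.float16
    )
    name, before = first_q_proj(base)

    peft_model = PeftModel.from_pretrained(base, "path-to-lora")
    merged = peft_model.merge_and_unload()
    _, after = first_q_proj(merged)

    # if lora_B is all zeros, merging changes nothing and this prints False
    print(name, "changed after merge:", not torch.equal(before, after))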

Pang-dachu commented 4 months ago

@Rinatum

Total number of LoRA layers (A+B) : 448
LoRA_A count : 224
LoRA_B count : 224
LoRA_A zero weight count : 0
LoRA_B zero weight count : 224

Pang-dachu commented 4 months ago

@Rinatum

I think I saw a glimpse of hope, but I need to verify it.

I'll try again and report back with the results.

Rinatum commented 4 months ago

@Pang-dachu could you check lora_B weights during training?

Pang-dachu commented 4 months ago

@Rinatum

I've been playing around with this for about two weeks now, so I don't remember exactly what the conditions were.

I probably just used the code provided in this repo as-is. I think I printed the tensors of the LoRA layers during training.

There were cases where all the tensor values in the LoRA_B layers were 0, and I think the learning rate suddenly dropped to 0 as well. (Rather than looking at all the tensors, I think it would be better to use sum() as you suggested.)

=== For now, I am training with the default accelerate settings as you suggested.

However, it takes a long time on a single GPU, so even though the training run and dataset are small, my goal is to get multi-GPU or DeepSpeed working. (It's not easy, but...)

P.S.: I've been struggling with this for about 2 weeks now, so I'm very grateful for this glimmer of hope.

bjelkenhed commented 4 months ago

Thank you @Rinatum for all your suggestions. I am now trying something similar to @Pang-dachu: without DeepSpeed and using MistralForSequenceEmbedding when loading the model, and it looks promising so far. The results differ from those of the base model e5-mistral-7b-instruct, at least for the first time. I am using bitsandbytes QLoRA now and that seems to work fine.

Will have confirmed results tomorrow.
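For reference, the 4-bit bitsandbytes setup mentioned above might look roughly like the sketch below. The exact quantization and LoRA settings used here are not shown in the thread, so the values are illustrative, and in this repo the base model would be loaded with the custom embedding class rather than AutoModel.

    import torch
    from transformers import AutoModel, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    model = AutoModel.from_pretrained(
        "intfloat/e5-mistral-7b-instruct",
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # illustrative LoRA hyperparameters; the repo's published LoRA config may differ
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="FEATURE_EXTRACTION",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()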

Pang-dachu commented 4 months ago

Current progress:

I load the trained model with the MistralForSequenceEmbedding class, call merge_and_unload on it, and save the merged model. I checked that the merged model can then be loaded and used via AutoModel.
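As a minimal check of that last step, the merged directory (placeholder name from the earlier sketch) should load as a plain model:

    from transformers import AutoModel, AutoTokenizer

    # "trained_ko_model_0125_merged" is a placeholder for the merged output directory
    model = AutoModel.from_pretrained("trained_ko_model_0125_merged")
    tokenizer = AutoTokenizer.from_pretrained("trained_ko_model_0125_merged")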

Applying ZeRO-3 failed: when it is used, the LoRA_B layers become 0.

@Rinatum I've been struggling for almost 3 weeks and this has solved a huge problem for me, thank you so much.

@bjelkenhed I'm going to test this in my environment under a few different conditions this week. Can you share any successes or anything else unusual you've run into?

Rinatum commented 4 months ago

@Pang-dachu @bjelkenhed

So nice! I also recommend removing the standard model-saving hooks and accelerator.save_state.

Use this instead:

                    # accelerator.save_state(output_dir)
                    # saving the unwrapped PEFT model writes only the LoRA adapter weights
                    unwrapped_model = accelerator.unwrap_model(model)
                    unwrapped_model.save_pretrained(
                        output_dir,
                        is_main_process=accelerator.is_main_process,
                        save_function=accelerator.save,
                    )

This saves only the LoRA weights, and those weights will not be zero.

bjelkenhed commented 4 months ago

Hi, here are some updates from me.

Without DeepSpeed ZeRO-3 it works much better and no LoRA layers end up all zeros. That makes the training work as expected, and the results differ from the base model as expected. I have H100s with 80 GB and have used bitsandbytes 4-bit so far, but I will try ZeRO-2 without bitsandbytes as well. If you would like to share your ZeRO-2 config, @Pang-dachu, it would be appreciated.

So far I only have approximately 10,000 examples in my training set, and so far the evaluation results are not better than with the base model e5-mistral-7b-instruct in a hit-rate evaluation on an evaluation set resembling the MS MARCO format. What batch size do you use, and how large are your datasets? I am currently using a smaller batch size, but I don't know what would be best given the size of the dataset.
So far I only have approx 10 000 examples in my trainingset and so far the evaluation results are not better than with the base model e5-mistral-7b-instruct in a hitrate evaluation with a evaluationset resembling ms marco format. What batch size do you use and how large is your datasets? I am currently using smaller batch size, but I don't know what would be the best one considering the size of the dataset.