Am I using the code incorrectly? help me

Pang-dachu commented 10 months ago

I used this code and trained with Korean ko-snil data.

adapter_config.json, adapter_model.safetensors, special_tokens_map.json, tokenizer_config.json, tokenizer.json, tokenizer.model

5 files were saved.

I configured accelerate as shown below. I applied rola.json as it was published.

CUDA_VISIBLE_DEVICES="1" accelerate launch \
    --config_file ds_zero2_0125.yaml \
    peft_lora_embedding_semantic_search.py \
    --dataset_name similarity_Kodataset \
    --max_length 512 \
    --model_name_or_path "/home/embedding_kim/[Model]/e5-mistral-7b-instruct" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 0.00005 \
    --weight_decay 0.01 \
    --num_train_epochs 4 \
    --max_train_steps 2048 \
    --gradient_accumulation_steps 512 \
    --lr_scheduler_type cosine\
    --num_warmup_steps 128 \
    --output_dir trained_ko_model_0125 \
    --with_tracking \
    --report_to "wandb" \
    --use_peft

I also saw a change in learning loss, and when I evaluated the model with my code against the STS BenchMark The benchmark scores for the model published by [intfloat] and the model I fine-tuned are the same to the same number of decimal places.

I would like to ask if your code cannot fine-tune the model published by [intfloat], or if I am missing something and need to apply additional measures after training for the results to be reflected.

(ex.. Is there a process to merge the generated adapter.json into the model further...?)

kamalkraj commented 10 months ago

The code can be used to finetune the published infloat model or the original mistral model.

To merge lora https://discuss.huggingface.co/t/help-with-merging-lora-weights-back-into-base-model/40968/4

Pang-dachu commented 10 months ago

Thank you very much for providing the fine-tuning code.

However, I have been trying to fine-tune using your code for a few days now, and the generated LoRA adapter is not having any effect on the model.

I have tried fine-tuning both 1) intfloat's model and 2) the base model of intfloat's model, but applying the LoRA adapter made no difference.

=== The data used for training is Korean dataset The benchmark I use for evaluation is MTEB STS17 (ko-ko).

Am I doing something wrong in my approach or thinking ?

bjelkenhed commented 10 months ago

Thank you for providing this code, however I get the same results that is described by Pang-dachu above. The trained LoRA adapters does not seem to have any effect on the response from the model. Have tried with merge and unload before saving the model and loading it again but the result is alway exactly the same as with the e5-instruct-7b basemodel.

Rinatum commented 10 months ago

@Pang-dachu @bjelkenhed I got the same problem for my custom dataset

the reason was that accelerate config provided by @kamalkraj is not suitable for my machine. I only start to work with accelerate so I don't know which fields in the config are wrong but I have universal solution. Just run

accelerate launch --mixed_precision="fp16" peft_lora_embedding_semantic_search.py ...

It will be use default accelerate parameters for your machine.

Also the here is selfcheck :

                    lora_params = {n: p for n, p in model.named_parameters() if "lora" in n}
                    for n, p in lora_params.items():
                        accelerator.print(n, p.sum())

past this in your trainloop to check that your Lora_B parameters are not zero

I hope it will be help you

Pang-dachu commented 10 months ago

@Rinatum

There are quite a few floors with zero values of LoRA_B. How would you recommend approaching and solving this problem?

ps : I noticed it today as well, and it seems that in the learning process, lr goes to zero when the LoRA_B layer goes to zero.

Rinatum commented 10 months ago

@Pang-dachu

try to train with default paramters of accelerate (one machine, one gpu, fp16)
be careful when you will load learned lora

THIS ONE DOESN'T WORK:

model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")

THIS ONE WORK:

model = MistralForSequenceEmbedding.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")

Pang-dachu commented 10 months ago

@Rinatum

I currently have accelerate configured like this: (deepspeed not used)

NCCL_P2P_LEVEL=NVL CUDA_VISIBLE_DEVICES="0" accelerate launch \
--mixed_precision="bf16" \
peft_lora_embedding_semantic_search.py \
--dataset_name custom_data_path \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--model_name_or_path local_model_path\ 
--output_dir output_dir \
--use_peft

I proceeded with the class "MistralForSequenceEmbedding.from_pretrained", which is also defined in the peft_lora_embedding_semantic_search code. (The difference is that it is imported into bf16)
I tried to train it as you suggested, but I got the same result of 0 for LoRA_B.

Total number of LoRA layers (A+B) : 448
LoRA_A count : 224
LoRA_B count : 224
LoRA_A zero weight count : 0
LoRA_B zero weight Count : 224

Pang-dachu commented 10 months ago

@Rinatum

I think I saw a glimpse of hope, but I need to verify it.

I'll try again and talk about the results.

Rinatum commented 10 months ago

@Pang-dachu could you check lora_B weights during training?

Pang-dachu commented 9 months ago

@Rinatum

I've been playing around with this for about two weeks now, so I don't remember exactly what the conditions were.

I probably just used the code provided in this GitHub. I think I took a picture of the tensors of the LoRA layer during the learning process.

There were cases where all the tensor values in the LoRA_B layer were 0, and I think the lr value suddenly changed to 0. (Rather than looking at all the tensors, I think it would be better to apply sum() as you suggested).

=== For now, as you suggested

Using a single GPU
Not using DeepSpeed
LoRA merge using Mistral Embedding Classes I've noticed a performance change in the benchmarks with custom data when using the above conditions.

However, it takes a long time using a single GPU, so my goal is to apply multi-GPU or DeepSpeed since the training and data size is small. (It's not easy, but...)

p.s : I've been struggling for about 2 weeks now, and I'm so grateful for the hope.

bjelkenhed commented 9 months ago

Thank you @Rinatum for all your suggestions. Trying now with something similar as @Pang-dachu without Deepspeed and using the MistralForSequenceEmbedding when loading the model and it looks promising so far. The results differ from using the base model e5-mistral-7b-instruct at least for the first time. Using bitsandbytes QLora instead now and that seems to work fine.

Will have confirmed results tomorrow.

Pang-dachu commented 9 months ago

current progress.

Using accelerator
LoRA merge using Mistral Embedding Classes
Applying ZeRO-2 for multi-GPU training
Loading a model in bfloat-16 format from training to LoRA merge

Load the model using Mistral Embedding Classes for merge_and_unload for the trained model, and use Mistral Embedding Classes for the merged model. For the merged model, I checked that it can be used by importing it into AutoModel.

ZeRO-3 application failed and LoRA-B layer becomes 0 when applied.

@Rinatum I've been struggling for almost 3 weeks and this has solved a huge problem for me, thank you so much.

@bjelkenhed I'm going to need to test it out in my environment for a few different situations this week. Can you share any successes or anything else unusual ?

Rinatum commented 9 months ago

@Pang-dachu @bjelkenhed

So nice ! I Also recommend you to delete standart model saving hooks and accelerator.save_state

Use this one instead of:

                    # accelerator.save_state(output_dir)
                    unwrapped_model = accelerator.unwrap_model(model)
                    unwrapped_model.save_pretrained(
                        output_dir,
                        is_main_process=accelerator.is_main_process,
                        save_function=accelerator.save,
                    )

It allows you to save only lora weigths and this weigths will be not zero

Also i figure out that multi-gpu training using deepspeed depends on you gpu card

I have A100 so I can use bf16 but you can try different options for you case

By the way, I can conclude that the main problem of zero lora weights is using AutoModel.from_pretrained

In this case there is only one correct option is to use original model class exactly (MistralForSequenceEmbedding)

bjelkenhed commented 9 months ago

Hi, here are some updates from me.

Without Deepspeed ZeRO-3 it works much better and no LoRA layers get zeros. That makes the training work as expected and the results differ from the base model as excected. I have H100s with 80GB and have used Bitsandbytes with 4bits so far, but will try with ZeRO-2 without Bitsandbytes as well. If you would like to share your config for ZeRO-2 @Pang-dachu it would be appreciated.

So far I only have approx 10 000 examples in my trainingset and so far the evaluation results are not better than with the base model e5-mistral-7b-instruct in a hitrate evaluation with a evaluationset resembling ms marco format. What batch size do you use and how large is your datasets? I am currently using smaller batch size, but I don't know what would be the best one considering the size of the dataset.

kamalkraj / e5-mistral-7b-instruct

Am I using the code incorrectly? help me #6