microsoft / Phi-3CookBook

This is a Phi-3 book for getting started with Phi-3. Phi-3 is a family of open-source AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.
MIT License

Reproducibility issue for finetuning Phi3 Vision on DocVQA dataset #121

Open qwedaq opened 3 months ago

qwedaq commented 3 months ago

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

I used the following command to finetune Phi-3 Vision using LoRA:

CUDA_VISIBLE_DEVICES=6 python3 finetune_hf_trainer_docvqa.py --full_train --use_lora --bf16 --lora_rank=32 --lora_alpha_ratio=16 --batch_size=64 --learning_rate=2e-4 --num_train_epochs=2 --freeze_vision_model

Any log messages given by the failure

Screenshot (252)

Expected/desired behavior

The ANLS reported in the README after finetuning is 82.46, but the ANLS I got is 75.68. In fact, the ANLS score before finetuning is 77.02, so finetuning made the result worse.

OS and Version?

Linux with CUDA 12.2

leestott commented 3 months ago

@qwedaq sorry, can you confirm which sample you're running from the cookbook?

qwedaq commented 3 months ago

Hi @leestott, I am running the following script for Phi-3 Vision from the cookbook:

https://github.com/microsoft/Phi-3CookBook/blob/main/code/04.Finetuning/vision_finetuning/finetune_hf_trainer_docvqa.py

leestott commented 3 months ago

@ChenRocks, could you please look into this with your finetuning sample?

ChenRocks commented 3 months ago

Hi @qwedaq, thanks for reporting your results. Note that all deep learning training has inherent randomness; therefore, it is possible that a re-run results in a slight accuracy difference.

However, in your case, the drop is significant. The cause is the --lora_alpha_ratio=16 hyperparameter. To set lora_alpha to 16, the correct flag is --lora_alpha_ratio=0.5. See this line.

I know this may not be obvious for users. I will improve the document later. Thanks!
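
A minimal sketch of the arithmetic behind this fix. It assumes, as the comment above implies but without checking the script itself, that the effective `lora_alpha` is derived as `lora_rank * lora_alpha_ratio`. The helper name `effective_lora_alpha` is hypothetical and is not part of the cookbook script.

```python
def effective_lora_alpha(lora_rank: int, lora_alpha_ratio: float) -> int:
    """Assumed relationship: lora_alpha = lora_rank * lora_alpha_ratio."""
    return round(lora_rank * lora_alpha_ratio)


# Flags from the original report: --lora_rank=32 --lora_alpha_ratio=16
reported = effective_lora_alpha(32, 16)

# Flags from the suggested fix: --lora_rank=32 --lora_alpha_ratio=0.5
suggested = effective_lora_alpha(32, 0.5)

print(reported, suggested)  # 512 16
```

Under this assumption, the original command trains with `lora_alpha` of 512 rather than 16. Since standard LoRA scales the adapter update by `lora_alpha / r`, that would be a scaling factor of 16 instead of 0.5, which could plausibly account for the accuracy drop.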

qwedaq commented 3 months ago

This is working now. I am able to reproduce the results. Thank you

qwedaq commented 2 months ago

I just had a quick question related to the same code. Why does Phi3V report its final results using the ANLS metric rather than more modern metrics such as BLEU, BERT, or ROUGE-L?
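
For context on the metric in question, here is a rough sketch of ANLS (Average Normalized Levenshtein Similarity) as it is commonly defined for DocVQA-style evaluation: each prediction is scored against every accepted answer as one minus the normalized edit distance, scores at or beyond a 0.5 distance threshold are zeroed, and the best score per question is averaged. The function names and the lowercase/strip normalization are assumptions for illustration, and the cookbook's own evaluation code may differ in details.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """predictions: list of str; references: list of lists of accepted answers."""
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(a), len(p))
            nl = levenshtein(a, p) / longest if longest else 0.0
            # Answers that are too far from the reference get no credit.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0


print(anls(["Microsoft"], [["microsoft"]]))  # 1.0
```

Unlike exact match, this gives partial credit to near-miss strings (for example, a single misread character), while the threshold prevents unrelated answers from accumulating credit.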