microsoft / Phi-3CookBook

This is a Phi-3 cookbook for getting started with Phi-3, a family of open-source AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.

LoRA fine-tuning script for Phi-3-vision-128k-instruct no longer works with the latest revision of the model #101

Closed: Mihaiii closed this issue 3 months ago

Mihaiii commented 3 months ago

Alright, so I spent several hours thinking it was a dependency issue, only to discover that it was actually a change in the model itself that was uploaded to Huggingface. :/

To confirm, I just changed the training script to load the model at revision='f998a184b56bf0399b3af85c50b20ec0d5688f5f', and now it works like a charm.

See the details below. [screenshot]
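For reference, here is a minimal sketch of that revision pin, assuming the script loads the model and processor via the transformers auto classes (the exact loading code in the training script may differ):

```python
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-3-vision-128k-instruct"
# The pre-change revision mentioned above; pinning it keeps the older weights/code.
REVISION = "f998a184b56bf0399b3af85c50b20ec0d5688f5f"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    revision=REVISION,        # pin the older model revision
    trust_remote_code=True,   # the model ships custom modeling code on the Hub
    torch_dtype="auto",
)
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    revision=REVISION,
    trust_remote_code=True,
)
```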

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Follow the LoRA fine-tuning steps described here: https://github.com/microsoft/Phi-3CookBook/blob/a3e5e48d6a56cb49b9a34b35331a41aa3f42ba3d/md/04.Fine-tuning/FineTuning_Vision.md Note that I only tried the DocVQA script.

Any log messages given by the failure

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
File "/workspace/train.py", line 570, in <module>
main()
File "/workspace/train.py", line 514, in main
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1961, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2902, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2925, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...

Expected/desired behavior

no errors

OS and Version?

Linux Ubuntu

More info:

I use a single card. Here is the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:86:00.0 Off |                    0 |
| N/A   36C    P0             43W /  300W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
leestott commented 3 months ago

@ChenRocks can you please look into this issue with the latest model release?

ChenRocks commented 3 months ago

Thanks @Mihaiii for reporting this. Could you check if #102 solves your issue?

Mihaiii commented 3 months ago

@ChenRocks Thank you for allowing the vision model to not be frozen when doing LoRA fine-tuning!

I tested it and it works fine as long as I do not pass the "freeze_vision_model" argument. Before this change, when LoRA fine-tuning, we had to freeze the vision model. After this change, when LoRA fine-tuning, we have to unfreeze the vision model (otherwise we get the error above). I think the intended behavior is to let the user decide.
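For illustration, freezing or unfreezing the vision tower before applying LoRA typically boils down to toggling `requires_grad` on its parameters; the name match below is a rough sketch and not the cookbook script's exact implementation:

```python
def set_vision_tower_trainable(model, trainable: bool) -> None:
    """Freeze (trainable=False) or unfreeze (trainable=True) the vision encoder.

    Matching on "vision" in the parameter names is an assumption; the real
    script addresses the vision tower by its actual module path.
    """
    for name, param in model.named_parameters():
        if "vision" in name:
            param.requires_grad_(trainable)
```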

leestott commented 3 months ago

@ChenRocks are you happy to close this issue based on @Mihaiii's comments?

ChenRocks commented 3 months ago

@Mihaiii are there any further issues regarding this? Otherwise we may close it. Thanks!

Mihaiii commented 3 months ago

As mentioned above, users won't be able to use the "freeze_vision_model" argument when doing LoRA. If that's intentional, then yes, sure, please close it, and thank you for the update. 🙌

ChenRocks commented 3 months ago

It seems there is some misunderstanding here. --freeze_vision_model is still allowed with LoRA. Does it not work for you?

Mihaiii commented 3 months ago

@ChenRocks no, it does not. I get the error from the initial message[1] when I'm using it. Does it work for you? Maybe I did something wrong.

[1] RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.[...]

ChenRocks commented 3 months ago

I see. I only tested with a single GPU. I just found that it does break with multi-GPU LoRA + freeze_vision. Let me investigate this.

ChenRocks commented 3 months ago

Adding the option `ddp_find_unused_parameters=(args.use_lora and args.freeze_vision_model)` to the training_args should be a good temporary workaround for now. EDIT: it makes the program run, but training is still incorrect.
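A rough sketch of that workaround, assuming the script builds its `TrainingArguments` from parsed CLI flags (the variables below stand in for `args.use_lora` and `args.freeze_vision_model`):

```python
from transformers import TrainingArguments

# Stand-ins for the script's parsed CLI arguments.
use_lora = True
freeze_vision_model = True

training_args = TrainingArguments(
    output_dir="./phi3v-lora-docvqa",  # placeholder output directory
    # Ask DDP to detect parameters that receive no gradient; only needed for the
    # LoRA + frozen-vision combination that triggers the error above.
    ddp_find_unused_parameters=(use_lora and freeze_vision_model),
)
```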

In the meantime, I'll investigate whether there's a more principled solution.

Mihaiii commented 3 months ago

I think I tried something like that in the past and it still did not work. I'll try again if you want me to, though. FWIW, in my case it doesn't work even if I'm using only one GPU, not multiple. I know it's a parallelism error, which is why it's strange.

ChenRocks commented 3 months ago

You're right, there is some deeper issue here. I'll keep working on a solution.

ChenRocks commented 3 months ago

#103 should work for now. I'll run some training jobs to make sure everything is fixed. After that, we'll merge to main.

Mihaiii commented 3 months ago

@ChenRocks @leestott

I tested the latest changes and now the script seems to work ok with or without the vision model frozen.

@ChenRocks Thanks again for making the changes that allow fine-tuning the vision model - I ran a comparison and I get better results now.

I'm going to go ahead and close this.