Closed: Mihaiii closed this issue 3 months ago.
@ChenRocks can you please look into this issue with the latest model release?
Thanks @Mihaiii for reporting this. Could you check if #102 solves your issue?
@ChenRocks Thank you for allowing the vision model to not be frozen when doing LoRA finetuning!
I tested and it works fine as long as I do not pass the "freeze_vision_model" argument. Before this change, when LoRA finetuning, we had to freeze the vision model. After this change, when LoRA finetuning, we have to unfreeze the vision model (otherwise we get the above error). I think the intended behavior is to let the user decide.
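For reference, this is roughly the behavior I'd expect from the flag; a minimal sketch, assuming the script toggles requires_grad on the vision tower (the "vision_model" attribute name is a placeholder, not necessarily what the script uses):

```python
# Hypothetical sketch of a user-controlled freeze flag; the attribute name
# "vision_model" is a placeholder and may not match the actual script.
def maybe_freeze_vision(model, freeze_vision_model: bool) -> None:
    if freeze_vision_model:
        for p in model.vision_model.parameters():  # placeholder attribute
            p.requires_grad_(False)  # exclude the vision tower from training
```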
@ChenRocks are you happy to close this issue based on @Mihaiii's comments?
@Mihaiii are there further issues regarding this? Otherwise we may close this. Thanks!
As mentioned above, users won't be able to use the "freeze_vision_model" argument when doing LoRA. If that's intentional, then yes, sure, please close, and thank you for the update. 🙌
Seems there is some misunderstanding here. --freeze_vision_model is still allowed with LoRA. Does it not work for you?
@ChenRocks no, it does not. I get the error from the initial message[1] when I'm using it. Does it work for you? Maybe I did something wrong.
[1] RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.[...]
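For anyone hitting the same thing: my understanding (an assumption on my part, not something confirmed in the repo) is that DDP raises this when a module owns trainable parameters that never receive gradients, e.g. adapter weights attached to a branch that is skipped in the forward pass. A self-contained sketch that reproduces the same message without the finetuning script:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(8, 8)
        self.skipped = torch.nn.Linear(8, 8)  # trainable but never used below

    def forward(self, x):
        return self.used(x)  # self.skipped receives no gradient


def main():
    # A single-process group is enough to exercise DDP's reducer checks.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
    )
    model = DDP(Toy())  # find_unused_parameters defaults to False
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(2):  # the RuntimeError typically surfaces on iteration 2
        loss = model(torch.randn(4, 8)).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```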
I see. I only tested with a single GPU. Just found that it does break with multi-GPU LoRA + freeze_vision. Let me investigate this.
EDIT: It makes the program run, but training is still incorrect.

Adding this option

ddp_find_unused_parameters=(args.use_lora and args.freeze_vision_model),

to the training_args should be a good temporary workaround for now. In the meantime, I'll investigate whether there's a more principled solution.
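Concretely, something like this is what I have in mind; a sketch assuming a standard transformers TrainingArguments setup (output_dir and other fields are placeholders; args.use_lora and args.freeze_vision_model are the script's existing flags):

```python
from transformers import TrainingArguments

# Sketch of the temporary workaround: let DDP tolerate parameters that
# receive no gradients, which is what happens when the frozen vision tower
# never contributes to the loss.
training_args = TrainingArguments(
    output_dir="./out",  # placeholder
    ddp_find_unused_parameters=(args.use_lora and args.freeze_vision_model),
)
```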
I think I tried something like that in the past and it still did not work. I'll try again if you want me to, though. FWIW, in my case it won't work even if I'm using only one GPU, not multiple. I know it's a parallelism error, which is why it's strange.
You're right, there is some deeper issue here. I'll keep working on a solution.
@ChenRocks @leestott
I tested the latest changes and now the script seems to work fine with or without the vision model frozen.
@ChenRocks Thanks again for making the changes to allow finetuning the vision model - I compared results and I get better ones now.
I'm gonna go ahead and close this.
Alright, so I spent several hours thinking it was a dependency issue, only to discover that it was actually a change in the model itself that was uploaded to Hugging Face. :/
To confirm, I just changed the training script to load the model at revision='f998a184b56bf0399b3af85c50b20ec0d5688f5f', and now it works like a charm.
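In case it helps anyone else, pinning the revision is a one-line change; a sketch with a placeholder model id (substitute whatever the training script actually loads):

```python
from transformers import AutoModelForCausalLM

# Load the checkpoint at a fixed Hub revision; the model id is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "org/model-id",  # placeholder: use the id from the finetuning script
    revision="f998a184b56bf0399b3af85c50b20ec0d5688f5f",
    trust_remote_code=True,
)
```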
See the details below.
This issue is for a: (mark with an x)
Minimal steps to reproduce
Any log messages given by the failure
Any log messages given by the failure
OS and Version?
More info:
I use a single card. Here is the nvidia-smi output: