MhDang closed this issue 9 months ago.
If the current version is still in development, would it also be possible to point to any previous working version?
Also having the same issue, after successfully training last week
I can reproduce the issue in my case too. Looking at the LoRA script's commit history, there was a recent commit with big changes; using the previous commit runs fine for me.
Download this and replace ./diffusers/examples/text_to_image/train_text_to_image_lora.py. This is a temporary workaround; sadly I'm not familiar with this code.
I think the reason is that the LoRA parameters are added to the UNet after the UNet is sent to the GPU, so the LoRA layers are actually on the CPU, leading to the error. A simple fix is to first add the LoRA layers to the UNet and then send them to the GPU together:
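The ordering pitfall can be seen with a minimal, hedged PyTorch sketch (a plain `nn.Linear` stands in for the UNet; the `lora_layer` attribute name mirrors the script, but the modules here are purely illustrative). Moving a module with `.to()` only affects parameters that exist at that moment, so layers attached afterwards keep their original placement — demonstrated here with dtype on CPU, since the same rule governs device:

```python
import torch

# A stand-in "unet": move it first, then attach a layer afterwards.
base = torch.nn.Linear(4, 4)
base.to(dtype=torch.float16)             # like unet.to(accelerator.device, dtype=weight_dtype)
base.lora_layer = torch.nn.Linear(4, 4)  # added AFTER the move: stays float32

print(base.weight.dtype)                 # torch.float16
print(base.lora_layer.weight.dtype)      # torch.float32 -- the mismatch, analogous to cuda:0 vs cpu

# Fix: attach the LoRA layer first, then move the whole module together.
base2 = torch.nn.Linear(4, 4)
base2.lora_layer = torch.nn.Linear(4, 4)
base2.to(dtype=torch.float16)
print(base2.lora_layer.weight.dtype)     # torch.float16 -- everything consistent
```

The proposed fix follows the second pattern: add the LoRA layers to the UNet first, then call `unet.to(accelerator.device, dtype=weight_dtype)` once.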
Same behaviour here, falling back to the version @wellCh4n provided solved the problem.
> I think the reason is that the LoRA parameters are added to the UNet after the UNet is sent to the GPU, so the LoRA layers are actually on the CPU, leading to the error. A simple fix is to first add the LoRA layers to the UNet and then send them to the GPU together.
Can confirm this fixes the error for me, however at least on Colab with a T4 runtime I then get a "Expected is_sm80 || is_sm90 to be true, but got false." error message when the script tries to backpropagate the loss. Not sure if this is an issue with the new script or some compatibility issue with the CUDA drivers in the Colab though.
This seems like a setup problem to me as I am unable to reproduce it, even on a Google Colab: https://github.com/huggingface/diffusers/issues/5004#issuecomment-1780909598
I got the same error. However, reverting to the previous version, as @wellCh4n suggested, resolved the issue.
same issue following for fix
I am gonna have to repeat myself here:
https://github.com/huggingface/diffusers/issues/5897#issuecomment-1827075282
@sayakpaul Is there anything we can do to help you reproduce this issue? It seems significant, as multiple people with different setups have encountered the same problem. Otherwise we're forced to keep using the older version indefinitely.
A Colab notebook would be nice because that's the easiest to reproduce. As already indicated here, I was not able to reproduce at all: https://github.com/huggingface/diffusers/issues/5897#issuecomment-1827075282.
And I am quite sure https://github.com/huggingface/diffusers/pull/5388 will resolve these problems for good.
Hopefully this is fixed when moving to PEFT. In the meantime, if you don't want to revert to an older version: I had the same issue and fixed it by adding one line:
```python
unet.to(accelerator.device, dtype=weight_dtype)
```
at my line 539, immediately after the LoRA weights are added, and outside the loop:
```python
# Accumulate the LoRA params to optimize.
unet_lora_parameters.extend(attn_module.to_q.lora_layer.parameters())
unet_lora_parameters.extend(attn_module.to_k.lora_layer.parameters())
unet_lora_parameters.extend(attn_module.to_v.lora_layer.parameters())
unet_lora_parameters.extend(attn_module.to_out[0].lora_layer.parameters())

unet.to(accelerator.device, dtype=weight_dtype)
```
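As a quick sanity check (a sketch, not part of the actual script; `model` is a small stand-in for the patched UNet), you can verify that every trainable parameter ended up on a single device before training starts:

```python
import torch

# Stand-in model: in the real script this would be the UNet after the
# LoRA layers are attached and unet.to(...) has been called.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))

# Collect the set of devices across all parameters; a correct setup
# yields exactly one device (e.g. cuda:0), a broken one yields two.
devices = {p.device for p in model.parameters()}
assert len(devices) == 1, f"parameters on multiple devices: {devices}"
print("all parameters on", devices.pop())
```

Running this check right before the optimizer is built would have surfaced the cuda:0/cpu split immediately instead of failing mid-forward at `torch.mm`.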
Thanks to @IceClear and others who found that part of the UNet was on the wrong device.
If you want to open a PR fixing it, more than happy to merge :)
@sayakpaul Thank you - I've opened #6061, let me know if it needs any modification
Is this still in progress?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
I tried to experiment with LoRA training following examples/text_to_image/README.md#training-with-lora.
However, I got the error

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```

on line 801. The same issue did not occur when I tried the same example (with the implementation at that time) months ago. I noticed there were several commits after that.
I followed the README.md for installing packages and the non-LoRA training works well.
Thank you very much!
Reproduction
Then cd into the folder examples/text_to_image and run the following:

Logs
System Info
diffusers version: 0.24.0.dev0

Who can help?
@sayakpaul @patrickvonplaten