mkshing / ziplora-pytorch

Implementation of "ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs"
MIT License

Breaking with `--enable_xformers_memory_efficient_attention \` #4

Open · guivr opened this issue 7 months ago

guivr commented 7 months ago

Hi! Thank you so much for this.

I'm trying to run this on Google Colab but I'm always running into the "CUDA out of memory" error.

I've tried adding:

+  --enable_xformers_memory_efficient_attention \
+  --gradient_checkpointing \
+  --use_8bit_adam \
+  --mixed_precision="fp16" \

but it's breaking with:

--enable_xformers_memory_efficient_attention \

error:

ValueError: Query/Key/Value should either all have the same dtype, or (in the quantized case) Key/Value should have dtype torch.int32
  query.dtype: torch.float32
  key.dtype  : torch.float16
  value.dtype: torch.float16

Once I remove this argument it works, but it always fails at the last step (e.g. 1000/1000) because of out-of-memory (16 GB limit, V100).
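For reference, the `ValueError` above is xformers refusing mixed-precision inputs: with `--mixed_precision="fp16"` the key/value tensors end up in fp16 while the query stays in fp32. A minimal sketch of the same failure and the cast that avoids it (hypothetical shapes; assumes a CUDA GPU with xformers installed):

```python
# Reproduce the dtype check that memory-efficient attention performs.
import torch
import xformers.ops as xops

# Shapes are arbitrary: (batch, seq_len, num_heads, head_dim).
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float32)  # query left in fp32
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

try:
    xops.memory_efficient_attention(q, k, v)  # raises the ValueError quoted above
except ValueError as err:
    print(err)

out = xops.memory_efficient_attention(q.half(), k, v)  # same dtype everywhere works
print(out.dtype)  # torch.float16
```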

jinnsp commented 7 months ago

Are you working with PyTorch 2.1? This might be linked to an issue you can find here: https://github.com/huggingface/diffusers/issues/5484

Also, if it fails at the end, it's likely just a validation issue. Training is probably done (check your LoRA save path); if so, you can ignore the error, or just remove the `validation_*` arguments.
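If you want to double-check that the run actually finished, a minimal sketch is to load the saved weights back into an SDXL pipeline (the output path below is a placeholder for whatever `--output_dir` you used; this assumes the script saved diffusers-format LoRA weights):

```python
# Load the trained LoRA into the SDXL base model and render a quick test image.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/output_dir")  # replace with your --output_dir
image = pipe("a photo of the trained subject", num_inference_steps=25).images[0]
image.save("lora_check.png")
```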

guivr commented 7 months ago

> Are you working with PyTorch 2.1? This might be linked to an issue you can find here: huggingface/diffusers#5484
>
> Also, if it fails at the end, it's likely just a validation issue. Training is probably done (check your LoRA save path); if so, you can ignore the error, or just remove the `validation_*` arguments.

Yep, PyTorch 2.1. What version should it be?

euminds commented 7 months ago

> Are you working with PyTorch 2.1? This might be linked to an issue you can find here: huggingface/diffusers#5484
>
> Also, if it fails at the end, it's likely just a validation issue. Training is probably done (check your LoRA save path); if so, you can ignore the error, or just remove the `validation_*` arguments.

> Yep, PyTorch 2.1. What version should it be?

Without the `enable_xformers_memory_efficient_attention` flag, training works fine.

guivr commented 7 months ago

> Are you working with PyTorch 2.1? This might be linked to an issue you can find here: huggingface/diffusers#5484
>
> Also, if it fails at the end, it's likely just a validation issue. Training is probably done (check your LoRA save path); if so, you can ignore the error, or just remove the `validation_*` arguments.

> Yep, PyTorch 2.1. What version should it be?

> Without the `enable_xformers_memory_efficient_attention` flag, training works fine.

Yes, if your GPU has more than 16 GB of memory. On Colab I was trying with a V100 and it kept failing at the end; an A100 was unavailable at the time. A few hours later an A100 became available, and then it worked.

sayakpaul commented 7 months ago

If you're on PyTorch 2.1, it might be a problem to use it with xformers (see: https://github.com/huggingface/diffusers/issues/5484). In that case, we default to SDPA (scaled dot-product attention), which should run on a Google Colab free tier.

If xformers usage is a must, I would recommend using it with Torch 1.13.1.
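For anyone who wants to be explicit about it: with torch >= 2.0, diffusers picks SDPA automatically, so dropping the xformers flag is enough. A minimal sketch of opting in by hand (the model ID is the standard SDXL base checkpoint; adjust to whatever you train against):

```python
# Explicitly select the torch-2.x scaled dot-product attention processor.
from diffusers import UNet2DConditionModel
from diffusers.models.attention_processor import AttnProcessor2_0

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.set_attn_processor(AttnProcessor2_0())  # same effect as the torch>=2.0 default
```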

euminds commented 7 months ago

> Are you working with PyTorch 2.1? This might be linked to an issue you can find here: huggingface/diffusers#5484
>
> Also, if it fails at the end, it's likely just a validation issue. Training is probably done (check your LoRA save path); if so, you can ignore the error, or just remove the `validation_*` arguments.

> Yep, PyTorch 2.1. What version should it be?

> Without the `enable_xformers_memory_efficient_attention` flag, training works fine.

> Yes, if your GPU has more than 16 GB of memory. On Colab I was trying with a V100 and it kept failing at the end; an A100 was unavailable at the time. A few hours later an A100 became available, and then it worked.

When I was running train_dreambooth_ziplora_sdxl.py on a 4090 (24 GB), I also ran into "CUDA out of memory".

jinnsp commented 7 months ago

> Are you working with PyTorch 2.1? This might be linked to an issue you can find here: huggingface/diffusers#5484
>
> Also, if it fails at the end, it's likely just a validation issue. Training is probably done (check your LoRA save path); if so, you can ignore the error, or just remove the `validation_*` arguments.

> Yep, PyTorch 2.1. What version should it be?

> Without the `enable_xformers_memory_efficient_attention` flag, training works fine.

> Yes, if your GPU has more than 16 GB of memory. On Colab I was trying with a V100 and it kept failing at the end; an A100 was unavailable at the time. A few hours later an A100 became available, and then it worked.

> When I was running train_dreambooth_ziplora_sdxl.py on a 4090 (24 GB), I also ran into "CUDA out of memory".

See #8. You can free VRAM instantly. I successfully ran it on a 4090.
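The actual change in #8 isn't quoted here, but the usual recipe for handing cached GPU memory back between training and validation looks roughly like this (a generic sketch, not the PR itself):

```python
# Generic helper to release cached CUDA memory, e.g. after deleting the validation pipeline.
import gc
import torch

def free_vram() -> None:
    gc.collect()              # reclaim Python objects that are no longer referenced
    torch.cuda.empty_cache()  # return cached, unused blocks to the CUDA driver
    print(f"reserved after cleanup: {torch.cuda.memory_reserved() / 2**30:.2f} GiB")

# Typical use: `del pipeline` (the validation pipeline), then free_vram().
```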

euminds commented 7 months ago

> Are you working with PyTorch 2.1? This might be linked to an issue you can find here: huggingface/diffusers#5484
>
> Also, if it fails at the end, it's likely just a validation issue. Training is probably done (check your LoRA save path); if so, you can ignore the error, or just remove the `validation_*` arguments.

> Yep, PyTorch 2.1. What version should it be?

> Without the `enable_xformers_memory_efficient_attention` flag, training works fine.

> Yes, if your GPU has more than 16 GB of memory. On Colab I was trying with a V100 and it kept failing at the end; an A100 was unavailable at the time. A few hours later an A100 became available, and then it worked.

> When I was running train_dreambooth_ziplora_sdxl.py on a 4090 (24 GB), I also ran into "CUDA out of memory".

> See #8. You can free VRAM instantly. I successfully ran it on a 4090.

Without the `enable_xformers_memory_efficient_attention` flag, and following https://github.com/mkshing/ziplora-pytorch/pull/8, I still run into "CUDA out of memory" on a 3090 (24 GB), sorry, not a 4090. My accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 1,2
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

xiaohaipeng commented 6 months ago

I think xformers and torch were compiled against different CUDA versions, which caused this problem. Try recompiling xformers from source.
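A quick way to check for that kind of mismatch before rebuilding (assumes both packages import; `python -m xformers.info` prints the build details on recent xformers releases):

```python
# Compare the CUDA toolkit torch was built with against the installed xformers build.
import torch
import xformers

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("xformers:", xformers.__version__)
# Run `python -m xformers.info` for the full build report; if the CUDA versions
# disagree, rebuilding xformers from source against the installed torch
# (see the xformers README) usually resolves it.
```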