cc @sayakpaul
I am unable to reproduce this.
My environment is as follows:
- `diffusers` version: 0.18.0.dev0
- Platform: Linux-4.19.0-24-cloud-amd64-x86_64-with-glibc2.10
- Python version: 3.8.16
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Huggingface_hub version: 0.13.2
- Transformers version: 4.26.1
- Accelerate version: 0.18.0
- xFormers version: 0.0.16
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
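(For reference, a report in this shape can be regenerated with the CLI bundled with `diffusers`:)

```bash
# Prints diffusers/platform/PyTorch/transformers versions for bug reports
diffusers-cli env
```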
`diffusers` was installed like so:
pip install git+https://github.com/huggingface/diffusers
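(A quick sanity check that the source build is the one actually being imported:)

```bash
python -c "import diffusers; print(diffusers.__version__)"
```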
I used the following commands to launch training:
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME --caption_column="text" \
--resolution=512 --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--output_dir="sd-pokemon-model-lora"
What am I missing out on?
I did the following steps:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
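To sanity-check the driver and toolkit after these steps (the `nvcc` path assumes the default install location for the CUDA 11.8 deb, so adjust if needed):

```bash
nvidia-smi                                # confirms the driver loaded and sees the GPU
/usr/local/cuda-11.8/bin/nvcc --version   # confirms the toolkit version
```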
After installation:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:04.0 Off | 0 |
| N/A 33C P0 24W / 300W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
3. Cloned diffusers, created a Python virtual environment, and installed torch, diffusers, etc.:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/huggingface/diffusers
cd "$HOME/diffusers/examples/research_projects/lora" || exit
pip install -r requirements.txt
pip install safetensors
pip install omegaconf
pip install accelerate
accelerate config
Do you wish to use FP16 or BF16 (mixed precision)? fp16
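When the prompts are answered this way, `accelerate config` writes a `default_config.yaml` roughly like the sketch below (the exact keys vary across `accelerate` versions; this is an illustration, not the file from this machine):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
mixed_precision: fp16
num_processes: 1
use_cpu: false
```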
accelerate configuration saved at /home/shun/.cache/huggingface/accelerate/default_config.yaml
4. Trained using exactly the same command:
(.env) shun@instance-1:~/diffusers/examples/research_projects/lora$ export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
(.env) shun@instance-1:~/diffusers/examples/research_projects/lora$ accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME --caption_column="text" \
--resolution=512 --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--output_dir="sd-pokemon-model-lora"
06/29/2023 14:39:54 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
Downloading (…)cheduler_config.json: 100%|██████████| 308/308 [00:00<00:00, 2.00MB/s]
{'dynamic_thresholding_ratio', 'variance_type', 'clip_sample_range', 'prediction_type', 'sample_max_value', 'thresholding'} was not found in config. Values will be initialized to default values.
Downloading (…)tokenizer/vocab.json: 100%|██████████| 1.06M/1.06M [00:00<00:00, 20.6MB/s]
Downloading (…)tokenizer/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 146MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 472/472 [00:00<00:00, 2.74MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 806/806 [00:00<00:00, 4.10MB/s]
Downloading (…)_encoder/config.json: 100%|██████████| 617/617 [00:00<00:00, 3.11MB/s]
Downloading model.safetensors: 100%|██████████| 492M/492M [00:06<00:00, 77.1MB/s]
Downloading (…)main/vae/config.json: 100%|██████████| 547/547 [00:00<00:00, 3.19MB/s]
Downloading (…)ch_model.safetensors: 100%|██████████| 335M/335M [00:04<00:00, 73.0MB/s]
{'scaling_factor'} was not found in config. Values will be initialized to default values.
Downloading (…)ain/unet/config.json: 100%|██████████| 743/743 [00:00<00:00, 5.00MB/s]
Downloading (…)ch_model.safetensors: 100%|██████████| 3.44G/3.44G [00:44<00:00, 77.2MB/s]
{'use_linear_projection', 'time_embedding_act_fn', 'class_embeddings_concat', 'num_attention_heads', 'cross_attention_norm', 'only_cross_attention', 'conv_in_kernel', 'time_cond_proj_dim', 'conv_out_kernel', 'upcast_attention', 'mid_block_only_cross_attention', 'resnet_time_scale_shift', 'time_embedding_dim', 'projection_class_embeddings_input_dim', 'mid_block_type', 'addition_embed_type', 'resnet_skip_time_act', 'num_class_embeds', 'dual_cross_attention', 'class_embed_type', 'resnet_out_scale_factor', 'time_embedding_type', 'timestep_post_act', 'addition_embed_type_num_heads', 'encoder_hid_dim_type', 'encoder_hid_dim'} was not found in config. Values will be initialized to default values.
Downloading metadata: 100%|██████████| 731/731 [00:00<00:00, 825kB/s]
Downloading readme: 100%|██████████| 1.80k/1.80k [00:00<00:00, 1.63MB/s]
Downloading and preparing dataset imagefolder/pokemon (download: 95.05 MiB, generated: 113.89 MiB, post-processed: Unknown size, total: 208.94 MiB) to /home/shun/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...
Downloading data: 100%|██████████| 99.7M/99.7M [00:00<00:00, 124MB/s]
Downloading data files: 100%|██████████| 1/1 [00:01<00:00, 1.39s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1693.30it/s]
Dataset parquet downloaded and prepared to /home/shun/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 737.27it/s]
06/29/2023 14:40:59 - INFO - __main__ - ***** Running training *****
06/29/2023 14:40:59 - INFO - __main__ - Num examples = 833
06/29/2023 14:40:59 - INFO - __main__ - Num Epochs = 100
06/29/2023 14:40:59 - INFO - __main__ - Instantaneous batch size per device = 1
06/29/2023 14:40:59 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
06/29/2023 14:40:59 - INFO - __main__ - Gradient Accumulation steps = 1
06/29/2023 14:40:59 - INFO - __main__ - Total optimization steps = 83300
Steps: 0%| | 0/83300 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/shun/diffusers/examples/research_projects/lora/train_text_to_image_lora.py", line 1014, in
-------
- Python version: 3.10.6
- Linux instance-1 5.19.0-1026-gcp #28~22.04.1-Ubuntu SMP Tue Jun 6 07:24:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- transformers version: 4.30.2
- diffusers version: 0.18.0.dev0
- torch version: 2.0.1+cu118
- huggingface_hub version: 0.15.1
- accelerate version: 0.20.3
- xformers: not installed
------
So could this be a problem with the torch version?
@sayakpaul
Ah, so you're using Torch 2.0. Cool, will investigate. Thanks so much for being so detailed.
Tested with torch 1.13.1 and CUDA 11.6, still the same error.
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.13.1+cu116'
>>> torch.cuda.is_available()
True
>>>
Error:
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
> --pretrained_model_name_or_path=$MODEL_NAME \
> --dataset_name=$DATASET_NAME --caption_column="text" \
> --resolution=512 --random_flip \
> --train_batch_size=1 \
> --num_train_epochs=100 --checkpointing_steps=5000 \
> --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
> --seed=42 \
> --output_dir="sd-pokemon-model-lora"
06/30/2023 05:43:43 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
Downloading (…)cheduler_config.json: 100%|██████████| 308/308 [00:00<00:00, 34.7kB/s]
{'dynamic_thresholding_ratio', 'thresholding', 'prediction_type', 'variance_type', 'sample_max_value', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
Downloading (…)tokenizer/vocab.json: 100%|██████████| 1.06M/1.06M [00:00<00:00, 18.2MB/s]
Downloading (…)tokenizer/merges.txt: 100%|██████████| 525k/525k [00:00<00:00, 89.0MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 472/472 [00:00<00:00, 273kB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 806/806 [00:00<00:00, 477kB/s]
Downloading (…)_encoder/config.json: 100%|██████████| 617/617 [00:00<00:00, 73.4kB/s]
Downloading model.safetensors: 100%|██████████| 492M/492M [00:02<00:00, 195MB/s]
Downloading (…)main/vae/config.json: 100%|██████████| 547/547 [00:00<00:00, 67.4kB/s]
Downloading (…)ch_model.safetensors: 100%|██████████| 335M/335M [00:01<00:00, 201MB/s]
{'scaling_factor'} was not found in config. Values will be initialized to default values.
Downloading (…)ain/unet/config.json: 100%|██████████| 743/743 [00:00<00:00, 328kB/s]
Downloading (…)ch_model.safetensors: 100%|██████████| 3.44G/3.44G [00:39<00:00, 86.9MB/s]
{'time_cond_proj_dim', 'conv_out_kernel', 'addition_embed_type_num_heads', 'projection_class_embeddings_input_dim', 'resnet_out_scale_factor', 'encoder_hid_dim', 'mid_block_type', 'cross_attention_norm', 'timestep_post_act', 'conv_in_kernel', 'time_embedding_act_fn', 'dual_cross_attention', 'only_cross_attention', 'upcast_attention', 'class_embed_type', 'resnet_skip_time_act', 'use_linear_projection', 'num_attention_heads', 'time_embedding_type', 'num_class_embeds', 'time_embedding_dim', 'class_embeddings_concat', 'encoder_hid_dim_type', 'mid_block_only_cross_attention', 'resnet_time_scale_shift', 'addition_embed_type'} was not found in config. Values will be initialized to default values.
Downloading metadata: 100%|██████████| 731/731 [00:00<00:00, 4.82MB/s]
Downloading readme: 100%|██████████| 1.80k/1.80k [00:00<00:00, 11.1MB/s]
Downloading and preparing dataset imagefolder/pokemon (download: 95.05 MiB, generated: 113.89 MiB, post-processed: Unknown size, total: 208.94 MiB) to /home/shun/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...
Downloading data: 100%|██████████| 99.7M/99.7M [00:01<00:00, 68.2MB/s]
Downloading data files: 100%|██████████| 1/1 [00:02<00:00, 2.10s/it]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1600.88it/s]
Dataset parquet downloaded and prepared to /home/shun/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7. Subsequent calls will reuse this data.
100%|██████████| 1/1 [00:00<00:00, 713.32it/s]
06/30/2023 05:44:37 - INFO - __main__ - ***** Running training *****
06/30/2023 05:44:37 - INFO - __main__ - Num examples = 833
06/30/2023 05:44:37 - INFO - __main__ - Num Epochs = 100
06/30/2023 05:44:37 - INFO - __main__ - Instantaneous batch size per device = 1
06/30/2023 05:44:37 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2023 05:44:37 - INFO - __main__ - Gradient Accumulation steps = 1
06/30/2023 05:44:37 - INFO - __main__ - Total optimization steps = 83300
Steps: 0%| | 0/83300 [00:00<?, ?it/s]Traceback (most recent call last):
File "train_text_to_image_lora.py", line 1014, in <module>
main()
File "train_text_to_image_lora.py", line 817, in main
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py", line 765, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/diffusers/models/embeddings.py", line 192, in forward
sample = self.linear_1(sample)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)
Steps: 0%| | 0/83300 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/shun/diffusers/.env/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/accelerate/commands/launch.py", line 941, in launch_command
simple_launcher(args)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/shun/diffusers/.env/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--num_train_epochs=100', '--checkpointing_steps=5000', '--learning_rate=1e-04', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--seed=42', '--output_dir=sd-pokemon-model-lora']' returned non-zero exit status 1.
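The traceback shows `F.linear` receiving an input on `cuda:0` while the layer's weight is still on the CPU (or vice versa). A minimal sketch of that failure mode, purely for illustration and not taken from the script:

```python
import torch

lin = torch.nn.Linear(320, 1280)        # module (and its weight) stays on the CPU
x = torch.randn(1, 320, device="cuda")  # input lives on the GPU

try:
    lin(x)
except RuntimeError as e:
    print(e)  # Expected all tensors to be on the same device ...

lin.to("cuda")       # moving the module onto the GPU resolves the mismatch
print(lin(x).shape)  # torch.Size([1, 1280])
```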
Tried without accelerate (the `CalledProcessError` above is just the launcher surfacing the script's non-zero exit):
python3 train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME --caption_column="text" \
--resolution=512 --random_flip \
--train_batch_size=1 \
--num_train_epochs=100 --checkpointing_steps=5000 \
--learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
--seed=42 \
--output_dir="sd-pokemon-model-lora"
06/30/2023 05:50:24 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: no
{'sample_max_value', 'variance_type', 'thresholding', 'clip_sample_range', 'dynamic_thresholding_ratio', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'scaling_factor'} was not found in config. Values will be initialized to default values.
{'class_embeddings_concat', 'class_embed_type', 'resnet_skip_time_act', 'use_linear_projection', 'time_embedding_act_fn', 'encoder_hid_dim', 'addition_embed_type', 'conv_out_kernel', 'resnet_time_scale_shift', 'time_cond_proj_dim', 'addition_embed_type_num_heads', 'time_embedding_type', 'resnet_out_scale_factor', 'num_class_embeds', 'conv_in_kernel', 'encoder_hid_dim_type', 'dual_cross_attention', 'timestep_post_act', 'time_embedding_dim', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'mid_block_only_cross_attention', 'upcast_attention', 'mid_block_type', 'only_cross_attention', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
06/30/2023 05:50:29 - WARNING - datasets.builder - Found cached dataset parquet (/home/shun/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1/1 [00:00<00:00, 661.04it/s]
06/30/2023 05:50:30 - INFO - __main__ - ***** Running training *****
06/30/2023 05:50:30 - INFO - __main__ - Num examples = 833
06/30/2023 05:50:30 - INFO - __main__ - Num Epochs = 100
06/30/2023 05:50:30 - INFO - __main__ - Instantaneous batch size per device = 1
06/30/2023 05:50:30 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2023 05:50:30 - INFO - __main__ - Gradient Accumulation steps = 1
06/30/2023 05:50:30 - INFO - __main__ - Total optimization steps = 83300
Steps: 0%| | 0/83300 [00:00<?, ?it/s]Traceback (most recent call last):
File "train_text_to_image_lora.py", line 1014, in <module>
main()
File "train_text_to_image_lora.py", line 817, in main
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py", line 765, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/diffusers/models/embeddings.py", line 192, in forward
sample = self.linear_1(sample)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shun/diffusers/.env/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)
Steps: 0%| | 0/83300 [00:01<?, ?it/s]
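Since the error also occurs with mixed precision off, dtype casting is unlikely to be the culprit; module placement is the more likely suspect. The mainline `examples/text_to_image/train_text_to_image_lora.py` moves the frozen components onto the accelerator device with `.to(accelerator.device, dtype=weight_dtype)` before the training loop; if the research-project copy lacks an equivalent step, the weights would stay on the CPU. A hedged, self-contained reconstruction of that placement step (the model id and dtype handling here are assumptions for illustration):

```python
import torch
from accelerate import Accelerator
from diffusers import UNet2DConditionModel

# Mirror of the device-placement step in the mainline LoRA example.
accelerator = Accelerator(mixed_precision="fp16")
weight_dtype = torch.float16 if accelerator.mixed_precision == "fp16" else torch.float32

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.to(accelerator.device, dtype=weight_dtype)  # frozen weights onto cuda:0
print(next(unet.parameters()).device)            # expect cuda:0
```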
Nvidia-smi output:
Fri Jun 30 05:53:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:04.0 Off | 0 |
| N/A 35C P0 22W / 300W | 108MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1025 G /usr/lib/xorg/Xorg 95MiB |
| 0 N/A N/A 1132 G /usr/bin/gnome-shell 12MiB |
+-----------------------------------------------------------------------------+
@sayakpaul
Now this I cannot confirm, as mentioned in https://github.com/huggingface/diffusers/issues/3884#issuecomment-1613007671. I would suggest seeing if this error persists when you reinstall diffusers from source (`pip install git+https://github.com/huggingface/diffusers`).
Meanwhile, I am looking into whether this fails with PT 2.0.
I am still unable to reproduce the bug even on PT 2.0. Check out this Colab Gist: https://colab.research.google.com/gist/sayakpaul/065dd9dd92bf41af954c5a18470e64eb/scratchpad.ipynb.
I set up the environment there and started the training from a Colab Terminal (requires a pro subscription). It went as expected.
@pcuenca if you have time, could you maybe see if you're able to reproduce the reported bug? This is just to confirm I am not missing anything obvious.
Can you check this Colab link to see what I did wrong? https://colab.research.google.com/drive/1qfugOTGcpg9RJDZjdwrgPnaYWObUeJUQ?usp=sharing
Thanks for your time, @sayakpaul.
Ah, you're using the LoRA script from the research projects directory. Unfortunately, we don't maintain that directory, so I am pinging @haofanwang to check what they have to say.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
Got the following error when trying to use train_text_to_image_lora.py.
Here is the full log: https://pastebin.com/Mjy5yKHe
Reproduction
Just run the train_text_to_image_lora.py script; see the log.
Logs
System Info
- `diffusers` version: installed from source
- System: Ubuntu 22.04
- CUDA: 11.8
Who can help?
@williamberman, @sayakpaul, @yiyixuxu