huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

train_text_to_image_flax.py no flax_model.msgpack or pytorch_model.bin #2410

Closed. treksis closed this issue 1 year ago.

treksis commented 1 year ago

Describe the bug

Hi, I'm running in a Colab Pro environment with a TPU v2 for testing purposes.

I get an error about a missing flax_model.msgpack or pytorch_model.bin:
2023-02-18 02:07:26.213102: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-18 02:07:26.213285: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-18 02:07:26.213310: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-02-18 02:07:29.194168: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
WARNING:jax._src.lib.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
WARNING:datasets.builder:Using custom data configuration lambdalabs--pokemon-blip-captions-10e3527a764857bd
WARNING:datasets.builder:Found cached dataset parquet (/root/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100% 1/1 [00:00<00:00, 369.54it/s]
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/text_encoder/config.json
Model config CLIPTextConfig {
  "_name_or_path": "openai/clip-vit-large-patch14",
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 512,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 49408
}

Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_flax.py", line 579, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_flax.py", line 390, in main
    text_encoder = FlaxCLIPTextModel.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="text_encoder",
    )
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_flax_utils.py", line 764, in from_pretrained
    raise EnvironmentError(
OSError: CompVis/stable-diffusion-v1-4 does not appear to have a file named
flax_model.msgpack or pytorch_model.bin.

Reproduction

!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .

%cd /content/diffusers/examples/text_to_image
!pip install -r requirements_flax.txt

!huggingface-cli login
!accelerate config

MODEL_NAME="CompVis/stable-diffusion-v1-4"
dataset_name="lambdalabs/pokemon-blip-captions"

!python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model" 

Logs

No response

System Info

- `diffusers` version: 0.14.0.dev0
- Platform: Linux-5.10.147+-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.13.1+cu116 (False)
- Huggingface_hub version: 0.12.1
- Transformers version: 4.26.1
- Accelerate version: 0.16.0
- xFormers version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Accelerate ENV: [screenshot]

TPU version: [screenshot]

patrickvonplaten commented 1 year ago

Gently pinging @pcuenca - in case you have 5 minutes, could you take a look here?

pcuenca commented 1 year ago

Hi @treksis, the problem is that the Flax weights are currently stored in a different branch of the repo, called flax. For this to work, we need to pass that branch as the revision:

!python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --revision=flax \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model"
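
In case it's useful outside the training script, here is a minimal sketch of the same loading calls from the traceback with the revision passed explicitly (assuming the CompVis/stable-diffusion-v1-4 model id from above); passing --revision=flax has roughly this effect inside the script:

from transformers import FlaxCLIPTextModel
from diffusers import FlaxAutoencoderKL

model_id = "CompVis/stable-diffusion-v1-4"

# revision="flax" makes the Hub look for flax_model.msgpack in the "flax"
# branch instead of the main branch, which only ships PyTorch weights.
text_encoder = FlaxCLIPTextModel.from_pretrained(
    model_id, subfolder="text_encoder", revision="flax"
)
vae, vae_params = FlaxAutoencoderKL.from_pretrained(
    model_id, subfolder="vae", revision="flax"
)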

Note, however, that the preferred way to store different model weights going forward will be through the use of variants (see #2305 for details), so those weights will be unified in the main branch in the future.
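
For illustration only, a rough, hypothetical sketch of what variant-based loading looks like on the PyTorch side, assuming a repo that publishes an fp16 variant; the point is that a variant selects alternative weight files inside the same branch rather than a separate branch like flax:

from diffusers import DiffusionPipeline

# Hypothetical example: variant="fp16" picks weight files such as
# *.fp16.safetensors from the main branch instead of a dedicated branch.
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", variant="fp16"
)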