huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

train_text_to_image_flax.py no flax_model.msgpack or pytorch_model.bin #2410

Closed. treksis closed this issue 1 year ago.

treksis commented 1 year ago

Describe the bug

Hi, I'm running in a Colab Pro environment with a TPU v2 for testing purposes.

I get an error about a missing flax_model.msgpack or pytorch_model.bin:
2023-02-18 02:07:26.213102: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-18 02:07:26.213285: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-18 02:07:26.213310: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-02-18 02:07:29.194168: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
WARNING:jax._src.lib.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
WARNING:datasets.builder:Using custom data configuration lambdalabs--pokemon-blip-captions-10e3527a764857bd
WARNING:datasets.builder:Found cached dataset parquet (/root/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100% 1/1 [00:00<00:00, 369.54it/s]
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/tokenizer/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/3857c45b7d4e78b3ba0f39d4d7f50a2a05aa23d4/text_encoder/config.json
Model config CLIPTextConfig {
  "_name_or_path": "openai/clip-vit-large-patch14",
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 512,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 49408
}

Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_flax.py", line 579, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_flax.py", line 390, in main
    text_encoder = FlaxCLIPTextModel.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="text_encoder",
    )
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_flax_utils.py", line 764, in from_pretrained
    raise EnvironmentError(
OSError: CompVis/stable-diffusion-v1-4 does not appear to have a file named
flax_model.msgpack or pytorch_model.bin.

Reproduction

!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .

%cd /content/diffusers/examples/text_to_image
!pip install -r requirements_flax.txt

!huggingface-cli login
!accelerate config

MODEL_NAME="CompVis/stable-diffusion-v1-4"
dataset_name="lambdalabs/pokemon-blip-captions"

!python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model" 

Logs

No response

System Info

- `diffusers` version: 0.14.0.dev0
- Platform: Linux-5.10.147+-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.13.1+cu116 (False)
- Huggingface_hub version: 0.12.1
- Transformers version: 4.26.1
- Accelerate version: 0.16.0
- xFormers version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Accelerate ENV: [screenshot]

TPU version: [screenshot]

patrickvonplaten commented 1 year ago

Gently pinging @pcuenca - in case you have 5 minutes, could you take a look here?

pcuenca commented 1 year ago

Hi @treksis, the problem is that the Flax weights are currently stored in a different branch of the repo, called flax. For this to work, we need to pass that branch as the revision:

!python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --revision=flax \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model"
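
In case it's useful outside the training script, here is a minimal sketch of the same loading calls from the traceback with the revision passed explicitly (assuming the CompVis/stable-diffusion-v1-4 model id from above); passing --revision=flax has roughly this effect inside the script:

from transformers import FlaxCLIPTextModel
from diffusers import FlaxAutoencoderKL

model_id = "CompVis/stable-diffusion-v1-4"

# revision="flax" makes the Hub look for flax_model.msgpack in the "flax"
# branch instead of the main branch, which only ships PyTorch weights.
text_encoder = FlaxCLIPTextModel.from_pretrained(
    model_id, subfolder="text_encoder", revision="flax"
)
vae, vae_params = FlaxAutoencoderKL.from_pretrained(
    model_id, subfolder="vae", revision="flax"
)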

Note, however, that the preferred way to store different model weights going forward will be through the use of variants (see #2305 for details), so those weights will be unified in the main branch in the future.
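
For illustration only, a rough, hypothetical sketch of what variant-based loading looks like on the PyTorch side, assuming a repo that publishes an fp16 variant; the point is that a variant selects alternative weight files inside the same branch rather than a separate branch like flax:

from diffusers import DiffusionPipeline

# Hypothetical example: variant="fp16" picks weight files such as
# *.fp16.safetensors from the main branch instead of a dedicated branch.
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", variant="fp16"
)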