stable-diffusion-2-1-base in BF16 on Gaudi2D fails with RuntimeError: synNodeCreateWithId failed for node: spatial_convolution with synStatus 26 [Generic failure].

KiwiHana commented 3 weeks ago

System Info

docker run -it --name sd --runtime=habana -v /home/sd/:/data/ -e "http_proxy=$http_proxy" -e "https_proxy=$https_proxy" -e "no_proxy=localhost,127.0.0.1" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest

optimum                     1.21.4
optimum-habana              1.14.0.dev0

habana_gpu_migration        1.17.0.495
habana-media-loader         1.17.0.495
habana-pyhlml               1.17.0.495
habana_quantization_toolkit 1.17.0.495
habana-torch-dataloader     1.17.0.495
habana-torch-plugin         1.17.0.495

| HL-SMI Version:                                hl-1.17.0-fw-51.3.0          |
| Driver Version:                                     1.17.0-28a11ca

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

cd /data/optimum-habana/examples/stable-diffusion

python text_to_image_generation.py --model_name_or_path /data/sd/stable-diffusion-2-1-base --prompts "An image of a squirrel in Picasso style" --num_images_per_prompt 28 --batch_size 1 --height 512 --width 512 --image_save_dir /tmp/stable_diffusion_images --use_habana --use_hpu_graphs --gaudi_config Habana/stable-diffusion-2 --bf16

Expected behavior

/data/optimum-habana/examples/stable-diffusion# python text_to_image_generation.py --model_name_or_path /data/stable-diffusion-2-1-base --prompts "An image of a squirrel in Picasso style" --num_images_per_prompt 28 --batch_size 1 --height 512 --width 512 --image_save_dir /data/stable_diffusion_images --use_habana --use_hpu_graphs --gaudi_config /data --bf16
/usr/local/lib/python3.10/dist-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
  deprecate("VQEncoderOutput", "0.31", deprecation_message)
/usr/local/lib/python3.10/dist-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
  deprecate("VQModel", "0.31", deprecation_message)
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  9.80it/s]
[INFO|pipeline_utils.py:130] 2024-08-23 09:23:27,270 >> Enabled HPU graphs.
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
[INFO|configuration_utils.py:325] 2024-08-23 09:23:27,397 >> loading configuration file /data/gaudi_config.json
[INFO|configuration_utils.py:380] 2024-08-23 09:23:27,397 >> GaudiConfig {
  "autocast_bf16_ops": [
    "_convolution.deprecated",
    "_convolution",
    "conv1d",
    "conv2d",
    "conv3d",
    "conv_tbc",
    "conv_transpose1d",
    "conv_transpose2d.input",
    "conv_transpose3d.input",
    "convolution",
    "prelu",
    "addmm",
    "addmv",
    "addr",
    "matmul",
    "einsum",
    "mm",
    "mv",
    "silu",
    "linear",
    "addbmm",
    "baddbmm",
    "bmm",
    "chain_matmul",
    "linalg_multi_dot",
    "layer_norm",
    "group_norm"
  ],
  "autocast_fp32_ops": [
    "acos",
    "asin",
    "cosh",
    "erfinv",
    "exp",
    "expm1",
    "log",
    "log10",
    "log2",
    "log1p",
    "reciprocal",
    "rsqrt",
    "sinh",
    "tan",
    "pow.Tensor_Scalar",
    "pow.Tensor_Tensor",
    "pow.Scalar",
    "softplus",
    "frobenius_norm",
    "frobenius_norm.dim",
    "nuclear_norm",
    "nuclear_norm.dim",
    "cosine_similarity",
    "poisson_nll_loss",
    "cosine_embedding_loss",
    "nll_loss",
    "nll_loss2d",
    "hinge_embedding_loss",
    "kl_div",
    "l1_loss",
    "smooth_l1_loss",
    "huber_loss",
    "mse_loss",
    "margin_ranking_loss",
    "multilabel_margin_loss",
    "soft_margin_loss",
    "triplet_margin_loss",
    "multi_margin_loss",
    "binary_cross_entropy_with_logits",
    "dist",
    "pdist",
    "cdist",
    "renorm",
    "logsumexp"
  ],
  "optimum_version": "1.21.4",
  "transformers_version": "4.43.3",
  "use_dynamic_shapes": false,
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "use_torch_autocast": true
}

[WARNING|pipeline_utils.py:156] 2024-08-23 09:23:27,398 >> `use_torch_autocast` is True in the given Gaudi configuration but `torch_dtype=torch.bfloat16` was given. Disabling mixed precision and continuing in bf16 only.
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 192
CPU RAM       : 2113382084 KB
------------------------------------------------------------------------------
[INFO|pipeline_stable_diffusion.py:411] 2024-08-23 09:23:31,391 >> 1 prompt(s) received, 28 generation(s) per prompt, 1 sample(s) per batch, 28 total batch(es).
  0%|          | 0/28 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py:1369: FutureWarning: `scale` is deprecated and will be removed in version 1.0.0. The `scale` argument is deprecated and will be ignored. Please remove it, as passing it will raise an error in the future. `scale` should directly be passed while calling the underlying pipeline component i.e., via `cross_attention_kwargs`.
  deprecate("scale", "1.0.0", deprecation_message)
/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py:2628: FutureWarning: `scale` is deprecated and will be removed in version 1.0.0. The `scale` argument is deprecated and will be ignored. Please remove it, as passing it will raise an error in the future. `scale` should directly be passed while calling the underlying pipeline component i.e., via `cross_attention_kwargs`.
  deprecate("scale", "1.0.0", deprecation_message)
  0%|                                                                                                                                                                     | 0/28 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/data/optimum-habana/examples/stable-diffusion/text_to_image_generation.py", line 538, in <module>
    main()
  File "/data/optimum-habana/examples/stable-diffusion/text_to_image_generation.py", line 507, in main
    outputs = pipeline(prompt=args.prompts, **kwargs_call)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 532, in __call__
    noise_pred = self.unet_hpu(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 670, in unet_hpu
    return self.capture_replay(latent_model_input, timestep, encoder_hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 694, in capture_replay
    graph.capture_end()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 46, in capture_end
    _hpu_C.capture_end(self.hpu_graph)
RuntimeError: synNodeCreateWithId failed for node: spatial_convolution with synStatus 26 [Generic failure]. .
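
The warning in the log above says that passing `--bf16` overrides `use_torch_autocast` from the Gaudi configuration, so the failing run is pure bf16 rather than autocast mixed precision. A minimal sketch of that decision is shown below. The helper names `effective_precision` and `load_gaudi_config` are hypothetical, and the "autocast" branch is an assumption about what happens when `--bf16` is omitted; only the bf16 branch is confirmed by the log.

```python
import json
from pathlib import Path


def load_gaudi_config(path):
    """Read a gaudi_config.json file into a dict (hypothetical helper)."""
    return json.loads(Path(path).read_text())


def effective_precision(gaudi_config: dict, bf16_flag: bool) -> str:
    """Guess which precision mode a run ends up in."""
    if bf16_flag:
        # Confirmed by the warning in the log: mixed precision is disabled
        # and the pipeline continues in bf16 only.
        return "bf16 only"
    if gaudi_config.get("use_torch_autocast"):
        # Assumption: without --bf16 the autocast setting stays active.
        return "autocast mixed precision"
    return "fp32"


# Values taken from the GaudiConfig dump in the log.
sample = {"use_torch_autocast": True, "use_dynamic_shapes": False}
print(effective_precision(sample, bf16_flag=True))
print(effective_precision(sample, bf16_flag=False))
```

If the assumption holds, a run without `--bf16` would use autocast rather than strict FP32, which may matter when comparing the two cases.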

regisss commented 3 weeks ago

I cannot reproduce it. How did you install optimum-habana and what was the last commit?

KiwiHana commented 2 weeks ago

> I cannot reproduce it. How did you install optimum-habana and what was the last commit?

My device is a Gaudi2D.

I installed it on August 23rd with:

pip install git+https://github.com/huggingface/optimum-habana.git
git clone https://github.com/huggingface/optimum-habana

SDv2.1 runs fine in FP32 precision (without --bf16), for example: python text_to_image_generation.py --model_name_or_path /data/sd/stable-diffusion-2-1-base --prompts "An image of a squirrel in Picasso style" --num_images_per_prompt 28 --batch_size 1 --height 512 --width 512 --image_save_dir /tmp/stable_diffusion_images --use_habana --use_hpu_graphs --gaudi_config Habana/stable-diffusion-2

But SDv2.1 in BF16 precision fails with: RuntimeError: synNodeCreateWithId failed for node: spatial_convolution with synStatus 26 [Generic failure].

python text_to_image_generation.py --model_name_or_path /data/sd/stable-diffusion-2-1-base --prompts "An image of a squirrel in Picasso style" --num_images_per_prompt 28 --batch_size 1 --height 512 --width 512 --image_save_dir /tmp/stable_diffusion_images --use_habana --use_hpu_graphs --gaudi_config Habana/stable-diffusion-2 --bf16
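
Since the two commands differ only by `--bf16`, and the traceback ends in `graph.capture_end()`, one way to narrow this down might be to run a small matrix of variants: with and without `--bf16`, and with and without `--use_hpu_graphs`. The sketch below only builds and prints the four command lines from the arguments already used in this thread. The helper names `build_command` and `variants` are hypothetical, and the flag order differs slightly from the original commands, which should not matter for argument parsing.

```python
import shlex

# Arguments shared by every run, copied from the commands in this thread.
BASE_ARGS = [
    "python", "text_to_image_generation.py",
    "--model_name_or_path", "/data/sd/stable-diffusion-2-1-base",
    "--prompts", "An image of a squirrel in Picasso style",
    "--num_images_per_prompt", "28",
    "--batch_size", "1",
    "--height", "512",
    "--width", "512",
    "--image_save_dir", "/tmp/stable_diffusion_images",
    "--use_habana",
    "--gaudi_config", "Habana/stable-diffusion-2",
]


def build_command(bf16: bool, hpu_graphs: bool) -> list[str]:
    """Return the argv for one variant (hypothetical helper)."""
    cmd = list(BASE_ARGS)
    if hpu_graphs:
        cmd.append("--use_hpu_graphs")
    if bf16:
        cmd.append("--bf16")
    return cmd


def variants() -> dict[str, list[str]]:
    """All four combinations of precision flag and HPU graphs."""
    out = {}
    for bf16 in (False, True):
        for graphs in (True, False):
            precision = "bf16" if bf16 else "default"
            mode = "graphs" if graphs else "no-graphs"
            out[f"{precision}/{mode}"] = build_command(bf16, graphs)
    return out


# Print copy-pasteable command lines; nothing is executed here.
for name, cmd in variants().items():
    print(f"{name}: {shlex.join(cmd)}")
```

If the `bf16/no-graphs` variant succeeds while `bf16/graphs` fails, that would point at graph capture rather than at the bf16 convolution itself.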

KiwiHana commented 2 weeks ago

 python text_to_image_generation.py --model_name_or_path /data/stable-diffusion-2-1-base --prompts "An image of a squirrel in Picasso style" --num_images_per_prompt 28 --batch_size 1 --height 1920 --width 1080 --image_save_dir /data/stable_diffusion_images --use_habana --use_hpu_graphs --gaudi_config /data/stable-diffusion --bf16

/usr/local/lib/python3.10/dist-packages/diffusers/models/vq_model.py:20: FutureWarning: `VQEncoderOutput` is deprecated and will be removed in version 0.31. Importing `VQEncoderOutput` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQEncoderOutput`, instead.
  deprecate("VQEncoderOutput", "0.31", deprecation_message)
/usr/local/lib/python3.10/dist-packages/diffusers/models/vq_model.py:25: FutureWarning: `VQModel` is deprecated and will be removed in version 0.31. Importing `VQModel` from `diffusers.models.vq_model` is deprecated and this will be removed in a future version. Please use `from diffusers.models.autoencoders.vq_model import VQModel`, instead.
  deprecate("VQModel", "0.31", deprecation_message)
Loading pipeline components...: 100%|██████████████████████████████████████████████| 6/6 [00:00<00:00,  9.28it/s]
[INFO|pipeline_utils.py:130] 2024-08-27 04:45:04,526 >> Enabled HPU graphs.
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
[INFO|configuration_utils.py:303] 2024-08-27 04:45:04,656 >> loading configuration file /data/stable-diffusion/gaudi_config.json
[INFO|configuration_utils.py:358] 2024-08-27 04:45:04,656 >> GaudiConfig {
  "autocast_bf16_ops": null,
  "autocast_fp32_ops": null,
  "optimum_version": "1.12.0",
  "transformers_version": "4.43.3",
  "use_dynamic_shapes": false,
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "use_torch_autocast": true
}
[WARNING|pipeline_utils.py:156] 2024-08-27 04:45:04,656 >> `use_torch_autocast` is True in the given Gaudi configuration but `torch_dtype=torch.bfloat16` was given. Disabling mixed precision and continuing in bf16 only.
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 192
CPU RAM       : 2113382084 KB
------------------------------------------------------------------------------
[INFO|pipeline_stable_diffusion.py:411] 2024-08-27 04:45:09,023 >> 1 prompt(s) received, 28 generation(s) per prompt, 1 sample(s) per batch, 28 total batch(es).
  0%|          | 0/28 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py:1369: FutureWarning: `scale` is deprecated and will be removed in version 1.0.0. The `scale` argument is deprecated and will be ignored. Please remove it, as passing it will raise an error in the future. `scale` should directly be passed while calling the underlying pipeline component i.e., via `cross_attention_kwargs`.
  deprecate("scale", "1.0.0", deprecation_message)
/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py:2628: FutureWarning: `scale` is deprecated and will be removed in version 1.0.0. The `scale` argument is deprecated and will be ignored. Please remove it, as passing it will raise an error in the future. `scale` should directly be passed while calling the underlying pipeline component i.e., via `cross_attention_kwargs`.
  deprecate("scale", "1.0.0", deprecation_message)
 57%|███████████████████████████████████████████▍                                | 16/28 [22:37<15:14, 76.25s/it]
