NAN values produced by SDXL VAE encoder

Describe the bug

I'd like to use the SDXL VAE to encode an image, but the encoder output contains only NaN values. I have set both the input and the VAE to full precision (torch.float32), but the problem persists.

Reproduction

import torch
from diffusers import StableDiffusionXLPipeline
from diffusers import DPMSolverMultistepScheduler
import numpy as np
from PIL import Image
from torch import autocast, inference_mode

from torchvision import transforms as tr

# Converts a PIL image to a float tensor in [0, 1] with shape (C, H, W)
p2t = tr.ToTensor()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_DDIM_STEPS = 50
SKIP = 0.0
ETA=1
TOTAL_STEP = int(NUM_DDIM_STEPS * (1 + SKIP))
model_id = 'stabilityai/stable-diffusion-xl-base-1.0'
ldm_stable = StableDiffusionXLPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
ldm_stable.scheduler = DPMSolverMultistepScheduler.from_config(model_id, subfolder = "scheduler", algorithm_type="sde-dpmsolver++", solver_order=2)
ldm_stable.scheduler.config.timestep_spacing = "leading"
ldm_stable.scheduler.set_timesteps(TOTAL_STEP)

image_gt = Image.open('path/to/image.png').convert('RGB')
image_gt = image_gt.resize((1024, 1024))
image_gt = p2t(image_gt) * 2 - 1
image_gt = image_gt.unsqueeze(0).to(device, dtype = torch.float32)

ldm_stable.vae.to(dtype=torch.float32)
with autocast("cuda"), inference_mode():
    w0 = ldm_stable.vae.encode(image_gt).latent_dist.sample()
    print(w0)
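
One thing that may be worth checking in the reproduction above is the autocast("cuda") context. As far as I understand, autocast runs eligible ops (convolutions, matmuls) in lower precision even when the weights and inputs are float32, so the VAE may still be computing in float16 despite the .to(dtype=torch.float32) call. The sketch below is not part of the original script: the helper name autocast_output_dtype is made up for illustration, and it falls back to CPU with bfloat16 when no GPU is available.

```python
import torch


def autocast_output_dtype(enabled: bool) -> torch.dtype:
    """Return the dtype of a float32 matmul run with or without autocast.

    Hypothetical helper for illustration only; not a diffusers API.
    """
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    # float16 on GPU (as in the reproduction), bfloat16 on CPU as a stand-in.
    low_precision = torch.float16 if device_type == "cuda" else torch.bfloat16

    # Both operands are explicitly float32, like the VAE and the input image.
    a = torch.randn(4, 4, device=device_type, dtype=torch.float32)
    b = torch.randn(4, 4, device=device_type, dtype=torch.float32)

    with torch.autocast(device_type=device_type, dtype=low_precision, enabled=enabled):
        return torch.mm(a, b).dtype


print("inside autocast: ", autocast_output_dtype(True))
print("outside autocast:", autocast_output_dtype(False))
```

If this reports torch.float16 inside the context on the GPU, running the encode call outside autocast (or under torch.autocast("cuda", enabled=False)) would be the next thing to try. Whether that removes the NaNs here is untested.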

Logs

Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.36s/it]
/root/miniforge3/lib/python3.10/site-packages/diffusers/configuration_utils.py:245: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
tensor([[[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]],

         [[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]]], device='cuda:0')
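
Since the printed tensor is truncated, a compact summary may be easier to compare across runs than the raw output. The following is a minimal sketch, assuming a helper named summarize_nonfinite (hypothetical, not from diffusers) that counts NaN and Inf entries in any tensor such as w0 or image_gt.

```python
import torch


def summarize_nonfinite(t: torch.Tensor) -> dict:
    """Count NaN and Inf entries in a tensor (hypothetical diagnostic helper)."""
    return {
        "shape": tuple(t.shape),
        "dtype": str(t.dtype),
        "nan": int(torch.isnan(t).sum()),
        "inf": int(torch.isinf(t).sum()),
    }


# Small stand-in tensor; in the reproduction this would be w0 or image_gt.
example = torch.tensor([1.0, float("nan"), float("inf")])
print(summarize_nonfinite(example))
```

Running the same check on image_gt before encoding would show whether the input is already non-finite or whether the NaNs first appear inside the encoder.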

System Info

Diffusers: 0.30.0, PyTorch: 1.12, transformers: 4.45.2, no xFormers

Running on an RTX 3090 Ti, CUDA version 11.7

Python version 3.10.14

Who can help?

@yiyixuxu @sayakpaul @DN6

huggingface / diffusers

NAN values produced by SDXL VAE encoder #9844