huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Can't get ORTStableDiffusionPipeline to run on the GPU on either AWS or GCP fresh instances #1844


iuliaturc commented 6 months ago

System Info

The same problem manifests on both of these systems:

System 1
---
- Amazon EC2 instance
- Type: g5.2xlarge
- Image: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.2.0 (Amazon Linux 2)
- Python 3.10.9
- optimum==1.19.1
- onnx==1.16.0
- onnxruntime-gpu==1.16.3 (pip won't install 1.17 on this image)

System 2
---
- GCP NVIDIA A100/40GB instance
- Type: a2-highgpu-1g
- Image: deeplearning-vm
- Python 3.10.13
- optimum==1.19.1
- onnx==1.16.0
- onnxruntime-gpu==1.17.1

Who can help?

Pipelines: @philschmid ONNX Runtime: @JingyaHuang, @echarlaix

Reproduction (minimal, reproducible, runnable)

Step 1: Verify that ONNX sees the GPU:

>>> import onnxruntime as ort
>>> print(ort.get_device())
GPU
>>> print(ort.get_available_providers())
['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']
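
As far as I understand, get_available_providers() only reports which providers were compiled into the installed wheel, not whether the CUDA/cuDNN libraries actually load at runtime. A more direct probe (a sketch; the in-memory Identity graph is just a stand-in so no model download is needed) is to create a bare InferenceSession and ask which providers it actually got:

import onnx
import onnxruntime as ort
from onnx import TensorProto, helper

# Build a trivial one-node graph in memory as a stand-in model.
node = helper.make_node("Identity", ["x"], ["y"])
graph = helper.make_graph(
    [node], "probe",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [1])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [1])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
model.ir_version = 8  # keep the IR version low enough for older runtimes

# If the CUDA EP initializes, this prints
# ['CUDAExecutionProvider', 'CPUExecutionProvider']; on a silent fallback,
# only the CPU provider appears (along with the same warning as below).
sess = ort.InferenceSession(
    model.SerializeToString(), providers=["CUDAExecutionProvider"]
)
print(sess.get_providers())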

Step 2: Attempt to run this official example using the 'CUDAExecutionProvider':

from optimum.onnxruntime import ORTStableDiffusionPipeline
pipeline = ORTStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", revision="onnx", provider="CUDAExecutionProvider")

Error:

2024-05-03 19:32:16.794901908 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:861 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

From what I can tell, the requirements are met, at least on the GCP setup: CUDA 12.1, cuDNN 8.9, onnxruntime-gpu 1.17.
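To double-check that at runtime, here is a quick probe of whether the shared libraries the CUDA EP needs actually resolve from Python (a sketch; the sonames assume CUDA 12.x / cuDNN 8.x builds and would differ, e.g. libcudart.so.11.0, for wheels built against CUDA 11.8):

import ctypes

# Sonames below are assumptions for a CUDA 12.x / cuDNN 8.x setup; adjust
# them if the installed onnxruntime-gpu wheel targets a different CUDA.
for lib in ("libcudart.so.12", "libcublas.so.12", "libcudnn.so.8"):
    try:
        ctypes.CDLL(lib)
        print(lib, "-> loads")
    except OSError as exc:
        print(lib, "-> FAILED:", exc)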

Note that if I omit the provider, the pipeline runs on the CPU (I can tell because one image generation takes ~3 minutes, and nvidia-smi shows no activity on the GPU).
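One way to confirm the fallback is to ask the underlying sessions directly. This is a sketch that assumes the ONNX Runtime session is reachable as pipeline.unet.session, which may differ across optimum versions:

# Assumption: each pipeline component exposes its ORT session as `.session`.
print(pipeline.unet.session.get_providers())
# On a working GPU setup this should include 'CUDAExecutionProvider';
# after the fallback, presumably only ['CPUExecutionProvider'] shows up.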

Expected behavior

pipeline = ORTStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", revision="onnx", provider="CUDAExecutionProvider") should run with no errors.

iuliaturc commented 6 months ago

FYI, I just tried your Docker image with an additional pip install optimum[onnxruntime-gpu], and the behavior is the same.
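
For completeness, here is a quick way to rule out the common pitfall of having both the CPU-only onnxruntime wheel and onnxruntime-gpu installed side by side (the CPU build can shadow the GPU one); a sketch using only the standard library:

import importlib.metadata as md

# If both distributions show up, uninstalling both and reinstalling only
# onnxruntime-gpu is usually the safer path.
for pkg in ("onnxruntime", "onnxruntime-gpu"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")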