microsoft / planetary-computer-containers

Container definitions for the Planetary Computer
MIT License

Getting onnxruntime to work with CUDAExecutionProvider on gpu-pytorch container #33

Closed weiji14 closed 2 years ago

weiji14 commented 2 years ago

Hi again, just trying to use onnxruntime to run a neural network as a follow up from https://github.com/microsoft/planetary-computer-containers/issues/32#issuecomment-1100211839. The CPU execution works fine, but it seems that the GPU execution isn't working for some reason.

Steps to reproduce on the gpu-pytorch container.

pip install onnxruntime-gpu

then restart the kernel before running the below

import onnxruntime

print(onnxruntime.__version__)
print(onnxruntime.get_available_providers())
# 1.11.0
# ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']

so it seems to know that there is a CUDA-capable GPU. But when I try to start an onnxruntime session, it only picks up the CPU. Get a sample .onnx file, e.g. from https://media.githubusercontent.com/media/onnx/models/main/vision/object_detection_segmentation/tiny-yolov2/model/tinyyolov2-7.onnx

ort_session = onnxruntime.InferenceSession(
    path_or_bytes="tinyyolov2-7.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = ort_session.get_inputs()[0].name
print(input_name)

produces a warning:

2022-04-15 15:09:38.624858540 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:552 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

Looking at the output of nvidia-smi though, the CUDA version is 11.0 which should be ok if I understand https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements correctly:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P8    11W /  70W |      0MiB / 16127MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
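(Editor's note, not from the original thread: the `CUDA Version` that nvidia-smi prints is the highest CUDA runtime the *driver* supports, which can differ from the cudatoolkit actually installed in the container environment. A small stdlib-only sketch for pulling both versions out of the banner line above:)

```python
import re

# First banner line from the nvidia-smi output above
header = ("| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    "
          "CUDA Version: 11.0     |")

def parse_nvidia_smi_header(line):
    """Extract (driver_version, cuda_version) from an nvidia-smi banner line.

    Note: this CUDA version is the maximum the driver supports; the
    cudatoolkit installed in the conda environment may be older.
    """
    m = re.search(r"Driver Version: ([\d.]+)\s+CUDA Version: ([\d.]+)", line)
    return m.groups() if m else (None, None)

driver, cuda = parse_nvidia_smi_header(header)
print(driver, cuda)  # → 450.51.06 11.0
```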

So I'm wondering if there's some other library that needs to be added to the container to make onnxruntime's GPU execution work. Maybe related to https://github.com/microsoft/onnxruntime/issues/11092
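(One way to narrow down which library is missing, sketched with the stdlib only: ask the dynamic loader which CUDA/cuDNN shared libraries it can resolve inside the container. The library names below are the usual Linux SONAME stems; this is an illustrative check, not something from the thread.)

```python
import ctypes.util

def find_gpu_libs(names=("cudart", "cublas", "cudnn")):
    """Map each CUDA/cuDNN library name to the path the dynamic loader
    resolves, or None if it cannot be found on the search path."""
    return {name: ctypes.util.find_library(name) for name in names}

# Any None value here points at a shared library that onnxruntime-gpu
# would fail to load when creating the CUDAExecutionProvider.
print(find_gpu_libs())
```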

Another thing I'd like to ask: is there room to get onnxruntime into the gpu-pytorch image? Happy to submit a pull request to add it.

TomAugspurger commented 2 years ago

Can you try !mamba install -y -c conda-forge onnxruntime to see if that does the trick?

If that's successful I'll get it added to the gpu-pytorch image.

weiji14 commented 2 years ago

> Can you try !mamba install -y -c conda-forge onnxruntime to see if that does the trick?

Nope, doesn't work. The conda-forge onnxruntime build seems to be CPU-only for now; we need to wait for https://github.com/conda-forge/onnxruntime-feedstock/pull/7 to be merged.

I did manage to get it to work by updating cudatoolkit from 10.2 to 11.6 like so:

!mamba update -y cudatoolkit
!pip install onnxruntime-gpu

i.e. this line in the lockfile needs to change:

https://github.com/microsoft/planetary-computer-containers/blob/55593e0a10427278ef00b2629f8e35f069016efd/gpu-pytorch/conda-linux-64.lock#L42

Is the plan to stick with CUDA 10.2? Or can the next container update use a newer CUDA version >11?
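(A small side sketch: the pinned cudatoolkit version can be read straight out of a conda lock file. The lock line below is a made-up example following conda-forge's usual package-naming pattern, not the actual line from the repo's lockfile.)

```python
import re

# Hypothetical lock-file entry following conda-forge's naming convention
lock_line = ("https://conda.anaconda.org/conda-forge/linux-64/"
             "cudatoolkit-10.2.89-h713446e_10.tar.bz2")

def pinned_cudatoolkit_version(line):
    """Return the cudatoolkit version pinned in a conda lock-file line,
    or None if the line pins a different package."""
    m = re.search(r"/cudatoolkit-(\d+(?:\.\d+)*)-", line)
    return m.group(1) if m else None

print(pinned_cudatoolkit_version(lock_line))  # → 10.2.89
```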

TomAugspurger commented 2 years ago

Thanks.

We should be able to update to CUDA 11.x. I'll take a look at that this week.