pykeio / ort

Fast ML inference & training for Rust with ONNX Runtime
https://ort.pyke.io/
Apache License 2.0
786 stars 91 forks

Prebuilt binaries for the common combination of CUDA 12 & cudnn8, in addition to 12 & cudnn9 #235

Closed zopieux closed 2 months ago

zopieux commented 2 months ago

It was recently documented (well, it doesn't really work per #234, but assuming it eventually does) that CUDA 11 depends on cudnn8, while CUDA 12 depends on cudnn9.

I would like to point out that, given how large these development & runtime environments are, relying on prebuilt images from "official" sources, such as those found on Docker Hub, is often a necessity.

By choosing to only support CUDA 12 + cudnn9, this prevents e.g. the popular pytorch images from running ort, because as you can see in this list, only cuda12.N-cudnn8 is available.

If not otherwise imposed by onnxruntime's technical requirements, would it be possible to provide prebuilt provider binaries for these combinations:

| CUDA | cudnn | State |
|------|-------|-------|
| 11 | 8 | already available |
| 12 | 8 | new, this issue |
| 12 | 9 | already available |

And, in general, try to follow the version combinations of popular prebuilt runtime environments like pytorch?

Thanks for the awesome work!

decahedron1 commented 2 months ago

Not sure how to detect cuDNN 8/9 like we can detect CUDA 11/12. Would an environment variable to override this be OK? (I see the pytorch containers set NV_CUDNN_VERSION, so it could read that too.)
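For reference, the selection logic could look something like this minimal sketch. The function name `pick_cudnn_major`, the fallback default, and the exact precedence order are my assumptions for illustration, not ort's actual implementation:

```rust
/// Hypothetical sketch of cuDNN major-version selection: an explicit
/// override (e.g. ORT_CUDNN_VERSION=8) wins, then the major component
/// of the NV_CUDNN_VERSION variable set by NVIDIA's container images
/// (e.g. "8.9.7.29"), and finally a default of 9 for CUDA 12 builds.
fn pick_cudnn_major(ort_override: Option<&str>, nv_cudnn: Option<&str>) -> u32 {
    // Explicit user override takes precedence.
    if let Some(n) = ort_override.and_then(|v| v.trim().parse().ok()) {
        return n;
    }
    // Fall back to the container-provided version string, keeping
    // only the major component before the first dot.
    if let Some(n) = nv_cudnn
        .and_then(|v| v.split('.').next())
        .and_then(|major| major.trim().parse().ok())
    {
        return n;
    }
    9 // assumed default for CUDA 12 builds
}

fn main() {
    let major = pick_cudnn_major(
        std::env::var("ORT_CUDNN_VERSION").ok().as_deref(),
        std::env::var("NV_CUDNN_VERSION").ok().as_deref(),
    );
    println!("selecting cuDNN {major} binaries");
}
```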

zopieux commented 2 months ago

Ah I see. Yes, an env variable would work! Thanks.

decahedron1 commented 2 months ago

@zopieux Could you please test https://github.com/pykeio/ort/commit/bc764d388f213a631db934029af6463f19cacd5e?

```toml
[dependencies]
ort = { git = "https://github.com/pykeio/ort.git", rev = "bc764d388f213a631db934029af6463f19cacd5e" }
```

zopieux commented 2 months ago

Using the above commit, inside Docker image docker.io/pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime and forcing ORT_CUDNN_VERSION=8 (it's not set in that image AFAICT), cargo build succeeds.

Then at runtime, after toying with LD_LIBRARY_PATH because these libraries are _fucking all over the place_ except where LD_LIBRARY_PATH is natively set, it works! :rocket:

```shell
$ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/app/target/debug:/opt/conda/lib:/opt/conda/lib/python3.10/site-packages/torch/lib"
$ cargo run && echo cool
cool
```

Thanks!

decahedron1 commented 2 months ago

> and forcing ORT_CUDNN_VERSION=8 (it's not set in that image AFAICT)

Hmm, it should've picked up NV_CUDNN_VERSION, which the image does set, but oh well, at least it works at all 😂

Fixed in v2.0.0-rc.4

zopieux commented 2 months ago

> NV_CUDNN_VERSION, which the image does set

Wait, am I missing something?

```shell
$ podman run -ti --device nvidia.com/gpu=all --rm docker.io/pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime \
    bash -c 'env | grep -E "(CUDA|CUDNN|NV|TORCH)"'
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
PYTORCH_VERSION=2.3.1
```

Oh, did you use the devel flavor by any chance?

decahedron1 commented 2 months ago

> Oh, did you use the devel flavor by any chance?

...shit 🙃

zopieux commented 2 months ago

It's okay, I guess it's working as intended that, to build a Rust library, one needs the devel flavor. :)