Open SimonRelu opened 1 year ago
Update:
It is possible to wrap an OrtValue's CUDA buffer in a cupy ndarray without a copy, using its data pointer, like this:
import numpy as np
import cupy as cp
import onnxruntime as ort

# An OrtValue backed by CUDA memory (e.g. a model output kept on the device)
x = ort.OrtValue.ortvalue_from_numpy(np.random.rand(1, 2, 3).astype(np.float32), 'cuda', 0)
nbytes = int(np.prod(x.shape())) * np.dtype(np.float32).itemsize
mem = cp.cuda.UnownedMemory(x.data_ptr(), nbytes, owner=x)
memptr = cp.cuda.MemoryPointer(mem, 0)
arr = cp.ndarray(x.shape(), dtype=cp.float32, memptr=memptr)  # zero-copy view
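(Passing owner=x makes the cupy memory object keep a reference to the OrtValue, so the underlying CUDA buffer stays alive for as long as the cupy view is in use.)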
However, as far as I know, it is not possible to go the other way, from a cupy ndarray to an OrtValue, without copying the data.
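For reference, the copy-based route looks roughly like this (a sketch; the data is staged through host memory via cp.asnumpy, which is exactly the round trip that DLPack support would avoid):
import cupy as cp
import onnxruntime as ort

y = cp.random.rand(1, 2, 3, dtype=cp.float32)
# cp.asnumpy() copies device -> host; ortvalue_from_numpy() then copies host -> device
y_ort = ort.OrtValue.ortvalue_from_numpy(cp.asnumpy(y), 'cuda', 0)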
Describe the feature request
Currently, the DLPack protocol can only be used in training builds, not in the default (non-training) build. See:
I believe it would make sense to enable this in the main build and not only the training one. Many AI modules already support this:
Having DLPack support in onnxruntime would give us "zero-cost" copies between these modules. This is not only interesting during training. Often, multiple models are used, in which case the output of one model becomes the input of the next. When we want to do pre- or postprocessing between these models, we currently can't do it without moving the data to the CPU using .numpy(). This comes with a significant performance cost.
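Concretely, a chained pipeline currently looks something like the sketch below (model paths and input/output names are made up; the point is that the intermediate result is pulled to the host with .numpy() and copied back to the device for the next model):
import numpy as np
import onnxruntime as ort

sess_a = ort.InferenceSession("model_a.onnx", providers=["CUDAExecutionProvider"])
sess_b = ort.InferenceSession("model_b.onnx", providers=["CUDAExecutionProvider"])

binding = sess_a.io_binding()
binding.bind_cpu_input("input", np.random.rand(1, 3, 224, 224).astype(np.float32))
binding.bind_output("output", "cuda")       # keep model A's result on the GPU
sess_a.run_with_iobinding(binding)
out_a = binding.get_outputs()[0]            # OrtValue backed by CUDA memory

# Postprocessing today: copy to the host, compute there, copy back for model B
host = out_a.numpy()
host = host / host.max()
out_b = sess_b.run(None, {"input": host})[0]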
Describe scenario use case
We want to use cupy to process model outputs in between different inference runs. Cupy supports the DLPack protocol, which would allow us to do this. One option would be to build onnxruntime with training support, but that makes our package quite a bit bigger, which I'd like to avoid.
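To make the ask concrete, here is a rough sketch of what this could look like if the default build exposed the DLPack bindings (to_dlpack() / from_dlpack() / __dlpack__) that, as far as I can tell, the training build already has. It continues from sess_b and the out_a OrtValue of the previous sketch, and the input/output names are again made up:
import cupy as cp
import onnxruntime as ort

# Zero-copy view of the CUDA OrtValue (assumes OrtValue implements __dlpack__)
cp_out = cp.from_dlpack(out_a)
cp_out = cp_out / cp_out.max()                   # processing stays on the GPU

# Zero-copy handoff back to an OrtValue (assumes OrtValue.from_dlpack is available)
in_b = ort.OrtValue.from_dlpack(cp_out.__dlpack__(), False)

binding_b = sess_b.io_binding()
binding_b.bind_ortvalue_input("input", in_b)     # feed model B without touching the CPU
binding_b.bind_output("output", "cuda")
sess_b.run_with_iobinding(binding_b)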