originrose / cortex

Machine learning in Clojure
Eclipse Public License 1.0

Cortex projects are bound to a specific cuda library version #107

Open cnuernber opened 7 years ago

cnuernber commented 7 years ago

This has bothered me for some time and there isn't too much I can do about it but here we go:

The runtime dependency on the cuda libraries is not ideal the way it is structured.

  1. If the user does not have cuda installed the entire system now fails.
  2. If the user does not have cudnn installed the entire system now fails.
  3. If the user has a newer (or older) version of cuda installed the entire system fails at startup, regardless of the fact that we aren't using any blisteringly new features of cuda.

What people have done for many years with opengl is bind to the actual shared library dynamically. They then look for the symbols they need in the shared library, and those symbols, along with the version of opengl detected (with an API call from the library), dictate their path forward. They dynamically switch rendering paths depending on the feature set available in opengl and oftentimes the specific hardware features available on the card.

Because the binding is dynamic, the program will still start if opengl isn't present but will exit with a nice error message. Also, because the binding is dynamic and they search for specific symbols in the shared library, they can have one wrapper library that binds to several versions of opengl and just exposes the symbols it finds.
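A minimal sketch of that "start anyway, then report clearly" behavior, assuming a POSIX system with `dlopen`. The helper name `open_optional_library` and the message text are hypothetical, not anything from cortex:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Open a shared library at run time instead of linking against it.
   Returns the handle, or NULL after printing a readable explanation.
   A NULL name asks for the running program itself (standard dlopen
   behavior), which is handy for testing. */
void *open_optional_library(const char *name)
{
    void *handle = dlopen(name, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *reason = dlerror();
        fprintf(stderr, "could not load %s (%s)\n",
                name ? name : "(self)",
                reason ? reason : "unknown error");
    }
    return handle;
}
```

A caller might pass something like `"libcudart.so.8.0"` (an assumed library name) and fall back to a CPU path when the result is NULL. On older glibc versions this needs `-ldl` at link time.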

This is the ideal situation. Currently in cortex, for instance, you have to change the project.clj in order to bind to a different version of cuda, despite the fact that we aren't using any new features in that version and thus, from a dynamic linking perspective, the change is unnecessary. This is incidental complexity that will come back to bite at some point.

The right answer here is to use an intermediate library that can do dynamic loading across the different platforms and find the symbols. You then set global pointers to the symbol value if it is found, or to null if it is not (see GLEW, the OpenGL Extension Wrangler: http://glew.sourceforge.net/).
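The global-pointer idea can be sketched roughly as follows, assuming POSIX `dlsym`. `cudaMalloc` and `cudaFree` are real CUDA runtime entry points, but the `p_` globals, the typedef names, and the choice to treat `cudaError_t` as `int` are assumptions made for illustration:

```c
#include <dlfcn.h>
#include <stddef.h>

/* Assumed signatures: the cudaError_t return type is treated as int. */
typedef int (*cuda_malloc_fn)(void **dev_ptr, size_t size);
typedef int (*cuda_free_fn)(void *dev_ptr);

/* Global entry points in the style of GLEW: NULL means "not found". */
cuda_malloc_fn p_cudaMalloc = NULL;
cuda_free_fn   p_cudaFree   = NULL;

/* Fill the globals from an already opened library handle.
   Returns how many of the symbols were found. */
int resolve_cuda_entry_points(void *library)
{
    int found = 0;
    p_cudaMalloc = (cuda_malloc_fn)dlsym(library, "cudaMalloc");
    p_cudaFree   = (cuda_free_fn)dlsym(library, "cudaFree");
    if (p_cudaMalloc != NULL) ++found;
    if (p_cudaFree != NULL) ++found;
    return found;
}
```

Callers check a pointer against NULL before using it, so a missing symbol becomes a decision for the program rather than a failure at startup.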

Then we at least allow the program to decide if cuda is a necessary dependency and, furthermore, if particular versions of cuda (and cudnn, npp, cublas) are necessary dependencies. What is stopping me from going there is the lack of a proper cross platform build system where I can build a library for at least linux, mac, and windows. That and the time required to actually do this.

There may be a solution in the dynamic linking facilities now present in Java but that path needs to be researched. To do this with javacpp we would need to build a small wrapper library that did the dynamic binding to the shared libraries and the symbols in the shared libraries.

In any case, a best-in-class CUDA development system would not have this issue. I suspect the same type of issue would be present should we decide to put effort into opencl.

harold commented 7 years ago

GLEW is an interesting parallel here, but I think it solves a harder problem than the one we have, in an environment (gl extensions) that was designed to be helpful in solving the problem.

I'd advocate at least trying smarter Java stuff before going into a full cross-platform wrapper library mode. Though, I also 100% agree that the problem could be thoroughly solved in that way (at some cost).

cnuernber commented 7 years ago

It does have to solve the extension problem but it also has to solve the "which symbols are in the library" problem. For example, there are different symbols available in GL3 than in GL4.

In the sense that it needs to do dynamic symbol resolution at runtime after loading an indeterminate version of a shared library it is the same problem.
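As a rough sketch of "load an indeterminate version, then ask it what it is", assuming POSIX `dlopen`/`dlsym`: `cudaRuntimeGetVersion` is a real CUDA runtime call that reports 1000 * major + 10 * minor, while the helper names below and the `int` return type standing in for `cudaError_t` are assumptions:

```c
#include <dlfcn.h>
#include <stddef.h>

/* Try each candidate name in order and return the first library that
   opens. *index_out (if not NULL) receives the position of the winner.
   Returns NULL when none of the candidates can be loaded. */
void *open_first_available(const char *const names[], size_t count,
                           size_t *index_out)
{
    for (size_t i = 0; i < count; ++i) {
        void *handle = dlopen(names[i], RTLD_NOW | RTLD_LOCAL);
        if (handle != NULL) {
            if (index_out != NULL) *index_out = i;
            return handle;
        }
    }
    return NULL;
}

/* Ask the loaded library for its version through its own API.
   Returns 0 when the symbol is missing or the call reports an error. */
int query_runtime_version(void *library)
{
    typedef int (*get_version_fn)(int *version);
    get_version_fn get_version =
        (get_version_fn)dlsym(library, "cudaRuntimeGetVersion");
    int version = 0;
    if (get_version == NULL || get_version(&version) != 0) return 0;
    return version;
}
```

The candidate list might look like `{"libcudart.so.8.0", "libcudart.so.7.5"}` (assumed names), with the returned version then selecting which code path to use.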

So, there are a couple of dynamic java options, but the one that seems most promising is linked below. I agree that understanding this route may help, and it would avoid the need for a cross platform build system, especially considering the cuda bindings are all c interfaces:

https://github.com/jnr/jnr-ffi

On the other hand, it isn't nearly as general: it will have limitations w/r/t the types of headers it understands and, most likely, binding with c++. So in the long run the cross platform build system will allow a higher quality ecosystem of bindings, assuming someone wants to build/maintain it.

harold commented 7 years ago

Sounds right. I know @charlesg3 messed with jnr a bit when doing XGBoost---not sure what the results there were.

charlesg3 commented 7 years ago

jnr does allow dynamic binding to a library and doesn't require any information about headers... as such it sort of keeps the external library outside of your concern (just that it needs to exist on the LD_LIBRARY_PATH of the system)... which is super nice / minimal. The downside is that jnr has a bit of a low-level feel to it and the documentation is a bit sparse.

benkamphaus commented 7 years ago

In case it's relevant for this discussion, there was a talk this year at Conj (2016) that focused pretty heavily on using jnr from Clojure.