uncomplicate / neanderthal

Fast Clojure Matrix Library
http://neanderthal.uncomplicate.org
Eclipse Public License 1.0

Not able to run native tests of neanderthal successfully #127

Closed behrica closed 2 years ago

behrica commented 2 years ago

While we discussed https://github.com/uncomplicate/deep-diamond/issues/15, the issue that Neanderthal no longer finds libmkl_rt.so (even when it is globally installed) came up as a separate problem.

I prepared a Dockerfile which exposes the issue; maybe it is useful.

# failing with
# Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
#/tmp/libneanderthal-mkl-0.33.07653633467081296505.so: libmkl_rt.so: cannot open shared object file: No such file or directory

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN sh ./l_onemkl_p_2022.1.0.223.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3
RUN lein test uncomplicate.neanderthal.mkl-test
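
For what it's worth, a quick way to check whether the dynamic loader can see libmkl_rt.so at all (generic Linux diagnostics; the output depends on the system, so this is only a sketch of how to narrow the problem down):

```shell
# Ask the dynamic linker cache whether libmkl_rt is registered at all;
# an empty result is consistent with the UnsatisfiedLinkError above.
ldconfig -p | grep libmkl_rt || echo "libmkl_rt.so is not in the ldconfig cache"

# Also show which extra directories the JVM process would search:
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
```
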
behrica commented 2 years ago

Even adding the bytedeco dependency does not make it work:

lein update-in :dependencies conj "[org.bytedeco/mkl-platform-redist \"2022.0-1.5.7\"]" -- test uncomplicate.neanderthal.mkl-test
behrica commented 2 years ago

So it seems to me that neither the latest global MKL nor the latest [org.bytedeco/mkl-platform-redist "] makes the Neanderthal native backend work.

behrica commented 2 years ago

Using an older bytedeco version (as documented here: https://neanderthal.uncomplicate.org/articles/getting_started.html), but keeping MKL globally installed, fails with a different error:

# failing with
# actual result:
# clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
#  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN sh ./l_onemkl_p_2022.1.0.223.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD  ["lein", "update-in" ,":dependencies", "conj" ,"[org.bytedeco/mkl-platform-redist \"2020.3-1.5.4\"]" ,"--", "test" ,"uncomplicate.neanderthal.mkl-test" ]

This fails with:

Actual result did not agree with the checking function.
Actual result:
clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)
  uncomplicate.neanderthal.internal.host.buffer_block.RealUploMatrix.host(buffer_block.clj:1243)
behrica commented 2 years ago

Same without MKL installed:

# failing with
# actual result:
# clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
#  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD  ["lein", "update-in" ,":dependencies", "conj" ,"[org.bytedeco/mkl-platform-redist \"2020.3-1.5.4\"]" ,"--", "test" ,"uncomplicate.neanderthal.mkl-test" ]
blueberry commented 2 years ago

What I don't really get is the whole .1 and .2 suffix to libmkl_rt.so in newer versions. I understand that it's a versioning thing, but the official documentation in these newer version (https://www.intel.com/content/dam/develop/external/us/en/documents/onemkl-developerguide-linux.pdf) explicitly states `libmkl_rt" as the build dependency, exactly what I was always using to build neanderthal-mkl...

I guess that I'll have to see how to re-build neanderthal to the latest MKL, and distribute that version as the "official" one. This should probably require users to upgrade their MKL to the recent one, too.

behrica commented 2 years ago

I have seen that, in the latest version, the ".1" and ".2" suffixed libmkl_rt.so files symlink to each other. So I got past that point and now it finds the library. But now I am getting other errors; each configuration I try fails with a different error.

behrica commented 2 years ago

Doing this finds libmkl_rt.so, but it then fails on something else; see the comment in the Dockerfile.

# failing with
#Actual result did not agree with the checking function.
#Actual result:
#clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
#  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
#RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18483/l_onemkl_p_2022.0.2.136.sh
RUN sh ./l_onemkl_p_2022.0.2.136.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

ENV LD_LIBRARY_PATH="/opt/intel/oneapi/mkl/2022.0.2/lib/intel64"
CMD [ "lein", "update-in", ":dependencies" ,"conj", "[org.bytedeco/mkl-platform-redist \"2022.0-1.5.7\"]", "--", "test", "uncomplicate.neanderthal.mkl-test" ]
behrica commented 2 years ago

If I do not use bytedeco, I get a different error. Here I found a slightly older version of MKL which matches the bytedeco version number:

# failing with
#java: symbol lookup error: /opt/intel/oneapi/mkl/2022.0.2/lib/intel64/libmkl_intel_thread.so.2: undefined symbol: omp_get_num_procs
#Tests failed.

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3
#RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh
RUN wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18483/l_onemkl_p_2022.0.2.136.sh
RUN sh ./l_onemkl_p_2022.0.2.136.sh -a --silent  --eula accept

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

ENV LD_LIBRARY_PATH="/opt/intel/oneapi/mkl/2022.0.2/lib/intel64"
CMD [ "lein", "test", "uncomplicate.neanderthal.mkl-test" ]
behrica commented 2 years ago

This version of MKL, l_onemkl_p_2022.1.0.223.sh, has in directory:

/opt/intel/oneapi/mkl/2022.1.0/lib/intel64

lrwxrwxrwx 1 root root        14 Mar 29 15:07 libmkl_rt.so -> libmkl_rt.so.2
-rwxr-xr-x 1 root root  11300224 Mar 11 08:07 libmkl_rt.so.2
-rw-r--r-- 1 root root  12244638 Mar 11 08:07 libmkl_scalapack_ilp64.a
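
Since some layouts ship only the versioned libmkl_rt.so.2, the unversioned name can be recreated by hand. A minimal sketch in a scratch directory (the `touch` stands in for the real library; in a real install you would work inside the actual MKL lib directory):

```shell
# Recreate the libmkl_rt.so -> libmkl_rt.so.2 symlink in a scratch directory.
# In a real install this would be done in .../mkl/<version>/lib/intel64.
dir=$(mktemp -d)
touch "$dir/libmkl_rt.so.2"            # stand-in for the real library file
ln -sf libmkl_rt.so.2 "$dir/libmkl_rt.so"
ls -l "$dir/libmkl_rt.so"
```
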
behrica commented 2 years ago

And indeed, by setting LD_LIBRARY_PATH to "/opt/intel/oneapi/mkl/2022.1.0/lib/intel64", it seems to find the library. But then it gives a different error: java: symbol lookup error: /opt/intel/oneapi/mkl/2022.1.0/lib/intel64/libmkl_intel_thread.so.2: undefined symbol: omp_get_num_procs
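
For reference, omp_get_num_procs is an OpenMP runtime symbol: libmkl_intel_thread expects Intel's libiomp5.so to be loadable at run time. One workaround that may help is preloading it explicitly; the compiler-runtime path below is an assumption for a oneAPI layout and has to be adapted to the actual install:

```shell
# libmkl_intel_thread.so needs the Intel OpenMP runtime (libiomp5.so).
# Preload it before running the tests; the path below is illustrative only.
export LD_PRELOAD=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libiomp5.so
lein test uncomplicate.neanderthal.mkl-test
```
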

behrica commented 2 years ago

I give up at this point. Following the instructions, I cannot find any combination of MKL, "org.bytedeco/mkl-platform-redist" and configuration that makes the Neanderthal tests pass natively. So I am wondering what setup people here are using.

Probably some "old" setup which cannot be re-created today (as the library versions are gone).

blueberry commented 2 years ago

It seems to me that your MKL distribution is missing libiomp5.so, or that you have a wrong iomp5 installation from setting it up manually? Please see the mention of that library in https://neanderthal.uncomplicate.org/articles/getting_started.html.

This usually ends up in the right place automatically, but it might be broken in some installations, as people create countless variations.

behrica commented 2 years ago

I am just looking at that. It is true that my Docker image has fewer things than a "normal" OS.

behrica commented 2 years ago

I have now switched to installing MKL as a Debian package; it does not make things better:

Actual result did not agree with the checking function.
Actual result:
clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)
  uncomplicate.neanderthal.internal.host.buffer_block.RealUploMatrix.host(buffer

FROM clojure:lein-2.9.8-focal
RUN apt-get update && apt-get -y install git wget python3 intel-mkl
RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

#ENV LD_LIBRARY_PATH="/opt/intel/oneapi/mkl/2022.0.2/lib/intel64"
CMD [ "lein", "update-in", ":dependencies" ,"conj", "[org.bytedeco/mkl-platform-redist \"2022.0-1.5.7\"]", "--", "test", "uncomplicate.neanderthal.mkl-test" ]
behrica commented 2 years ago

Ok, I finally found a working solution, in the form of a Dockerfile:

FROM clojure:lein-2.9.8-focal
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install git wget python3 intel-mkl

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD [ "lein", "test", "uncomplicate.neanderthal.mkl-test" ]

Very simple, even.
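
For anyone wanting to reproduce this, the Dockerfile above can be exercised with the usual commands (the tag name is arbitrary):

```shell
# Build the image from the Dockerfile above and run the test suite in it.
docker build -t neanderthal-mkl-test .
docker run --rm neanderthal-mkl-test
```
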

behrica commented 2 years ago

But, in my view, it is not true (as the instructions here suggest) that no MKL installation is needed when adding the bytedeco dependency:


Add a MKL distribution jar [org.bytedeco/mkl-platform-redist "2020.3-1.5.4"] as your project’s dependency.

Neanderthal will use the native CPU MKL binaries from that jar automatically, so you don’t need to do anything else


This does fail:

FROM clojure:lein-2.9.8-focal
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get -y install git

RUN git clone https://github.com/uncomplicate/neanderthal.git

WORKDIR /tmp/neanderthal
RUN git checkout e01511ff47605f2e4031d58899b303e4435d58e3

CMD [ "lein", "update-in", ":dependencies" ,"conj", "[org.bytedeco/mkl-platform-redist \"2020.3-1.5.4\"]", "--", "test", "uncomplicate.neanderthal.mkl-test" ]

with

Actual result:
clojure.lang.ExceptionInfo: LAPACK error. {:bad-argument 5, :error-code -5}
  uncomplicate.neanderthal.internal.host.mkl.FloatSYEngine.copy(mkl.clj:2065)
  uncomplicate.neanderthal.internal.host.buffer_block.RealUploMatrix.host(buffer_block.clj:124

So I would say it requires "a lot of luck" for adding "[org.bytedeco/mkl-platform-redist "2020.3-1.5.4"]" and "doing nothing else" to actually work.

behrica commented 2 years ago

It seems to me that installing "intel-mkl" via apt does more than just putting the required ".so" files somewhere (which is all the bytedeco jar can do). I saw a lot of things happening during the apt installation of "intel-mkl" about replacing LAPACK-related libraries with other alternatives. The interactive installation asks 3 or 4 questions about this.

blueberry commented 2 years ago

I always recommend installing Intel MKL globally, as this is what I use. Everything else is something that people ask me to support, and I try to satisfy these demands as much as I can. Any help in that regard is always welcome, but the ground there moves from time to time.

blueberry commented 2 years ago

It seems to me that installing "intel-mkl" via apt does more than just putting the required ".so" files somewhere (which is all the bytedeco jar can do). I saw a lot of things happening during the apt installation of "intel-mkl" about replacing LAPACK-related libraries with other alternatives. The interactive installation asks 3 or 4 questions about this.

The stuff that you see is needed only for building native dependencies, which is what I need. For using neanderthal, only the visibility of the appropriate .so files should be enough (I've tested this multiple times on multiple OSes, but who knows ;)

behrica commented 2 years ago

One way to address this is to try to maintain a single Docker image for the Clojure Data Science community. I do this in some form here: https://github.com/behrica/clj-py-r-template/blob/master/docker-base/Dockerfile

It is setup to allow the R and python bindings to Clojure to work out of the box.

I know that the Clojure community is not a very big fan of Docker-based development, but maybe it is worth extending the above Docker image to explicitly support Neanderthal, and therefore Deep Diamond, out of the box.

What do you think?

I could give it a go and try to setup all needed stuff for deep-diamond in there as well.

blueberry commented 2 years ago

Of course it would be good to have it as an option. I don't use docker, but some people certainly prefer it, so I don't see how sharing 3rd party setups could hurt. It would be best if you could set it up as a github repo, and link it here.

jsa-aerial commented 2 years ago

@behrica Hi Carsten, I have the needed MKL libs for Linux, Mac, and Win that I created for installation for Saite. They are all in compressed archives. These have always worked for me across various machines, and OS versions (only Intel Mac - no new Arm stuff) and Win10 for Windows. For Linux and (Intel) Mac, aerosaite, the self installing uberjar variant, comes with scripts for running it that setup the paths for the MKL. This too, has always worked for various users. Aerosaite automatically downloads and installs the MKL libs to a local directory relative to the .saite home directory. BUT, you could manually grab these if you wish and install them in some similar location that makes sense for you.

Linux Mac Win

I am unsure about how to automatically set the path for Win (someone recently gave me an idea of what it should be so maybe the next release the Win scripts will have that as well).

I'm not sure if your setup is 'special' in some way that would keep this from working, but it may be worth a try. As I say, this has always worked. The scripts are in the resources folder at the aerosaite github (link above).

blueberry commented 2 years ago

Hi @jsa-aerial that is really helpful. Maybe we can make this or some more focused standalone version of this an official recommendation for people that for some reason or another can't make the official vendor binaries work on their system?

blueberry commented 2 years ago

Just a quick note: Neanderthal's MKL dependency does not need any installation other than the lib files being in any location where the appropriate OS looks for shared libraries. Even copy/paste works.
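
A minimal sketch of that copy/paste approach, assuming a oneAPI-style source directory (a scratch file stands in for the real libraries here so the sketch is self-contained):

```shell
# Make the MKL .so files visible by placing them in a directory that is
# on LD_LIBRARY_PATH; no installer is involved.
libdir=$(mktemp -d)
# Real usage (the source path is an assumption for a oneAPI install):
#   cp /opt/intel/oneapi/mkl/2022.1.0/lib/intel64/libmkl_*.so* "$libdir"/
touch "$libdir/libmkl_rt.so"                 # stand-in for this sketch
export LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```
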

behrica commented 2 years ago

@jsa-aerial I do agree that we should have more "instructions"/variants for getting MKL installed (and deep-diamond working). I "nearly" managed to extend the polyglot Dockerfile to have everything working. By "all" I mean deep-diamond running in a Docker container supporting native (MKL), CUDA GPU and OpenCL GPU.

"Working" I measure by having all the deep-diamond tests pass.

Even for non-Docker users, reading the Dockerfile can be useful. See the last sections here: https://github.com/behrica/clj-py-r-template/blob/master/docker-base/Dockerfile

It is nearly working... I dived very deep into the issues and tried a lot of different things.

It would be very helpful if somebody with more knowledge of CUDA/OpenCL/Linux would have a look.

The Dockerfile can be built as usual with docker build . and run on (hopefully) any machine with a GPU as docker run --gpus all -w /tmp/deep-diamond <image-id> lein test

Currently I get this error, and I am not sure what to try next.

Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so: /usr/lib/x86_64-linux-gnu/libOpenCL.so: version `OPENCL_2.2' not found (required by /root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so)

Could somebody help out with this ?
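
For context, the `OPENCL_2.2' version error means the system's ICD loader (libOpenCL.so) exports older symbol-version tags than the ones libjniOpenCL.so was linked against. One way to inspect this (a diagnostic only; whether a newer ocl-icd loader resolves it is an assumption to verify):

```shell
# List the symbol-version tags the installed OpenCL ICD loader exports.
# If OPENCL_2.2 is absent, libjniOpenCL.so cannot be loaded against it.
objdump -T /usr/lib/x86_64-linux-gnu/libOpenCL.so | grep -o 'OPENCL_[0-9.]*' | sort -u
```

If OPENCL_2.2 is missing from that list, a base image with a newer ocl-icd loader might help, but that is a guess rather than a tested fix.
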

behrica commented 2 years ago

As you see in the Dockerfile, I settled on CUDA 11.4; with 11.6 I had even weirder issues and did not get "this far".

That won't work, as ClojureCUDA is tied to a specific CUDA version that has to be installed on your machine in addition to the Nvidia drivers. This is currently 11.6.1.

Additionally, Deep Diamond requires Nvidia's cuDNN too. On Arch Linux, both are available as packages (cuda and cudnn) through pacman. On other systems, they are fairly widely available, and Nvidia offers click-through installers on its main website too.

Please see details at the ClojureCUDA web page.

behrica commented 2 years ago

Just a quick note: Neanderthal's MKL dependency does not need any installation other than the lib files being in any location where the appropriate OS looks for shared libraries. Even copy/paste works.

The question is which precise shared libraries it needs. It seems to me that, depending on how MKL is installed, different libraries get installed in the appropriate places. And I had issues with wrong GLIBC versions and so on.

behrica commented 2 years ago

I would advocate a Docker solution as the "out-of-the-box" and quickest route to trying deep-diamond (at least for Linux users with Docker...).

behrica commented 2 years ago

@behrica Hi Carsten, I have the needed MKL libs for Linux, Mac, and Win that I created for installation for Saite. They are all in compressed archives. These have always worked for me across various machines, and OS versions (only Intel Mac - no new Arm stuff) and Win10 for Windows. For Linux and (Intel) Mac, aerosaite, the self installing uberjar variant, comes with scripts for running it that setup the paths for the MKL. This too, has always worked for various users. Aerosaite automatically downloads and installs the MKL libs to a local directory relative to the .saite home directory. BUT, you could manually grab these if you wish and install them in some similar location that makes sense for you.

Linux Mac Win

Interesting approach. In my Dockerfile I take the other pathway: I download and run the official installers in "silent mode".

As we are in a fixed environment (a fixed OS in the Docker container), this should always work as well.

behrica commented 2 years ago

Usually I also put a Docker image on Docker Hub, https://hub.docker.com/repository/docker/behrica/clj-py-r, so this would be the most "out-of-the-box" working solution.

The idea of the Docker container is to expose a Clojure nREPL on a given port, to which you can connect from CIDER/Calva or whatever nREPL client.

This works nicely on a remote machine as well, tunneling the nREPL port via ssh. Given how Clojure works, it is often not necessary to have the source files on the remote machine/Docker, but it depends on what the code does.

Setting this "file sync" up is the most complicated part, but it is supported and documented elsewhere.

blueberry commented 2 years ago

@jsa-aerial I do agree that we should have more "instructions"/variants for getting MKL installed (and deep-diamond working). I "nearly" managed to extend the polyglot Dockerfile to have everything working. By "all" I mean deep-diamond running in a Docker container supporting native (MKL), CUDA GPU and OpenCL GPU.

"Working" I measure by having all the deep-diamond tests pass.

Even for non-Docker users, reading the Dockerfile can be useful. See the last sections here: https://github.com/behrica/clj-py-r-template/blob/master/docker-base/Dockerfile

It is nearly working... I dived very deep into the issues and tried a lot of different things.

It would be very helpful if somebody with more knowledge of CUDA/OpenCL/Linux would have a look.

The Dockerfile can be built as usual with docker build . and run on (hopefully) any machine with a GPU as docker run --gpus all -w /tmp/deep-diamond <image-id> lein test

Currently I get this error, and I am not sure what to try next.

Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so: /usr/lib/x86_64-linux-gnu/libOpenCL.so: version `OPENCL_2.2' not found (required by /root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so)

Could somebody help out with this ?

An OpenCL platform loader has to be present on the system if you want to use OpenCL with Neanderthal. Please note that Deep Diamond only supports dense layers for OpenCL, via the general Neanderthal engine. Please see the ClojureCL documentation for details.

behrica commented 2 years ago

What do you mean by "an OpenCL platform loader has to be present"? I have clinfo installed, if that is what you mean. It runs and tells me something about my OpenCL driver, I believe:

Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 3.0 CUDA 11.6.110
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd
  Platform Host timer resolution                  0ns
  Platform Extensions function suffix             NV

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     Tesla P40
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 3.0 CUDA
  Driver Version                                  510.54
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 00:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               30
  Max clock frequency                             1531MHz
  Compute Capability (NV)                         6.1
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024

....
.....
ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
        NOTE:   your OpenCL library only supports OpenCL 2.1,
                but some installed platforms support OpenCL 3.0.
                Programs using 3.0 features may crash
                or behave unexpectedly

Does this mean that OpenCL is not supported on my GPU/driver? It seems to say that I have OpenCL 2.1 while I need 2.2. But the card supports 3.0... Is this correct?

Sorry, complete newbie on OpenCL.

blueberry commented 2 years ago

This means your setup should be OK. Your only implementation is Nvidia's, which supports OpenCL 1.2. OpenCL 3 is basically 1.2 repackaged, and OpenCL 2 has the most features but is left as a vestige, as Nvidia and Apple sabotaged it. Complicated, I know...

behrica commented 2 years ago

Thanks, that helps. So what can I do about:

Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so: /usr/lib/x86_64-linux-gnu/libOpenCL.so: version `OPENCL_2.2' not found (required by /root/.javacpp/cache/opencl-3.0-1.5.7-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so)

It happens during "lein test", I suppose during the tests which use OpenCL. So "something" is still wrong in my setup, I guess.

So do I assume correctly that "opencl-3.0-1.5.7" requires OpenCL 3.0 (or at least more than 2.1)? The dependency on it comes from deep-diamond.

blueberry commented 2 years ago

It means that javacpp has problems finding OpenCL on your system. However, note that Neanderthal/ClojureCL does not use javacpp for that, but another unrelated library. Javacpp's dependency on OpenCL is probably coincidental, as I don't use it directly, and javacpp's dnnl library tries to load it on its own (there is an old solved issue on the javacpp GitHub that might give more info). If the Neanderthal OpenCL tests pass, everything should be OK with your system's OpenCL. Why does javacpp have problems? My hunch is that your Docker setup misses something, but I can't be sure since I don't use Docker.

behrica commented 2 years ago

My Docker image is an Ubuntu 20.04 image, so it is Ubuntu in most regards.

I would like to promote the usage of Neanderthal, but the installation of it (or rather of its dependencies MKL and CUDA/OpenCL) is a gigantic hurdle. I consider myself a very experienced Java/Clojure/Linux/Docker user, but I have no idea what I am doing here to get it to work.

I am also thinking that Docker is the only way out, but that view is unfortunately not shared by lots of people.

I think that maintaining and publishing a Dockerfile and image with a working deep-diamond, where the user only needs to type "docker run --gpus all xxxx", is important here.

I thought I could do this on my own, but I think that is not the case. I know too little about CUDA, OpenCL and extending Java with native code to bring this forward myself.

The installation instructions are too general to allow me to keep working on the Dockerfile efficiently.

behrica commented 2 years ago

I propose to take a step back: I will work on a minimal, Ubuntu-based Dockerfile whose only goal is to set up MKL, CUDA and OpenCL so that the Neanderthal test suite passes inside it.

Maybe I could contribute that Dockerfile to the Neanderthal GitHub. I am not sure I can get it working by myself, but maybe we could collaborate on it in some form.

At least by reviewing it and seeing whether I am doing something that cannot work. What do you think? Are you interested in supporting this, even though you are not a Docker user yourself?

blueberry commented 2 years ago

Yes, sure.

Fortunately, Neanderthal is a Java library, so it does not care whether it runs in docker or wherever else. As for the github, that's why I think the best home for the docker setup is a separate github repository.

I understand that it looks overwhelming, but I believe it is mostly because you're trying to fit together 10 moving parts, half of which you have no experience with. In reality, it is MUCH simpler:

For the Neanderthal MKL backend to work, you ONLY need the MKL .so files somewhere on your LD_LIBRARY_PATH. That's it. If other software using MKL works (pytorch or whatever), Neanderthal should work.

For the Neanderthal CUDA backend, you ONLY need CUDA properly installed by Nvidia. If other CUDA-based software works, Neanderthal should too (assuming you're not using some 3rd-party package system such as Anaconda that sets up its own local CUDA, etc.).

For OpenCL it's similar...

Basically, there should not be any specific requirement by Neanderthal et al. other than having vanilla installations of these technologies as prescribed by their vendors, or simpler.

I would definitely recommend either following the setup recommended in Getting Started until you understand these moving parts, or at least following @jsa-aerial's Saite setup, which seems to help in this regard.

jsa-aerial commented 2 years ago

I would advocate a Docker solution as the "out-of-the-box" and quickest route to trying deep-diamond (at least for Linux users with Docker...).

Frankly, if you want something that "just works automatically out of the box", aerosaite is the quickest and easiest route. Certainly for Linux users this is pretty much guaranteed to work. For CPU.

I think you are being naive about putting something together for automatic GPU use. There you are up against all the issues about getting the GPU usable completely aside from Neanderthal/DeepDiamond. There are just way too many variations, requirements and dependencies.

blueberry commented 2 years ago

... and, of course, for GPU computing to work, you'd have to have recent vendor drivers installed properly. That, usually, is not automatic anyway.

jsa-aerial commented 2 years ago

Hi @jsa-aerial that is really helpful. Maybe we can make this or some more focused standalone version of this an official recommendation for people that for some reason or another can't make the official vendor binaries work on their system?

That sounds like a reasonable/good idea. Suggestions on how to proceed?

blueberry commented 2 years ago

I'm not familiar with how saite works, so I don't know precisely, but is there a way to provide the basic MKL and/or CUDA distribution without other parts of saite and even without Neanderthal?

Anyway, it might be a good option for people who can't or don't want to follow my official guides to have the scripts you provide as an alternative, and if it works sufficiently predictably, we can link to your repository as an option from the getting started guide.

The only drawback I see is that it would make users read the guide even less, and it would appear more complicated. I'm specifically referring to this:

I would like to promote the usage of Neanderthal, but the installation of it (or rather of its dependencies MKL and CUDA/OpenCL) is a gigantic hurdle. I consider myself a very experienced Java/Clojure/Linux/Docker user, but I have no idea what I am doing here to get it to work.

Perhaps if I had written "the user has to copy these 7 .so files to folder X, add this folder to LD_LIBRARY_PATH, and restart the shell", it would have been simpler. Instead, I opted to write a more versatile guide with all the popular options, and impatient users get lost in the sea of choices...

behrica commented 2 years ago

I would advocate a Docker solution as the "out-of-the-box" and quickest route to trying deep-diamond (at least for Linux users with Docker...).

Frankly, if you want something that "just works automatically out of the box", aerosaite is the quickest and easiest route. Certainly for Linux users this is pretty much guaranteed to work. For CPU.

I think you are being naive about putting something together for automatic GPU use. There you are up against all the issues about getting the GPU usable completely aside from Neanderthal/DeepDiamond. There are just way too many variations, requirements and dependencies.

This could be. But is this even true when using Docker? Have you tried it? Or does Docker at least help? I am not convinced that it is "not possible" to make at least one single Docker image which just works most of the time. But yes, similar to aerosaite.

Or at least the Dockerfile could be "parameterized" (so not assuming one fixed image for every situation, but a template).

So the Dockerfile would at least be a base or template which a user can then modify, which is hopefully easier than installing from scratch.
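One hypothetical shape for such a template, building on the Dockerfile at the top of this issue — the `ARG` name and the idea of overriding the installer URL are assumptions, not an existing convention:

```dockerfile
# Sketch of a parameterized Dockerfile; the ARG is an assumption for illustration.
ARG MKL_INSTALLER_URL=https://registrationcenter-download.intel.com/akdlm/irc_nas/18721/l_onemkl_p_2022.1.0.223.sh

FROM clojure:lein-2.9.8-focal
ARG MKL_INSTALLER_URL
RUN apt-get update && apt-get -y install wget
RUN wget -O /tmp/mkl.sh "$MKL_INSTALLER_URL" \
 && sh /tmp/mkl.sh -a --silent --eula accept

# Users override the parameter at build time:
#   docker build --build-arg MKL_INSTALLER_URL=<other-installer-url> .
```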

jsa-aerial commented 2 years ago

I'm not familiar with how saite works, so I don't know precisely, but is there a way to provide the basic MKL and/or CUDA distribution without other parts of saite and even without Neanderthal?

For MKL, the links I quoted above satisfy this - they are just (g)zipped archives of the necessary shared libs for each platform. That's it. So, no Saite and no Neanderthal and no DeepDiamond.

For the reasons I mentioned above, I decided not to support GPU, because it depends on far more than the base platform just to get the GPU itself working for computation. Basically, in that case you are on your own for getting and installing the correct drivers and any other requirements.

Anyway, it might be a good option for people who can't or don't want to follow my official guides to have the scripts you provide as an alternative, and if it works sufficiently predictably, we can link to your repository as an option from the getting started guide.

That sounds fine - the scripts for Linux and (Intel) Mac have worked fine for several users - out of the box. If you are not using Saite, you would just need to grab the bits for running your stuff. These things are very small, as there is, in fact, very little that needs to be done.

The only drawback I see is that it would make users read the guides even less, and it would appear more complicated. I'm specifically referring to this:

Yes, that would be a drawback - any black box route will keep people from understanding what is really going on.

Perhaps if I had written: the user has to copy these 7 .so files to folder X, must add this folder to LD_LIBRARY_PATH, and must restart the shell, it would have been simpler. Instead, I opted to write a more versatile guide with all the popular options, and impatient users get lost in the sea of choices...

Maybe you can have a "TL;DR" section where you state this and then refer others to the details?

behrica commented 2 years ago

I would advocate a Docker solution as the "out-of-the-box" and quickest route to "try deep-diamond" (at least for Linux users with Docker...).

Frankly, if you want something that just works automatically "out of the box", aerosaite is the quickest and easiest route. Certainly for Linux users this is pretty much guaranteed to work. For CPU.

I think you are being naive about putting something together for automatic GPU use. There you are up against all the issues about getting the GPU usable completely aside from Neanderthal/DeepDiamond. There are just way too many variations, requirements and dependencies.

I have seen this in Python land: "pip install tensorflow-gpu" was working for me out of the box.

jsa-aerial commented 2 years ago

This could be. But is this even true when using Docker?

Of course it is true using Docker - Docker is not some magic thing that somehow automatically knows what type of GPU you have (vendor, model, version), how many, what the drivers are, and whether they are properly installed.

You'd have to have a Docker image for all the combinations.

Myself, I don't much like Docker, but I understand those who do...

behrica commented 2 years ago

This could be. But is this even true when using Docker?

Of course it is true using Docker - Docker is not some magic thing that somehow automatically knows what type of GPU you have (vendor, model, version), how many, what the drivers are, and whether they are properly installed.

Agreed, but I would hope that over time all vendors will produce "one driver" which works for all their GPUs.

Then we could have a parameterized Dockerfile which just takes the "vendor" as input. And from inside Docker I can even "read" what GPU I have, and make decisions accordingly about what to install.

So I still think that only a few people maintaining a Dockerfile would need to know all the nifty details, while the majority of users could just "use" the Dockerfile or image.

Similar to the JVM abstraction.
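The "read what GPU I have from inside Docker" step could be sketched as a small shell helper; the `lspci` matching patterns below are assumptions for illustration, not an exhaustive vendor list:

```shell
# Map one lspci line describing a display/3D device to a vendor name.
detect_gpu_vendor() {
  case "$1" in
    *NVIDIA*)                 echo nvidia ;;
    *"Advanced Micro"*|*AMD*) echo amd ;;
    *Intel*)                  echo intel ;;
    *)                        echo unknown ;;
  esac
}

# Inside the container you would feed it real output, e.g.:
#   detect_gpu_vendor "$(lspci | grep -i 'vga\|3d' | head -n1)"
detect_gpu_vendor "01:00.0 3D controller: NVIDIA Corporation GA100"   # -> nvidia
```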

blueberry commented 2 years ago

"pip install tensorflow-gpu"

As far as I know, pip install tensorflow-gpu does not install CUDA; it expects CUDA to be available on your system. Exactly like Neanderthal. The difference is that Neanderthal will throw an exception if you call the absent CUDA backend, while TensorFlow might automatically fall back to the default engine, whatever it is?

OTOH, conda does (AFAIK) install CUDA, but an internal one. I could do that, if you committed to a (hypothetical) proprietary environment such as conda. You would still have to make sure that the right GPU drivers are present.
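A hedged sketch of handling that exception from application code — probing for a working CUDA backend at startup and falling back to the CPU (MKL) engine otherwise. The ClojureCUDA namespace and `init` function used here are an assumption to verify against the version you depend on:

```clojure
;; Sketch: detect CUDA availability instead of letting a missing backend crash.
(def cuda-available?
  (try
    (require 'uncomplicate.clojurecuda.core)
    ((resolve 'uncomplicate.clojurecuda.core/init))
    true
    (catch Throwable _ false)))

(println (if cuda-available?
           "Using the CUDA backend."
           "CUDA unavailable; staying on the MKL (CPU) backend."))
```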

behrica commented 2 years ago

@blueberry One more confusing point in the instructions is the required (or workable) CUDA version.

From my experience it does not work, for example, to use CUDA 11.4 with "[org.jcuda/jcuda \"11.6.1\"]" (which we get by default).

I just had this case and got an ugly error:

/tmp/libJCudaDriver-11.6.1-linux-x86_64.so: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /tmp/libJCudaDriver-11.6.1-linux-x86_64.so)

Explicitly downgrading to "[org.jcuda/jcuda \"11.4.1\"]" solved it.

So it seems that the versions of the native libraries and the Clojure/Java dependencies need to match more precisely than the instructions suggest (at least from my understanding).

Again, the only more user-friendly way I can see to help users with this is Docker, which can be set up to "freeze" both the native libraries and deps.edn in a known state (at least for documentation purposes).
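Such a "frozen" pin could look roughly like this project.clj fragment — the idea of excluding the transitive JCuda and pinning the version that matches the system CUDA (11.4 here, per the downgrade above) is a sketch; check the actual coordinates with `lein deps :tree`:

```clojure
;; project.clj fragment (sketch): force the JCuda version matching the
;; CUDA toolkit installed in the image.
:dependencies [[uncomplicate/neanderthal "RELEASE"
                :exclusions [org.jcuda/jcuda]]
               [org.jcuda/jcuda "11.4.1"]]
```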

blueberry commented 2 years ago

Each version of Neanderthal's CUDA backend is tied to the CUDA version specified in its dependency on JCuda. So, for the latest version, it is 11.6. If it says 11.4 somewhere, that's because I forgot to update the docs.