uncomplicate / deep-diamond

A fast Clojure Tensor & Deep Learning library
https://aiprobook.com
Eclipse Public License 1.0

cannot get test working with factory "neanderthal" #15

Closed behrica closed 2 years ago

behrica commented 2 years ago

I cannot get some of this package's tests to pass; I am getting MKL errors.

Running for example:

(with-release [fact (neanderthal-factory)]
  (test-stochastic-gradient-descent-adam fact))

gives


1. Unhandled clojure.lang.ExceptionInfo
   MKL error.
   {:error-code -1140}
                   mkl.clj:  991  uncomplicate.neanderthal.internal.host.mkl.FloatVectorEngine/rand_normal
               factory.clj:  277  uncomplicate.diamond.internal.dnnl.factory.FloatTensorEngine/rand_normal
               factory.clj:  277  uncomplicate.diamond.internal.dnnl.factory.FloatTensorEngine/rand_normal
                random.clj:   35  uncomplicate.neanderthal.random/rand-normal!
                   dnn.clj:  314  uncomplicate.diamond.dnn/init!/fn/fn
              directed.clj:  469  uncomplicate.diamond.internal.neanderthal.directed.AdamLayer/init
               network.clj:  160  uncomplicate.diamond.internal.network.SequentialNetworkTraining/init
                   dnn.clj:  314  uncomplicate.diamond.dnn/init!/fn

I followed the instructions to include "org.bytedeco/mkl-platform-redist" in project.clj:

:dependencies [[org.clojure/clojure "1.11.1"]
                 [uncomplicate/neanderthal "0.44.0"]

                 [org.bytedeco/dnnl-platform "2.5.2-1.5.7"]
                 [org.bytedeco/mkl-platform-redist "2022.0-1.5.7"]
                 ;; [org.bytedeco/mkl-platform-redist "2020.3-1.5.4"]
                 ;; [org.bytedeco/mkl-platform-redist "2021.3-1.5.6"]
                 [org.jcuda/jcudnn "11.6.1"]]

but it fails with all three versions, always with the same error message.

Some tests pass, but some do not.

blueberry commented 2 years ago

Hmmm. I can't reproduce this. Does it also fail with test/.../internal/dnnl/directed-test namespace, that is, with the DNNL engine?

How about Neanderthal tests? Does Neanderthal itself work on your machine with this setup, outside the context of Deep Diamond?

Did it work in earlier versions?

Can you share some details of your hardware/software platform?

behrica commented 2 years ago

I can run the hello-world project in /example in Neanderthal successfully with native, CUDA, and OpenCL.

Using the same dependencies in deep-diamond makes some tests fail, but some pass.

I can, for example, run some of the dnn_test.clj tests successfully:

(with-release [fact (cudnn-factory)]
    (test-activation-relu fact))

(with-release [fact (neanderthal-factory)]
    (test-activation-relu fact))
behrica commented 2 years ago

I am on Arch Linux,

openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment (build 11.0.15+10)
OpenJDK 64-Bit Server VM (build 11.0.15+10, mixed mode)
behrica commented 2 years ago

In the directed-test, the first few pass:

(with-release [fact (neanderthal-factory)]
  (test-sum fact)
  (test-activation-relu fact)
  (test-activation-sigmoid fact)
  (test-inner-product-training fact)
  (test-fully-connected-inference fact)
  (test-fully-connected-transfer fact)
  (test-fully-connected-training fact)
  (test-fully-connected-training-adam fact)
  (test-fully-connected-layer-1 fact)
  (test-fully-connected-layer-2 fact)
  (test-sequential-network-linear fact)
  (test-sequential-network-detailed fact)
  (test-sequential-network-batched fact)
  (test-quadratic-cost fact)
  (test-sequential-network-sigmoid-sgd fact)
  (test-sequential-network-sigmoid-adam fact)
  (test-gradient-descent fact)
  (test-stochastic-gradient-descent-sgd fact)
  (test-stochastic-gradient-descent-adam fact))

The first failing test is (test-sequential-network-linear fact)

If I run them one by one, five or so fail.

behrica commented 2 years ago

I had a system-wide installation of MKL, which I just removed. This changes the error. Now I can't get native-factory working at all.

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/tmp/libneanderthal-mkl-0.33.014592370675494885641.so: libmkl_rt.so: cannot open shared object file: No such file or directory
ERROR: Unhandled REPL handler exception processing message {:op stacktrace, :nrepl.middleware.print/stream? 1, :nrepl.middleware.print/print cider.nrepl.pprint/pprint, :nrepl.middleware.print/quota 10000, :nrepl.middleware.print/buffer-size 4096, :nrepl.middleware.print/options {:right-margin 80}, :session 6c986f5f-9b91-49c6-b629-4f538f5a131e, :id 12}
java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: Could not initialize class uncomplic

I thought that I don't need any MKL installation when using these dependencies:

:dependencies [[org.clojure/clojure "1.11.1"]
                 [uncomplicate/neanderthal "0.44.0"]

                 [org.bytedeco/dnnl-platform "2.5.2-1.5.7"]
                 [org.bytedeco/mkl-platform-redist "2022.0-1.5.7"]
                 ;; [org.bytedeco/mkl-platform-redist "2020.3-1.5.4"]
                 ;; [org.bytedeco/mkl-platform-redist "2021.3-1.5.6"]
                 [org.jcuda/jcudnn "11.6.1"]]

Is this not correct?

behrica commented 2 years ago

Without the system-wide MKL, the Neanderthal hello-world does not work any more either:

Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/tmp/libneanderthal-mkl-0.33.03443062959179023093.so: libmkl_rt.so: cannot open shared object file: No such file or directory
behrica commented 2 years ago

And indeed, it does not find the dynamic library:

ldd libneanderthal-mkl-0.33.03443062959179023093.so
ldd: warning: you do not have execution permission for `./libneanderthal-mkl-0.33.03443062959179023093.so'
    linux-vdso.so.1 (0x00007fff82635000)
    libmkl_rt.so => not found
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f6b8b7b2000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f6b8b7ad000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f6b8b400000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007f6b8b6c6000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f6b8b6a6000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007f6b8b000000)
    /usr/lib64/ld-linux-x86-64.so.2 (0x00007f6b8b808000)
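
As a minimal sketch of the check above, the unresolved entries can be filtered out of ldd output with a small shell function. The name missing_libs is hypothetical, and the path in the usage comment is a placeholder rather than the real temporary file name.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd-style output on stdin, so it works on a pipe or on a saved log.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Usage (placeholder path, not the actual extracted file):
#   ldd /path/to/libneanderthal-mkl.so | missing_libs
```

An empty result would mean that the dynamic linker can resolve every dependency of the binary.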
behrica commented 2 years ago

Could it be that the issue is

libmkl_rt.so vs libmkl_rt.so.2?

libneanderthal-mkl-0.33.03443062959179023093.so links to the former, while the MKL jar contains the latter?

jar tf mkl/2022.0-1.5.7/mkl-2022.0-1.5.7-linux-x86_64-redist.jar | grep libmkl_rt
org/bytedeco/mkl/linux-x86_64/libmkl_rt.so.2
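
One way to compare the two names, sketched under the assumption that the listing is piped in from jar tf: the hypothetical helper jar_has_lib below succeeds only when some entry's file name matches the requested library name exactly, so libmkl_rt.so.2 does not count as libmkl_rt.so.

```shell
# Succeed (exit 0) if a jar listing read on stdin contains an entry whose
# file name, i.e. the last path component, is exactly the name in $1.
jar_has_lib() {
  awk -F/ -v want="$1" '$NF == want { found = 1 } END { exit !found }'
}

# Usage (the jar path is a placeholder):
#   jar tf /path/to/mkl-redist.jar | jar_has_lib libmkl_rt.so && echo present
```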
behrica commented 2 years ago

Seems to be similar to https://github.com/uncomplicate/neanderthal/issues/119 (but that is for Windows)

behrica commented 2 years ago

I tried to go back to older deep-diamond versions, but in none of them did the tests run successfully (via 'lein test'). I found one configuration which exposed the same error as above, even without a global MKL installation.

 MKL error.
   {:error-code -1140}
                   mkl.clj:  991 

Not sure this helps:

(defproject uncomplicate/deep-diamond "0.16.0-alpha"
  :description "Fast Clojure Deep Learning Library"
  :author "Dragan Djuric"
  :url "http://github.com/uncomplicate/deep-diamond"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [uncomplicate/neanderthal "0.38.0"]
                 [org.bytedeco/dnnl-platform "1.6.2-1.5.4"]
                 [org.jcuda/jcudnn "11.0.0"]
                 ;; https://mvnrepository.com/artifact/org.bytedeco/mkl-platform-redist
                 [org.bytedeco/mkl-platform-redist "2020.3-1.5.4"]]

  :profiles {:dev {:plugins [[lein-midje "3.2.1"]
                             [lein-codox "0.10.6"]]
                   :resource-paths ["data"]
                   :global-vars {*warn-on-reflection* true
                                 *assert* false
                                 *unchecked-math* :warn-on-boxed
                                 *print-length* 128}
                   :dependencies [[midje "1.9.9"]
                                  [org.clojure/data.csv "1.0.0"]]}}

  :repositories [["snapshots" {:url "https://oss.sonatype.org/content/repositories/snapshots/"
                               :snapshots true :sign-releases false :checksum :warn :update :daily}]]

  :codox {:metadata {:doc/format :markdown}
          :src-dir-uri "http://github.com/uncomplicate/deep-diamond/blob/master/"
          :src-linenum-anchor-prefix "L"
          :output-path "docs/codox"}

  :jvm-opts ^:replace ["-Dclojure.compiler.direct-linking=true" "-XX:+UseLargePages"
                       "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED"]

  :javac-options ["-target" "1.8" "-source" "1.8" "-Xlint:-options"]
  :source-paths ["src/clojure" "src/device"])
blueberry commented 2 years ago

I tried to go back to older deep-diamond versions, but in none of them did the tests run successfully (via 'lein test')

Could that be

Could it be that the issue is

libmkl_rt.so vs libmkl_rt.so.2?

libneanderthal-mkl-0.33.03443062959179023093.so links to the former, while the MKL jar contains the latter?

jar tf mkl/2022.0-1.5.7/mkl-2022.0-1.5.7-linux-x86_64-redist.jar | grep libmkl_rt
org/bytedeco/mkl/linux-x86_64/libmkl_rt.so.2

It probably is. Thank you for discovering this. Can you go back to the global MKL installation for the time being? The system-wide intel-mkl package distributes MKL 2020.4 with the libmkl_rt.so file. Does everything work in that configuration (do not load bytedeco)?

I am not really sure how to approach solving this since, AFAIK, the Neanderthal binary has to be built with the specific dependency statically (be it libmkl_rt.so, libmkl_rt.so.1, or libmkl_rt.so.2). I don't know why Intel introduced all these successive versions either. Any suggestions are highly welcome.

blueberry commented 2 years ago

In the worst case, I can build a Neanderthal binary for the new oneAPI versions, and distribute it as an alternative dependency that the user can choose in project.clj.

behrica commented 2 years ago

In this setup, I am back to the initial error:

:cause "MKL error.", :data {:error-code -1140}, :phase :execution}}

behrica commented 2 years ago

I would suggest creating a Dockerfile in some form.

Either a working one, or a non-working one to debug.

behrica commented 2 years ago

I have never used deep-diamond before, so I don't know if there is a fundamental issue with Arch Linux.

behrica commented 2 years ago

The following Dockerfile reproduces an error without a system MKL, but with a dependency on bytedeco 2022.0-1.5.7:

# failing with
# Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
#/tmp/libneanderthal-mkl-0.33.07653633467081296505.so: libmkl_rt.so: cannot open shared object file: No such file or directory

FROM clojure:openjdk-11-lein-slim-bullseye
RUN apt-get update && apt-get -y install git
RUN git clone https://github.com/uncomplicate/deep-diamond.git
WORKDIR /tmp/deep-diamond
RUN git checkout eb3051031281ca55b09663a1375ce4e57a5f6bf1
RUN lein update-in :dependencies conj "[org.bytedeco/mkl-platform-redist \"2022.0-1.5.7\"]" -- test uncomplicate.diamond.internal.neanderthal.directed-test
blueberry commented 2 years ago

I have never used deep-diamond before, so I don't know if there is a fundamental issue with Arch Linux

There shouldn't be, since I develop on Arch Linux too.

Could you please clone the Neanderthal repository and run the Neanderthal test suite with lein midje (lein test should work too, but just in case)? This is a Neanderthal/MKL issue rather than a deep-diamond one. Based on your test failure, it seems to me that Neanderthal's rand-normal fails on your machine for some reason, but I would need to see the results of the Neanderthal test suite to know where to look. Once Neanderthal is fully working for you, I believe this error that arises in deep-diamond will disappear.

Is there a specific reason you're using OpenJDK 11 instead of a more recent one? I develop and test this on Java 18 (on Arch Linux).

Please note this (solved) issue related to upgrade to Java 16: https://github.com/uncomplicate/neanderthal/issues/115

blueberry commented 2 years ago

I found the culprit. Neanderthal and MKL per se seem to be working as intended, but for some reason the ARS5 stream is not supported on your machine.

The -1140 error code in MKL is this one:

/* ARS5 stream related errors */
#define VSL_RNG_ERROR_ARS5_NOT_SUPPORTED        -1140

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-vsnotes/top/testing-of-basic-random-number-generators/brng-properties-and-testing-results/ars5.html

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/statistical-functions/random-number-generators/basic-generators.html

says: "ARS-5 counter-based pseudorandom number generator with a period of 2^128, which uses instructions from the AES-NI set ARS5".

https://en.wikipedia.org/wiki/AES_instruction_set

This instruction set was added to x86 by Intel and AMD in 2008. There is a list of the supported architectures in the aforementioned wikipedia article. Can you please check where your CPU stands? Recent-ish (10 yrs or so) processors should support it, but if you're running this on an older one that might be the problem.

behrica commented 2 years ago

I am not a CPU expert, but I think that I have an "i5", which should be supported according to the article:

processor   : 4
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
stepping    : 10
microcode   : 0xaa
cpu MHz     : 2900.000
cache size  : 9216 KB
physical id : 0
siblings    : 6
core id     : 4
cpu cores   : 6
apicid      : 8
initial apicid  : 8
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
vmx flags   : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple shadow_vmcs pml ept_mode_based_exec
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 5802.42
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor   : 5
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
stepping    : 10
microcode   : 0xaa
cpu MHz     : 2900.000
cache size  : 9216 KB
physical id : 0
siblings    : 6
core id     : 5
cpu cores   : 6
apicid      : 10
initial apicid  : 10
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
vmx flags   : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple shadow_vmcs pml ept_mode_based_exec
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 5802.42
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
blueberry commented 2 years ago

i5 is only a category. The processor generation is indicated by the 9 in 9400; this is a fairly recent processor that supports AES-NI.

https://www.techpowerup.com/cpu-specs/core-i5-9400f.c2145

Now, the first step is to run Neanderthal tests, so we can be sure whether only the random number generation fails, in which case we would have to find out why that feature fails in MKL. The error suggests that MKL determines that your processor doesn't support AES-NI (it might have been disabled in the OS perhaps?). But it might be that some other things fail, which would suggest that something in Neanderthal is the problem. Or it might all pass, which would suggest that deep diamond has some weird interaction with that feature, although it doesn't seem that probable from what I've seen.

blueberry commented 2 years ago

Please read this:

https://www.cyberciti.biz/faq/how-to-find-out-aes-ni-advanced-encryption-enabled-on-linux-system/

It seems that some vendors ship their computers with AES-NI disabled in BIOS. That might be worth checking!

behrica commented 2 years ago

The Neanderthal tests fail with the same error.

Actual result:
clojure.lang.ExceptionInfo: MKL error. {:error-code -1140}
  uncomplicate.neanderthal.internal.host.mkl.FloatGEEngine.rand_normal(mkl.clj:1501)
  uncomplicate.neanderthal.random$rand_normal_BANG_.invokeStatic(random.clj:35)
  uncomplicate.neanderthal.random_test$test_ge_rand_normal$fn__53386$fn__53387$fn__53392.invoke(random_test.clj:93)
  uncomplicate.neanderthal.random_test$test_ge_rand_normal$fn__53386$fn__53387.invoke(random_test.clj:104)
  uncomplicate.neanderthal.random_test$test_ge_rand_normal$fn__53386.invoke(random_test.clj:93)
  uncomplicate.neanderthal.random_test$test_ge_rand_normal.invokeStatic(random_test.clj:92)
  uncomplicate.neanderthal.random_test$test_all.invokeStatic(random_test.clj:140)
  uncomplicate.neanderthal.mkl_test$eval56566.invokeStatic(mkl_test.clj:107)
  uncomplicate.neanderthal.mkl_test$eval56566.invoke(mkl_test.clj:107)
Checking function: (roughly -100 0.03)

lein test user

Should we close this here, and I open an issue in Neanderthal?

behrica commented 2 years ago

Please read this:

https://www.cyberciti.biz/faq/how-to-find-out-aes-ni-advanced-encryption-enabled-on-linux-system/

It seems that some vendors ship their computers with AES-NI disabled in BIOS. That might be worth checking!

Indeed, it is missing:

grep -m1 -o aes /proc/cpuinfo

-> empty
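
The same check can be wrapped in a small function, as a sketch. has_aes_flag is a hypothetical name; it reads /proc/cpuinfo by default, which assumes Linux on x86, and it compares whole flag names so that other flags containing the letters "aes" are not counted.

```shell
# Print "yes" if the aes flag is listed on a "flags" line of a cpuinfo-style
# file, otherwise "no". The file argument defaults to /proc/cpuinfo.
has_aes_flag() {
  cpuinfo="${1:-/proc/cpuinfo}"
  # Compare each whitespace-separated field exactly, so only "aes" matches.
  if awk '/^flags/ { for (i = 1; i <= NF; i++) if ($i == "aes") found = 1 }
          END { exit !found }' "$cpuinfo"; then
    echo yes
  else
    echo no
  fi
}

# Usage:
#   has_aes_flag                 # inspects /proc/cpuinfo
#   has_aes_flag saved-cpuinfo   # inspects a saved copy (placeholder name)
```

A "no" on a processor that is documented to support AES-NI would be consistent with the feature being disabled in firmware, as suggested above.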
blueberry commented 2 years ago

I believe that is a good idea. Please also link this discussion there.

According to https://en.wikipedia.org/wiki/List_of_Intel_Core_i5_processors your processor does support AES-NI. My hunch is that currently it is disabled in BIOS (according to the internet, it might require BIOS update for some vendors).

behrica commented 2 years ago

It's not even a "bug" anywhere, so we can just as well close it here and keep it for reference.

behrica commented 2 years ago

I reran the tests on a VM where "grep -m1 -o aes /proc/cpuinfo" does return "aes", and all tests pass. So it is indeed an issue with "my PC".