uncomplicate / neanderthal

Fast Clojure Matrix Library
http://neanderthal.uncomplicate.org
Eclipse Public License 1.0

OpenCL 1.2 Error in 0.17.0 #33

Closed neolee closed 6 years ago

neolee commented 6 years ago

Hi here I am, again :)

After upgrading to 0.17.0, the following test code no longer compiles (it ran perfectly on 0.16.1):

(with-default-1
  (with-default-engine
    (with-release [gpu-x (clv (range 100000))
                   gpu-y (copy gpu-x)]
      (dot gpu-x gpu-y))))

The compile error message is in the attached file. I'm using my MacBook Pro with:

OS: macOS 10.13.1
Java: Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Leiningen: 2.8.1

error.txt

blueberry commented 6 years ago

It seems that your Mac supports an older version of OpenCL than the kernels require, or that the Mac's OpenCL implementation behaves slightly differently from what the OpenCL standard describes. It complains about the modf function, which should be there. To see how to solve this, I'd need a bit more information about the GPU hardware in your machine.

Can you please post the complete terminal output of clinfo?

blueberry commented 6 years ago

I would also make sure to upgrade Xcode (if you haven't already), since OpenCL support on Macs seems to depend a lot on clang and other parts of Apple's framework.

blueberry commented 6 years ago

@neolee do you still have this issue?

neolee commented 6 years ago

Yes, I still have this issue, and my macOS and Xcode are both the newest versions. Attached is the output of clinfo. It seems there is some incompatible library on my system, but I cannot figure out which one yet.

BTW, everything works well under 0.16.1.

clinfo.txt

blueberry commented 6 years ago

What caught my eye in this output is the note at the end of the file. It complains that, while some of your 3 devices support OpenCL 1.2, the OpenCL library is 1.0. I know how this OpenCL library is controlled on Linux, but my old MacBook Air does not support OpenCL at all, so I cannot experiment with this, nor can I check whether this is normal on a Mac or is an issue.

Anyway, back to the problem of 0.16.1 working and 0.17.0 failing: 0.17.0 comes with lots of new math kernels. One (or more) of those cannot be compiled on your platform. I'm not sure why, since I checked the documentation, and all of them should be available even in OpenCL 1.1.

In general, it should be easy for me to fix once we identify what exactly the Mac's C compiler (clang) expects to be different from the gcc that is used on Linux. Since I don't have a machine where I can try this, I need your help with identifying it.

Here's what I need you to try:

  1. Clone the Neanderthal code.
  2. Open the /src/opencl/uncomplicate/neanderthal/internal/device/vect-math.cl file, find all calls to the modf function, and replace each of them with a dummy assignment. Do not delete the whole kernel; just replace the code inside the function with something like this (adapted so that it makes sense given the variable names): z[offset_z + get_global_id(0) * stride_z] = 3.3
  3. Do not run the tests; those require OpenCL 2.0.
  4. Just compile and install Neanderthal in your local Maven repository by running lein install.
  5. Include the changed snapshot version in your project and see whether the issue persists (it should not), and whether some new issue appears.
  6. Post the detailed output here, so I can get an idea of what the issue on the Mac might be.

blueberry commented 6 years ago

BTW, does this issue happen on all devices, or just on some of them? You can control this by manually providing the platform and context instead of calling with-default-1. It is not difficult, and the example is already here: https://github.com/uncomplicate/neanderthal/blob/master/examples/hello-world/src/hello_world/opencl1.clj
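
For readers following along, here is a rough sketch of what selecting the device manually might look like, applied to the test code from the first comment. It is modeled on the linked hello-world example as best recalled, not copied from it, so treat the namespace layout (in particular `uncomplicate.clojurecl.legacy` providing `command-queue-1`) and the `:gpu` device filter as assumptions to verify against the ClojureCL version you actually have installed.

```clojure
(ns issue33.per-device
  ;; Hypothetical namespace name, used only for this sketch.
  (:require [uncomplicate.commons.core :refer [with-release]]
            [uncomplicate.clojurecl
             [core :refer [with-platform platforms with-context context
                           with-queue devices]]
             ;; Assumption: OpenCL 1.x queue helpers live in the legacy namespace.
             [legacy :refer [command-queue-1]]]
            [uncomplicate.neanderthal
             [core :refer [dot copy]]
             [opencl :refer [clv with-default-engine]]]))

;; Run the same dot product on each GPU device of the first platform,
;; one device at a time, instead of letting with-default-1 pick one.
(with-platform (first (platforms))
  (doseq [[i dev] (map-indexed vector (devices :gpu))]
    (with-context (context [dev])
      (with-queue (command-queue-1 dev)
        (with-default-engine
          (with-release [gpu-x (clv (range 100000))
                         gpu-y (copy gpu-x)]
            (println "GPU device" i "dot:" (dot gpu-x gpu-y))))))))
```

If one device fails to compile the kernels or crashes the JVM, the loop stops there, so it may be more practical to pick a single device with `nth` and rerun. Swapping `:gpu` for `:cpu` should exercise the CPU device in the same way.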

neolee commented 6 years ago

Got it. I'm kind of busy this week, so I may try it this weekend.

neolee commented 6 years ago

Hi @blueberry ,

I've tried what you requested, and while loading my test code the nREPL server quit with a fatal error, as before:

nREPL server started on port 4555 on host 127.0.0.1 - nrepl://127.0.0.1:4555
REPL-y 0.3.7, nREPL 0.2.12
Clojure 1.8.0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_152-b16
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
 Results: Stored in vars *1, *2, *3, an exception in *e

user=> #
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001251cf812, pid=87089, tid=0x0000000000009243
#
# JRE version: Java(TM) SE Runtime Environment (8.0_152-b16) (build 1.8.0_152-b16)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.152-b16 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libJOCL_2_0_0-apple-x86_64.dylib+0x17812]  deleteCallbackInfo(JNIEnv_*, CallbackInfo*&)+0x42
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/neo/Code/Repo/learn-clojure/hs_err_pid87089.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Exception in thread "Thread-1" clojure.lang.ExceptionInfo: Subprocess failed {:exit-code 134}
    at clojure.core$ex_info.invokeStatic(core.clj:4617)
    at clojure.core$ex_info.invoke(core.clj:4617)
    at leiningen.core.eval$fn__4134.invokeStatic(eval.clj:264)
    at leiningen.core.eval$fn__4134.invoke(eval.clj:260)
    at clojure.lang.MultiFn.invoke(MultiFn.java:233)
    at leiningen.core.eval$eval_in_project.invokeStatic(eval.clj:366)
    at leiningen.core.eval$eval_in_project.invoke(eval.clj:356)
    at leiningen.repl$server$fn__5864.invoke(repl.clj:244)
    at clojure.lang.AFn.applyToHelper(AFn.java:152)
    at clojure.lang.AFn.applyTo(AFn.java:144)
    at clojure.core$apply.invokeStatic(core.clj:646)
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881)
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at clojure.lang.AFn.applyToHelper(AFn.java:156)
    at clojure.lang.RestFn.applyTo(RestFn.java:132)
    at clojure.core$apply.invokeStatic(core.clj:650)
    at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911)
    at clojure.lang.RestFn.invoke(RestFn.java:397)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:748)

I have also attached the modified vect-math.cl file for your information.

vect-math.cl.zip

kenfehling commented 6 years ago

The same modf error happens to me on Ubuntu.

Kernel: 4.10.0-40-generic 16.04.1-Ubuntu SMP (x86_64)

Leiningen 2.7.1 on Java 1.8.0_151 Java HotSpot(TM) 64-Bit Server VM Java(TM) SE Runtime Environment (build 1.8.0_151-b12)

blueberry commented 6 years ago

@kenfehling What hardware and drivers do you have? Can you post the output of clinfo here?

kenfehling commented 6 years ago

OK sure. I've attached my clinfo

blueberry commented 6 years ago

Just a quick update: I've managed to reproduce this problem. The solution will follow soon in the source repository, and will be included in the next release of neanderthal.

blueberry commented 6 years ago

This fix has been released to Clojars. Please try 0.17.2 and report here whether this issue has been solved.

neolee commented 6 years ago

Thanks @blueberry, it works now!