hughperkins / cltorch

An OpenCL backend for torch.

Segfault on loading cltorch #5

Closed mlajtos closed 9 years ago

mlajtos commented 9 years ago

First, thank you for fixing the build for OS X.

Now, loading cltorch on mac fails with segfault:

$ th -l cltorch
Segmentation fault: 11

Don't know how to get the stack trace.

hughperkins commented 9 years ago

Hmmm. If it segfaults on load, that's pretty early. Not much has happened by that point. Can you confirm the output from clinfo and gpuinfo please? clinfo should come with your OpenCL system. As for gpuinfo, on my system it's installed in ~/torch/install/bin, so you should just be able to run it.

mlajtos commented 9 years ago

I didn't have clinfo, so I used the one from http://graphics.stanford.edu/~yoel/notes/clInfo.c. gpuinfo is also missing, and I can't find any suitable replacement.

Output of clinfo:

Found 1 platform(s).
platform[0x7fff0000]: profile: FULL_PROFILE
platform[0x7fff0000]: version: OpenCL 1.2 (Feb 27 2015 01:29:10)
platform[0x7fff0000]: name: Apple
platform[0x7fff0000]: vendor: Apple
platform[0x7fff0000]: Found 2 device(s).

Device number  1
    device[0xffffffff]: NAME: Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
    device[0xffffffff]: VENDOR: Intel
    device[0xffffffff]: PROFILE: FULL_PROFILE
    device[0xffffffff]: VERSION: OpenCL 1.2 
    device[0xffffffff]: DRIVER_VERSION: 1.1

    device[0xffffffff]: Type: CPU 
    device[0xffffffff]: EXECUTION_CAPABILITIES: Kernel Native 
    device[0xffffffff]: GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
    device[0xffffffff]: CL_DEVICE_LOCAL_MEM_TYPE: Global (2)
    device[0xffffffff]: SINGLE_FP_CONFIG: 0xbf
    device[0xffffffff]: QUEUE_PROPERTIES: 0x2

    device[0xffffffff]: VENDOR_ID: 4294967295
    device[0xffffffff]: MAX_COMPUTE_UNITS: 4
    device[0xffffffff]: MAX_WORK_ITEM_DIMENSIONS: 3
    device[0xffffffff]: MAX_WORK_GROUP_SIZE: 1024
    device[0xffffffff]: PREFERRED_VECTOR_WIDTH_CHAR: 16
    device[0xffffffff]: PREFERRED_VECTOR_WIDTH_SHORT: 8
    device[0xffffffff]: PREFERRED_VECTOR_WIDTH_INT: 4
    device[0xffffffff]: PREFERRED_VECTOR_WIDTH_LONG: 2
    device[0xffffffff]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
    device[0xffffffff]: PREFERRED_VECTOR_WIDTH_DOUBLE: 2
    device[0xffffffff]: MAX_CLOCK_FREQUENCY: 1800
    device[0xffffffff]: ADDRESS_BITS: 64
    device[0xffffffff]: MAX_MEM_ALLOC_SIZE: 1073741824
    device[0xffffffff]: IMAGE_SUPPORT: 1
    device[0xffffffff]: MAX_READ_IMAGE_ARGS: 128
    device[0xffffffff]: MAX_WRITE_IMAGE_ARGS: 8
    device[0xffffffff]: IMAGE2D_MAX_WIDTH: 8192
    device[0xffffffff]: IMAGE2D_MAX_HEIGHT: 8192
    device[0xffffffff]: IMAGE3D_MAX_WIDTH: 2048
    device[0xffffffff]: IMAGE3D_MAX_HEIGHT: 2048
    device[0xffffffff]: IMAGE3D_MAX_DEPTH: 2048
    device[0xffffffff]: MAX_SAMPLERS: 16
    device[0xffffffff]: MAX_PARAMETER_SIZE: 4096
    device[0xffffffff]: MEM_BASE_ADDR_ALIGN: 1024
    device[0xffffffff]: MIN_DATA_TYPE_ALIGN_SIZE: 128
    device[0xffffffff]: GLOBAL_MEM_CACHELINE_SIZE: 3145728
    device[0xffffffff]: GLOBAL_MEM_CACHE_SIZE: 64
    device[0xffffffff]: GLOBAL_MEM_SIZE: 4294967296
    device[0xffffffff]: MAX_CONSTANT_BUFFER_SIZE: 65536
    device[0xffffffff]: MAX_CONSTANT_ARGS: 8
    device[0xffffffff]: LOCAL_MEM_SIZE: 32768
    device[0xffffffff]: ERROR_CORRECTION_SUPPORT: 0
    device[0xffffffff]: PROFILING_TIMER_RESOLUTION: 1
    device[0xffffffff]: ENDIAN_LITTLE: 1
    device[0xffffffff]: AVAILABLE: 1
    device[0xffffffff]: COMPILER_AVAILABLE: 1

Device number  2
    device[0x1024400]: NAME: HD Graphics 4000
    device[0x1024400]: VENDOR: Intel
    device[0x1024400]: PROFILE: FULL_PROFILE
    device[0x1024400]: VERSION: OpenCL 1.2 
    device[0x1024400]: DRIVER_VERSION: 1.2(Mar 27 2015 01:47:22)

    device[0x1024400]: Type: GPU 
    device[0x1024400]: EXECUTION_CAPABILITIES: Kernel 
    device[0x1024400]: GLOBAL_MEM_CACHE_TYPE: None (0)
    device[0x1024400]: CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
    device[0x1024400]: SINGLE_FP_CONFIG: 0xbe
    device[0x1024400]: QUEUE_PROPERTIES: 0x2

    device[0x1024400]: VENDOR_ID: 16925696
    device[0x1024400]: MAX_COMPUTE_UNITS: 16
    device[0x1024400]: MAX_WORK_ITEM_DIMENSIONS: 3
    device[0x1024400]: MAX_WORK_GROUP_SIZE: 512
    device[0x1024400]: PREFERRED_VECTOR_WIDTH_CHAR: 1
    device[0x1024400]: PREFERRED_VECTOR_WIDTH_SHORT: 1
    device[0x1024400]: PREFERRED_VECTOR_WIDTH_INT: 1
    device[0x1024400]: PREFERRED_VECTOR_WIDTH_LONG: 1
    device[0x1024400]: PREFERRED_VECTOR_WIDTH_FLOAT: 1
    device[0x1024400]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
    device[0x1024400]: MAX_CLOCK_FREQUENCY: 1150
    device[0x1024400]: ADDRESS_BITS: 64
    device[0x1024400]: MAX_MEM_ALLOC_SIZE: 268435456
    device[0x1024400]: IMAGE_SUPPORT: 1
    device[0x1024400]: MAX_READ_IMAGE_ARGS: 128
    device[0x1024400]: MAX_WRITE_IMAGE_ARGS: 8
    device[0x1024400]: IMAGE2D_MAX_WIDTH: 16384
    device[0x1024400]: IMAGE2D_MAX_HEIGHT: 16384
    device[0x1024400]: IMAGE3D_MAX_WIDTH: 2048
    device[0x1024400]: IMAGE3D_MAX_HEIGHT: 2048
    device[0x1024400]: IMAGE3D_MAX_DEPTH: 2048
    device[0x1024400]: MAX_SAMPLERS: 16
    device[0x1024400]: MAX_PARAMETER_SIZE: 1024
    device[0x1024400]: MEM_BASE_ADDR_ALIGN: 1024
    device[0x1024400]: MIN_DATA_TYPE_ALIGN_SIZE: 128
    device[0x1024400]: GLOBAL_MEM_CACHELINE_SIZE: 0
    device[0x1024400]: GLOBAL_MEM_CACHE_SIZE: 0
    device[0x1024400]: GLOBAL_MEM_SIZE: 1073741824
    device[0x1024400]: MAX_CONSTANT_BUFFER_SIZE: 65536
    device[0x1024400]: MAX_CONSTANT_ARGS: 8
    device[0x1024400]: LOCAL_MEM_SIZE: 65536
    device[0x1024400]: ERROR_CORRECTION_SUPPORT: 0
    device[0x1024400]: PROFILING_TIMER_RESOLUTION: 80
    device[0x1024400]: ENDIAN_LITTLE: 1
    device[0x1024400]: AVAILABLE: 1
    device[0x1024400]: COMPILER_AVAILABLE: 1
hughperkins commented 9 years ago

Hmmm, looks pretty convincing...

hughperkins commented 9 years ago

I've created a debug version that prints out some small debugging statements. Can you try installing and running it please? Something like:

git clone --recursive https://github.com/hughperkins/cltorch.git -b debug cltorch-debug
cd cltorch-debug
luarocks make rocks/cltorch-scm-1.rockspec
th -l cltorch

On my computer, I get the output:

user@pear:~/git/cltorch$ th -l cltorch
init.lua START
init.lua loading libcltorch... 
luaopen_libcltorch START
luaopen_libcltorch storage init
luaopen_libcltorch tensor init
luaopen_libcltorch tensormath init
luaopen_libcltorch tensoroperator init
luaopen_libcltorch END
...  libcltorch loaded
loading Tensor.lua...
loading Random.lua...
init.lua END

Depending on where it crashes, more or less of this will be missing.

mlajtos commented 9 years ago

Well, pretty early:

$ th -l cltorch
init.lua START
init.lua loading libcltorch... 
luaopen_libcltorch START
Segmentation fault: 11

Is there any way to control the execution flow, i.e. to debug Lua scripts?

szagoruyko commented 9 years ago

you can do

gdb luajit
run
require 'cltorch'
backtrace
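
On OS X, where gdb is often not available, an lldb session along the same lines should work (untested sketch):

lldb luajit
run
require 'cltorch'
bt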
mlajtos commented 9 years ago

@szagoruyko, thank you.

th> require 'cltorch'
init.lua START
init.lua loading libcltorch... 
luaopen_libcltorch START
Process 60809 stopped
* thread #1: tid = 0x203dbe, 0x0000000002311068 libEasyCL.dylib`lua_createtable(L=0x0000000000079378, narray=0, nrec=0) + 24 at lapi.c:580, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000002311068 libEasyCL.dylib`lua_createtable(L=0x0000000000079378, narray=0, nrec=0) + 24 at lapi.c:580
   577  
   578  LUA_API void lua_createtable (lua_State *L, int narray, int nrec) {
   579    lua_lock(L);
-> 580    luaC_checkGC(L);
   581    sethvalue(L, L->top, luaH_new(L, narray, nrec));
   582    api_incr_top(L);
   583    lua_unlock(L);
hughperkins commented 9 years ago

Cool. Can you type 'bt', and paste the output?

szagoruyko commented 9 years ago

I've got the same segfault, it fails here https://github.com/hughperkins/cltorch/blob/master/init.cpp#L157

hughperkins commented 9 years ago

Ok. That line is not the actual root cause though, right? Something earlier, in the Lua? https://github.com/hughperkins/cltorch/blob/master/init.lua ? Or something odd about the way the library is built?

hughperkins commented 9 years ago

(e.g. cutorch has the same first line, in fact: https://github.com/torch/cutorch/blob/master/init.c#L763 )

hughperkins commented 9 years ago

Hmmm, maybe it's some mismatch between the Lua library linked against at build time and the one linked against at runtime? E.g. different versions?

hughperkins commented 9 years ago

It's getting lua from EasyCL. Maybe that's not a good sign?

hughperkins commented 9 years ago

@szagoruyko I guess that when we build EasyCL for torch, we should persuade it to link against libluajit.so, rather than against its own internal Lua?

user@pear:~/git/cltorch$ grep newtable EasyCL/thirdparty/lua-5.1.5/src/*.h
EasyCL/thirdparty/lua-5.1.5/src/lua.h:#define lua_newtable(L)       lua_createtable(L, 0, 0)
user@pear:~/git/cltorch$ nm ~/torch/install/lib/libluajit.so | grep createtable
0000000000068290 T lua_createtable
user@pear:~/git/cltorch$ nm ~/torch/install/lib/libEasyCL.so | grep createtable
000000000002bff0 T lua_createtable
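
On the Mac side, the same kind of check should be possible with nm and otool (sketch; the paths are guesses for a default torch install):

nm ~/torch/install/lib/libEasyCL.dylib | grep createtable
otool -L ~/torch/install/lib/lua/5.1/libcltorch.so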
szagoruyko commented 9 years ago

yes that's the issue probably

hughperkins commented 9 years ago

Any idea where the link_libraries(libluajit.so) statement is? I've been hunting around in ~/torch/install/share/cmake/torch, but haven't found it yet.

hughperkins commented 9 years ago

Ok. Can you try the following please:

git clone --recursive https://github.com/hughperkins/cltorch.git -b new-clew cltorch-nolua
cd cltorch-nolua
luarocks make rocks/cltorch-scm-1.rockspec
hughperkins commented 9 years ago

Oh wait, this branch combines two 'fixes', and the first fix doesn't quite work. I'll let you know once this branch is ready to try.

hughperkins commented 9 years ago

Note: the other issue in this branch, the missing OpenCL include files, should be fixed now. So the only open issue is the Lua-building bit. Per the other thread, using the new-clew branch gives an error about missing Lua imports?

hughperkins commented 9 years ago

Hi guys, do you mind seeing if you are able to build the following repo?

git clone https://github.com/hughperkins/prototyping-lualink.git
cd prototyping-lualink
./run.sh

I don't have a Mac, so I'm unsure how to build Lua modules in a cross-platform way. This works for me on Ubuntu 14.04 64-bit.

mlajtos commented 9 years ago
/Users/milanlajtos/Downloads/prototyping-lualink/mylib.cpp:5:12: fatal error: 'lua.h' file not found
  #include "lua.h"
           ^
1 error generated.
make[2]: *** [CMakeFiles/mylib.dir/mylib.cpp.o] Error 1
make[1]: *** [CMakeFiles/mylib.dir/all] Error 2
make: *** [all] Error 2
luajit: test.lua:1: module 'mylib' not found:
    no field package.preload['mylib']
    no file './mylib.lua'
    no file '/Users/milanlajtos/torch/install/share/luajit-2.1.0-alpha/mylib.lua'
    no file '/usr/local/share/lua/5.1/mylib.lua'
    no file '/usr/local/share/lua/5.1/mylib/init.lua'
    no file '/Users/milanlajtos/torch/install/share/lua/5.1/mylib.lua'
    no file '/Users/milanlajtos/torch/install/share/lua/5.1/mylib/init.lua'
    no file 'build/libmylib.so'
stack traceback:
    [C]: in function 'require'
    test.lua:1: in main chunk
    [C]: at 0x0107188630

Changing the path in CMakeLists.txt resulted in this:

CMake Warning (dev):
  Policy CMP0042 is not set: MACOSX_RPATH is enabled by default.  Run "cmake
  --help-policy CMP0042" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

  MACOSX_RPATH is not specified for the following targets:

   mylib

This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /Users/milanlajtos/Downloads/prototyping-lualink/build
Scanning dependencies of target mylib
[100%] Building CXX object CMakeFiles/mylib.dir/mylib.cpp.o
Linking CXX shared library libmylib.dylib
Undefined symbols for architecture x86_64:
  "_luaL_register", referenced from:
      _luaopen_mylib in mylib.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [libmylib.dylib] Error 1
make[1]: *** [CMakeFiles/mylib.dir/all] Error 2
make: *** [all] Error 2
luajit: test.lua:1: module 'mylib' not found:
    no field package.preload['mylib']
    no file './mylib.lua'
    no file '/Users/milanlajtos/torch/install/share/luajit-2.1.0-alpha/mylib.lua'
    no file '/usr/local/share/lua/5.1/mylib.lua'
    no file '/usr/local/share/lua/5.1/mylib/init.lua'
    no file '/Users/milanlajtos/torch/install/share/lua/5.1/mylib.lua'
    no file '/Users/milanlajtos/torch/install/share/lua/5.1/mylib/init.lua'
    no file 'build/libmylib.so'
stack traceback:
    [C]: in function 'require'
    test.lua:1: in main chunk
    [C]: at 0x010197b630
hughperkins commented 9 years ago

Hi. Thanks! Can you git pull and retry please?

hughperkins commented 9 years ago

(By the way, a relevant page I'm using to help figure this out: http://lua-users.org/wiki/BuildingModules )

hughperkins commented 9 years ago

Hmmm, looks like what I might need is what's in TorchPackage.cmake, i.e. https://github.com/torch/torch7/blob/master/cmake/TorchPackage.cmake#L20 :

    IF(APPLE)
      SET_TARGET_PROPERTIES(${package} PROPERTIES
        LINK_FLAGS "-undefined dynamic_lookup")
    ENDIF()
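
For the prototyping-lualink repo, a minimal CMakeLists.txt along those lines might look roughly like this (just a sketch; the LUA_INCDIR variable is illustrative, not necessarily what the repo actually uses):

CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
PROJECT(mylib)
# point this at the directory containing lua.h, e.g. ~/torch/install/include
INCLUDE_DIRECTORIES(${LUA_INCDIR})
ADD_LIBRARY(mylib SHARED mylib.cpp)
IF(APPLE)
  # leave lua_* symbols undefined, to be resolved from the host luajit at load time
  SET_TARGET_PROPERTIES(mylib PROPERTIES
    LINK_FLAGS "-undefined dynamic_lookup")
ENDIF()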
hughperkins commented 9 years ago

Hi, updated to use SET_TARGET_PROPERTIES, as per TorchPackage.cmake. Can you git pull and send the output of running it please?

hughperkins commented 9 years ago

(Alternatively, I've gone ahead and updated EasyCL etc. with this linker option, so you could just try reinstalling cltorch from the new-clew branch, and then try the prototype only if the main install still fails.)

(Edited to specify that the new-clew branch should be used.)

mlajtos commented 9 years ago

(Sorry, I was away.)

I've tried the new-clew branch and the build failed with this:

Linking CXX shared library libTHCl.dylib
Undefined symbols for architecture x86_64:
  "___clewCreateBuffer", referenced from:
      CLKernel* CLKernel::input<TensorInfoCl>(int, TensorInfoCl const*) in THClKernels.cpp.o
      CLKernel* CLKernel::inout<TensorInfoCl const>(int, TensorInfoCl const*) in THClKernels.cpp.o
      CLKernel* CLKernel::output<TensorInfoCl const>(int, TensorInfoCl const*) in THClKernels.cpp.o
      CLKernel* CLKernel::input<TensorInfoCl>(int, TensorInfoCl const*) in THClGather.cpp.o
      CLKernel* CLKernel::input<TensorInfoCl>(int, TensorInfoCl const*) in THClScatter.cpp.o
  "___clewReleaseEvent", referenced from:
      _THClBlas_gemv in THClBlas.cpp.o
      _THClBlas_ger in THClBlas.cpp.o
      _THClBlas_gemm in THClBlas.cpp.o
  "___clewSetKernelArg", referenced from:
      CLKernel* CLKernel::input<TensorInfoCl>(int, TensorInfoCl const*) in THClKernels.cpp.o
      CLKernel* CLKernel::inout<TensorInfoCl const>(int, TensorInfoCl const*) in THClKernels.cpp.o
      CLKernel* CLKernel::output<TensorInfoCl const>(int, TensorInfoCl const*) in THClKernels.cpp.o
      CLKernel* CLKernel::input<TensorInfoCl>(int, TensorInfoCl const*) in THClGather.cpp.o
      CLKernel* CLKernel::input<TensorInfoCl>(int, TensorInfoCl const*) in THClScatter.cpp.o
  "___clewWaitForEvents", referenced from:
      _THClBlas_dot in THClBlas.cpp.o
      _THClBlas_gemv in THClBlas.cpp.o
      _THClBlas_ger in THClBlas.cpp.o
      _THClBlas_gemm in THClBlas.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [lib/THCl/libTHCl.dylib] Error 1
make[1]: *** [lib/THCl/CMakeFiles/THCl.dir/all] Error 2
make: *** [all] Error 2

Error: Build error: Failed building.

I also pulled the changes for prototyping-lualink, and everything went as expected:

Scanning dependencies of target mylib
[100%] Building CXX object CMakeFiles/mylib.dir/mylib.cpp.o
Linking CXX shared library libmylib.dylib
[100%] Built target mylib
hello!
hello!
hery commented 9 years ago

Hello! Same error here. Will try to look more into it as well!

hughperkins commented 9 years ago

Hi. Ok, good news that the Lua-linking problem is fixed :-) The clew linking is a bit curious. I would have thought it would be sufficient to link with EasyCL, which already contains these symbols, at least in undefined/unbound form (not sure of the exact technical term), but maybe on Mac, undefined symbols from one .so are not considered 'good enough' when linking another against it?

On Ubuntu I get the following results from nm on the relevant .so's:

user@pear:~/torch/install/lib$ nm libEasyCL.so | grep clew
                 U __clewBuildProgram
                 U __clewCreateBuffer
                 U __clewCreateCommandQueue
                 U __clewCreateContext
                 U __clewCreateKernel
                 U __clewCreateProgramWithSource
                 U __clewEnqueueCopyBuffer
                 U __clewEnqueueNDRangeKernel
                 U __clewEnqueueReadBuffer
                 U __clewEnqueueWriteBuffer
                 U __clewFinish
                 U __clewGetDeviceIDs
                 U __clewGetDeviceInfo
                 U __clewGetPlatformIDs
                 U __clewGetPlatformInfo
                 U __clewGetProgramBuildInfo
                 U clewInit
                 U __clewReleaseCommandQueue
                 U __clewReleaseContext
                 U __clewReleaseEvent
                 U __clewReleaseKernel
                 U __clewReleaseMemObject
                 U __clewReleaseProgram
                 U __clewSetKernelArg
                 U __clewWaitForEvents
user@pear:~/torch/install/lib$ nm libTHCl.so | grep clew
                 U __clewCreateBuffer
                 U __clewReleaseEvent
                 U __clewSetKernelArg
                 U __clewWaitForEvents

i.e. __clewSetKernelArg is in both, and undefined in both. At link time, for libTHCl.dylib, __clewSetKernelArg comes from libEasyCL.dylib. At runtime, __clewSetKernelArg will come from clew:

user@pear:~/torch/install/lib$ ldd libEasyCL.so 
    linux-vdso.so.1 =>  (0x00007fff5a1c2000)
    libclew.so.1.0.0 => /home/user/torch/install/lib/libclew.so.1.0.0 (0x00007f68468aa000)
     ...
user@pear:~/torch/install/lib$ nm libclew.so | grep clew | grep "Arg"
00000000002051d8 B __clewSetKernelArg

clew is a proxy for the OpenCL library, so we don't need to bind to it either at build time or at load time; we just bind at the moment we actually want to use it. Admittedly, clew is somewhat optional when creating a module that will only ever be used in the presence of an OpenCL-enabled device, as is the case for cltorch, but anyway, I use it everywhere and it works well for me, so until now I've seen no reason not to use it :-)
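
(For reference, the idea behind such a proxy, in very rough sketch form, is just function pointers resolved via dlopen/dlsym at the point of first use, rather than link-time binding. This is not clew's actual code, just the shape of it:)

#include <dlfcn.h>
#include <stddef.h>

/* sketch of one lazily-bound OpenCL entry point; the signature is simplified */
typedef int (*clGetPlatformIDs_fn)(unsigned int, void **, unsigned int *);
static clGetPlatformIDs_fn p_clGetPlatformIDs = NULL;

static int proxyInit(void) {
#ifdef __APPLE__
    void *lib = dlopen("/System/Library/Frameworks/OpenCL.framework/OpenCL", RTLD_NOW);
#else
    void *lib = dlopen("libOpenCL.so", RTLD_NOW);
#endif
    if (lib == NULL) return -1;  /* no OpenCL runtime present on this machine */
    p_clGetPlatformIDs = (clGetPlatformIDs_fn)dlsym(lib, "clGetPlatformIDs");
    return p_clGetPlatformIDs != NULL ? 0 : -1;
}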

I'm going to guess that on a Mac we have two options: either explicitly link libTHCl against libclew, or add the -undefined dynamic_lookup linker option so that the clew symbols are left to be resolved at runtime.

hughperkins commented 9 years ago

I'm currently tentatively leaning towards adding in the -undefined dynamic_lookup linker option, which will hopefully squelch all the other platform-specific errors I've been seeing on mac builds.

hughperkins commented 9 years ago

Ok, so I've updated libTHCl.so to link also with libclew.so:

user@pear:~/git/cltorch$ ldd ~/torch/install/lib/libTHCl.so 
   ...
    libEasyCL.so => /home/user/torch/install/lib/../lib/libEasyCL.so (0x00007ffbb5015000)
    libclew.so.1.0.0 => /home/user/torch/install/lib/../lib/libclew.so.1.0.0 (0x00007ffbb4e0f000)
  ...

Can you pull down the latest version of the new-clew branch (i.e. git pull), and try again?
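
For instance, assuming the earlier cltorch-nolua checkout, something like:

cd cltorch-nolua
git pull
luarocks make rocks/cltorch-scm-1.rockspec
th -l cltorch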

hughperkins commented 9 years ago

(Hmmm, I contradicted myself, and did it the opposite way I said I would do :-P Anyway...)

szagoruyko commented 9 years ago

new-clew works for me:

th> cltorch.getDeviceProperties(1)
{
  deviceType : "CPU"
  maxClockFrequency : 2300
  deviceName : "Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz"
  maxMemAllocSizeMB : 4096
  globalMemCachelineSizeKB : 6144
  deviceVersion : "OpenCL 1.2 "
  localMemSizeKB : 32
  openClCVersion : "OpenCL C 1.2 "
  maxWorkGroupSize : 1024
  globalMemSizeMB : 16384
  platformVendor : "Apple"
  maxComputeUnits : 8
}
                                                                      [0.0006s]
th> cltorch.getDeviceProperties(2)
{
  deviceType : "GPU"
  maxClockFrequency : 1200
  deviceName : "HD Graphics 4000"
  maxMemAllocSizeMB : 256
  globalMemCachelineSizeKB : 0
  deviceVersion : "OpenCL 1.2 "
  localMemSizeKB : 64
  openClCVersion : "OpenCL C 1.2 "
  maxWorkGroupSize : 512
  globalMemSizeMB : 1024
  platformVendor : "Apple"
  maxComputeUnits : 16
}
                                                                      [0.0002s]
th> cltorch.getDeviceProperties(3)
{
  deviceType : "GPU"
  maxClockFrequency : 774
  deviceName : "GeForce GT 650M"
  maxMemAllocSizeMB : 128
  globalMemCachelineSizeKB : 0
  deviceVersion : "OpenCL 1.2 "
  localMemSizeKB : 48
  openClCVersion : "OpenCL C 1.2 "
  maxWorkGroupSize : 1024
  globalMemSizeMB : 512
  platformVendor : "Apple"
  maxComputeUnits : 2
}
mlajtos commented 9 years ago

Pulled the changes; the build was successful, but loading failed:

th> require 'cltorch'
...s/milanlajtos/torch/install/share/lua/5.1/trepl/init.lua:319: ...milanlajtos/torch/install/share/lua/5.1/cltorch/init.lua:19: cannot load '/Users/milanlajtos/torch/install/lib/lua/5.1/libcltorch.so'
stack traceback:
    [C]: in function 'error'
    ...s/milanlajtos/torch/install/share/lua/5.1/trepl/init.lua:319: in function 'f'
    [string "local f = function() return require 'cltorch'..."]:1: in main chunk
    [C]: in function 'xpcall'
    ...s/milanlajtos/torch/install/share/lua/5.1/trepl/init.lua:583: in function 'repl'
    ...jtos/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk
    [C]: at 0x0107ae4630
hughperkins commented 9 years ago

@szagoruyko Ok, cool :-) @mlajtos hmmm, ok, pondering...

hughperkins commented 9 years ago

@mlajtos : can you confirm that other libraries load for you, eg require 'nn'?

mlajtos commented 9 years ago

Wait, fish is not loading some paths properly; bash is okay. Fixed! Thank you very much, @hughperkins! :)
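
(For anyone else hitting this with fish: the paths that the torch install sets up for bash need a fish equivalent. A rough, untested sketch of what could go in ~/.config/fish/config.fish, with paths adjusted to your install:

set -gx PATH $HOME/torch/install/bin $PATH
set -gx DYLD_LIBRARY_PATH "$HOME/torch/install/lib"
set -gx LUA_PATH "$HOME/torch/install/share/lua/5.1/?.lua;$HOME/torch/install/share/lua/5.1/?/init.lua;;"
set -gx LUA_CPATH "$HOME/torch/install/lib/lua/5.1/?.so;;"
)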

hughperkins commented 9 years ago

Ok, cool :-) So, no more segfault, it builds ok, and at least loads ok now?

mlajtos commented 9 years ago

It loads, and it seems to work as it should:

th> c = torch.ClTensor{7,4,5}
Using Apple platform: Apple
Using device: Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
                                                                      [0.0040s] 
th> c
 7
 4
 5
[torch.ClTensor of size 3]

:)

hughperkins commented 9 years ago

Cool :-)

szagoruyko commented 9 years ago

@hughperkins how do you run the tests? It would be nice to have cltorch.test(), as in cutorch.

hughperkins commented 9 years ago

I do:

./run-unit-tensor.sh

... but you're right that I should... oh... I see... I don't have to migrate my tests to another testing API, just add a cltorch.test() function? Seems doable. Will ponder this.
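
Something like the following is probably roughly the shape it would take (just a sketch, modelled loosely on cutorch's entry point; the test body here is only an illustrative placeholder, not one of the existing tests):

local function test()
   local tester = torch.Tester()
   local tests = {}

   -- placeholder test: elementwise add on a ClTensor, checked against a FloatTensor
   function tests.add()
      local a = torch.ClTensor{1, 2, 3}
      local b = torch.ClTensor{4, 5, 6}
      tester:assertTensorEq((a + b):float(), torch.FloatTensor{5, 7, 9}, 1e-6, 'ClTensor add failed')
   end

   tester:add(tests)
   tester:run()
end

cltorch.test = test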

hughperkins commented 9 years ago

Ok. Merged to master, and it seems that installing via luarocks install cltorch works ok, at least on my particular platform => closing.

mlajtos commented 9 years ago

run-unit-tensor.sh does not exist, so I tried run-test-storage.sh and got the following:

$ ./run-test-storage.sh 
./run-test-storage.sh: line 5: /Users/milanlajtos/torch/activate: No such file or directory

However, I managed to use cltorch with Karpathy's char-rnn. I got a ton of errors, and it was twice as slow as the CPU, but it worked.

Again, thank you for your effort. You rock!

hughperkins commented 9 years ago

@mlajtos yes, char-rnn runs on opencl, but is not super speedy yet. One step at a time :-)

hughperkins commented 9 years ago

(By the way, what errors do you get when you run char-rnn? For me it runs clean, just not very speedily.)

Ambext commented 9 years ago

cltorch and clnn build fine now, but running char-rnn doesn't work at all.

The output is the following:

Exowide:char-rnn mnemonis$ th train.lua -data_dir data/tinyshakespeare/ -opencl 1
registering spatialconvolutionmm
using OpenCL on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
Using Apple platform: Apple
Using device: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
statefultimer v0.6
THClApply.cl build log:
:27:7: warning: no previous prototype for function 'reduceOp'
:47:6: warning: no previous prototype for function 'TensorInfo_isContiguous'
:55:14: warning: no previous prototype for function 'IndexToOffset_998_get'
:59:14: warning: no previous prototype for function 'IndexToOffset_999_get'
:74:14: warning: no previous prototype for function 'getLinearBlockId'
:82:7: warning: no previous prototype for function 'reduceBlock'
:90:23: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int'
:97:31: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int'
:196:6: warning: no previous prototype for function 'op'
kernel source:
[... full source of the generated THClTensor_pointwiseApplyD kernel omitted ...]
Invalid work group size, code -54
/Users/mnemonis/torch/install/bin/luajit: ./util/model_utils.lua:56:
kernel source:
[... kernel source listing repeated, truncated ...]
stack traceback:
    [C]: in function 'fill'
    ./util/model_utils.lua:56: in function 'flatten'
    ./util/model_utils.lua:103: in function 'combine_all_parameters'
    train.lua:160: in main chunk
    [C]: in function 'dofile'
    ...onis/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x0103850320
hughperkins commented 9 years ago

@Ambext Ah, thanks. I can probably fix this (I mean, I can... and probably should), but... it looks like you are running this on the CPU? I mean, not on the CPU's integrated graphics GPU, but on the actual CPU part of the CPU. cltorch doesn't target that scenario at all. It probably could, but I think it's better to get things running ok-ish on the GPU part of CPUs first. Can you try choosing a different device, using the -gpuid option?

(Note: you can see which device is which by doing th -l cltorch and then cltorch.getDeviceProperties(1), cltorch.getDeviceProperties(2), etc. I think you need to subtract one from this number when you pass it to char-rnn's train.lua.)

(Edit: basically, where it says Using device: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz, I would expect it to say instead something like Using device: HD Graphics 4000, as per szagoruyko's example above.)
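
So, for example, if cltorch.getDeviceProperties(2) turns out to be the GPU-type device on your machine, then (assuming the subtract-one convention above holds) the invocation would be something like:

th train.lua -data_dir data/tinyshakespeare/ -opencl 1 -gpuid 1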

hughperkins commented 9 years ago

(Hmmm, perhaps I should make getDeviceProperties and setDevice expose only GPU-type devices, including integrated graphics, but not the CPU-as-a-CPU part. There seems to be no obvious reason to expose unusable devices, in fact?)

Ambext commented 9 years ago

Referring to the char-rnn README: "If you'd like to use OpenCL GPU computing, you'll first need to install the cltorch and clnn packages, and then use the option -opencl 1 during training".

Any gpuid call (besides gpuid -1) yields a cutorch / cunn not installed error.
Any opencl X with X >= 2 just executes with no error or message (I assume it just falls back on the CPU).
opencl 1 returns the error mentioned above.