hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0

CUDA required? #63

Open mihailescu2m opened 7 years ago

mihailescu2m commented 7 years ago

Hi,

I've installed coriander successfully, and am now trying to get TF installed on a non-NVIDIA OpenCL 1.2 GPU (Linux). When running util/run_configure.sh, I see that TF does not select CUDA:

No Google Cloud Platform support will be enabled for TensorFlow
No Hadoop File System support will be enabled for TensorFlow
Found possible Python library paths:
  /usr/lib/python3/dist-packages
  /usr/local/lib/python3.5/dist-packages
Please input the desired Python library path to use.  Default is [/usr/lib/python3/dist-packages]
/usr/lib/python3/dist-packages
No GPU support will be enabled for TensorFlow
checking operating system
Configuration finished

So, does this mean that TF will not get GPU support, and that even if coriander translates OpenCL->CUDA, TF will not use it? Is there a way to force TF_NEED_CUDA?

mihailescu2m commented 7 years ago

Just to be sure, I ran the configure script manually, and selected the CUDA installation.

I still have 2 problems:

1) After compiling and installing coriander (after the build_coriander script) I was able to compile and run the cuda sample program:

odroid@odroid:~/src/cuda$ ./cuda_sample
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
hostFloats[2] 123
hostFloats[2] 222
hostFloats[2] 444

However, trying it again now, compilation fails with a segmentation fault:

odroid@odroid:~/src/tensorflow$ cocl cuda_sample.cu
cocl args: cuda_sample.cu
+ /home/odroid/src/tensorflow/tf-coriander/soft/llvm-4.0/bin/clang++ -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O0 -S -D__CUDACC__ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/home/odroid/src/tensorflow/tf-coriander/soft/llvm-4.0/include -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/odroid/src/tensorflow/tf-coriander/soft/llvm-4.0/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -ffunction-sections -fdata-sections -O3 -fexceptions -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/include/EasyCL -I/usr/local/include/cocl -I/usr/local/src -I/usr/local/src/EasyCL -I/usr/local/src/EasyCL/thirdparty/clew/include -include /usr/local/include/cocl/cocl.h -include /usr/local/include/cocl/fake_funcs.h -include /usr/local/include/cocl/cocl_deviceside.h -I/usr/local/include ./cuda_sample.cu -o ./cuda_sample-device-noopt.ll
/usr/local/bin/cocl_wrapped: line 393: 12486 Segmentation fault      ${CLANG_HOME}/bin/clang++ ${PASSTHRU} -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O0 -S ${ADDFLAGS} -D__CUDACC__ -Wno-gnu-anonymous-struct -Wno-nested-anon-types ${LLVM_COMPILE_FLAGS} -I${COCL_HOME}/include/EasyCL -I${COCL_HOME}/include/cocl -I${COCL_HOME}/src -I${COCL_HOME}/src/EasyCL -I${COCL_HOME}/src/EasyCL/thirdparty/clew/include -include ${COCL_HOME}/include/cocl/cocl.h -include ${COCL_HOME}/include/cocl/fake_funcs.h -include ${COCL_HOME}/include/cocl/cocl_deviceside.h -I${COCL_HOME}/include ${INCLUDES} ${INPUTBASEPATH}${INPUTPOSTFIX} -o ${OUTPUTBASEPATH}-device-noopt.ll

Any ideas why this happens?

2) TF fails to build with:

ERROR: /home/odroid/.cache/bazel/_bazel_odroid/bf26e11cb9ace4c240bd220fa38d9ec4/external/protobuf/BUILD:73:1: undeclared inclusion(s) in rule '@protobuf//:protobuf_lite':
this rule is missing dependency declarations for the following files included by 'external/protobuf/src/google/protobuf/stubs/time.cc':
  '/usr/arm-linux-gnueabihf/include/stdc-predef.h'
  '/usr/arm-linux-gnueabihf/include/features.h'
  '/usr/arm-linux-gnueabihf/include/sys/cdefs.h'
  '/usr/arm-linux-gnueabihf/include/bits/wordsize.h'
  '/usr/arm-linux-gnueabihf/include/gnu/stubs.h'
  '/usr/arm-linux-gnueabihf/include/gnu/stubs-hard.h'
  '/usr/arm-linux-gnueabihf/include/wchar.h'
  '/usr/arm-linux-gnueabihf/include/stdio.h'
  '/usr/lib/gcc/arm-linux-gnueabihf/6/include/stdarg.h'
  '/usr/arm-linux-gnueabihf/include/bits/wchar.h'
  '/usr/lib/gcc/arm-linux-gnueabihf/6/include/stddef.h'
  '/usr/arm-linux-gnueabihf/include/xlocale.h'
  '/usr/lib/gcc/arm-linux-gnueabihf/6/include/stdint.h'
  '/usr/arm-linux-gnueabihf/include/stdint.h'
  '/usr/arm-linux-gnueabihf/include/locale.h'
  '/usr/arm-linux-gnueabihf/include/bits/locale.h'
  '/usr/arm-linux-gnueabihf/include/ctype.h'
  '/usr/arm-linux-gnueabihf/include/bits/types.h'
  '/usr/arm-linux-gnueabihf/include/bits/typesizes.h'
  '/usr/arm-linux-gnueabihf/include/endian.h'
  '/usr/arm-linux-gnueabihf/include/bits/endian.h'
  '/usr/arm-linux-gnueabihf/include/bits/byteswap.h'
  '/usr/arm-linux-gnueabihf/include/bits/byteswap-16.h'
  '/usr/arm-linux-gnueabihf/include/pthread.h'
  '/usr/arm-linux-gnueabihf/include/sched.h'
  '/usr/arm-linux-gnueabihf/include/time.h'
  '/usr/arm-linux-gnueabihf/include/bits/sched.h'
  '/usr/arm-linux-gnueabihf/include/bits/time.h'
  '/usr/arm-linux-gnueabihf/include/bits/timex.h'
  '/usr/arm-linux-gnueabihf/include/bits/pthreadtypes.h'
  '/usr/arm-linux-gnueabihf/include/bits/setjmp.h'
  '/usr/arm-linux-gnueabihf/include/stdlib.h'
  '/usr/arm-linux-gnueabihf/include/bits/waitflags.h'
  '/usr/arm-linux-gnueabihf/include/bits/waitstatus.h'
  '/usr/arm-linux-gnueabihf/include/sys/types.h'
  '/usr/arm-linux-gnueabihf/include/sys/select.h'
  '/usr/arm-linux-gnueabihf/include/bits/select.h'
  '/usr/arm-linux-gnueabihf/include/bits/sigset.h'
  '/usr/arm-linux-gnueabihf/include/sys/sysmacros.h'
  '/usr/arm-linux-gnueabihf/include/alloca.h'
  '/usr/arm-linux-gnueabihf/include/bits/stdlib-float.h'
  '/usr/arm-linux-gnueabihf/include/libio.h'
  '/usr/arm-linux-gnueabihf/include/_G_config.h'
  '/usr/arm-linux-gnueabihf/include/bits/stdio_lim.h'
  '/usr/arm-linux-gnueabihf/include/bits/sys_errlist.h'
  '/usr/arm-linux-gnueabihf/include/errno.h'
  '/usr/arm-linux-gnueabihf/include/bits/errno.h'
  '/usr/arm-linux-gnueabihf/include/linux/errno.h'
  '/usr/arm-linux-gnueabihf/include/asm/errno.h'
  '/usr/arm-linux-gnueabihf/include/asm-generic/errno.h'
  '/usr/arm-linux-gnueabihf/include/asm-generic/errno-base.h'
  '/usr/arm-linux-gnueabihf/include/assert.h'
  '/usr/arm-linux-gnueabihf/include/string.h'
  '/usr/arm-linux-gnueabihf/include/sys/param.h'
  '/usr/lib/gcc/arm-linux-gnueabihf/6/include-fixed/limits.h'
  '/usr/lib/gcc/arm-linux-gnueabihf/6/include-fixed/syslimits.h'
  '/usr/arm-linux-gnueabihf/include/limits.h'
  '/usr/arm-linux-gnueabihf/include/bits/posix1_lim.h'
  '/usr/arm-linux-gnueabihf/include/bits/local_lim.h'
  '/usr/arm-linux-gnueabihf/include/linux/limits.h'
  '/usr/arm-linux-gnueabihf/include/bits/posix2_lim.h'
  '/usr/arm-linux-gnueabihf/include/bits/xopen_lim.h'
  '/usr/arm-linux-gnueabihf/include/signal.h'
  '/usr/arm-linux-gnueabihf/include/bits/signum.h'
  '/usr/arm-linux-gnueabihf/include/bits/siginfo.h'
  '/usr/arm-linux-gnueabihf/include/bits/sigaction.h'
  '/usr/arm-linux-gnueabihf/include/bits/sigcontext.h'
  '/usr/arm-linux-gnueabihf/include/asm/sigcontext.h'
  '/usr/arm-linux-gnueabihf/include/bits/sigstack.h'
  '/usr/arm-linux-gnueabihf/include/sys/ucontext.h'
  '/usr/arm-linux-gnueabihf/include/bits/sigthread.h'
  '/usr/arm-linux-gnueabihf/include/bits/param.h'
  '/usr/arm-linux-gnueabihf/include/linux/param.h'
  '/usr/arm-linux-gnueabihf/include/asm/param.h'
  '/usr/arm-linux-gnueabihf/include/asm-generic/param.h'
  '/usr/arm-linux-gnueabihf/include/byteswap.h'.
Target @grpc//:grpc_cpp_plugin failed to build
Use --verbose_failures to see the command lines of failed build steps.

hughperkins commented 7 years ago

> So, does this mean that TF will not get GPU support, and that even if coriander translates OpenCL->CUDA, TF will not use it? Is there a way to force TF_NEED_CUDA?

tf-coriander provides an OpenCL implementation of Tensorflow. It doesn't support CUDA devices, except to the extent that such devices also provide an OpenCL 1.2-compatible API.

hughperkins commented 7 years ago

> Just to be sure, I ran the configure script manually, and selected the CUDA installation.

Enabling CUDA is not supported. Please reconfigure from scratch.

Note that you can ignore the message saying it won't use a GPU. Please mentally reinterpret this to mean 'won't use CUDA, but will use OpenCL'.

mihailescu2m commented 7 years ago

Thanks, I will reconfigure and recompile. I thought CUDA was required (libs installed and TF support enabled) because I understood that you do CUDA -> OpenCL translation (TF -> CUDA -> coriander OpenCL). Glad to be wrong :)

mihailescu2m commented 7 years ago

@hughperkins - I reconfigured, but I still get errors similar to the ones above. Somehow, bazel generates /usr/arm-linux-gnueabihf/include instead of /usr/include.

Another thing which is not clear: do I need CUDA installed or not? I've seen some COCL files that reference /usr/local/cuda-8.0 -- will this work with CUDA 6.5? (I can just symlink /usr/local/cuda-8.0 to /usr/local/cuda-6.5 if that would make things simpler, as long as CUDA 6.5 is OK.)

Thanks.

hughperkins commented 7 years ago

CUDA Toolkit is not only not required, but its presence will actually break the compile process :)

Can you uninstall all CUDA Toolkit versions, and then redo from the configure step please?

hughperkins commented 7 years ago

(by the way, which files do you see that you feel are referencing /usr/local/cuda-8.0?)

mihailescu2m commented 7 years ago

This makes sense - after uninstalling CUDA I could compile again using cocl without getting the segmentation fault. I also managed to find the correct file to add my include paths to, which got rid of the error I was getting before. I saw the CUDA references while looking for include paths; they didn't actually break anything for me. I still have some errors when compiling some 3rd party packages, but that's for next week :)

Thanks

Dexdev08 commented 7 years ago

Interesting to see if this runs on the odroid platform! (Replying by email to mihailescu2m's comment of Fri, 8 Sep 2017, 18:45.)

mihailescu2m commented 7 years ago

@hughperkins I am now getting this error:

In file included from tensorflow/stream_executor/cl/cl_driver.cc:28:0:
./tensorflow/stream_executor/lib/casts.h: In instantiation of 'Dest perftools::gputools::port::bit_cast(const Source&) [with Dest = long long int; Source = void*]':
tensorflow/stream_executor/cl/cl_driver.cc:1022:61:   required from here
./tensorflow/stream_executor/lib/casts.h:90:3: error: static assertion failed: src and dst types must have equal sizes
   static_assert(sizeof(Dest) == sizeof(Source),

It looks like it's coming from CUdeviceptr pointer = port::bit_cast<CUdeviceptr>(location); where void * and CUdeviceptr have different sizes (on armhf) ... any idea which one has to be fixed?
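For illustration, the failing check can be reproduced in isolation. This is a minimal sketch, not the code from casts.h: `same_size` and `bit_cast_sketch` are hypothetical names, and the only thing taken from the error message is the rule that source and destination must have equal sizes. On armhf a `void *` is 4 bytes while `long long` is 8, so `same_size<long long, void *>()` would be false there.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the size rule quoted in the error message:
// both types must occupy the same number of bytes.
template <typename Dest, typename Source>
constexpr bool same_size() {
  return sizeof(Dest) == sizeof(Source);
}

// Minimal bit_cast-style helper (an illustration, not the TensorFlow source).
// It copies the object representation byte for byte, so it only makes sense
// when the two types have identical sizes.
template <typename Dest, typename Source>
Dest bit_cast_sketch(const Source &source) {
  static_assert(sizeof(Dest) == sizeof(Source),
                "src and dst types must have equal sizes");
  Dest dest;
  std::memcpy(&dest, &source, sizeof(dest));
  return dest;
}
```

With a pointer-sized integer such as `std::uintptr_t` as the destination, `bit_cast_sketch<std::uintptr_t>(location)` compiles on mainstream 32-bit and 64-bit targets alike, whereas a fixed 8-byte destination trips the `static_assert` wherever pointers are 4 bytes.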

hughperkins commented 7 years ago

Interesting. So, the type of location is a constraint, which is inviolable, fixed by the client application source-code. And so what we need to change is the Coriander definition of CUdeviceptr, which is here:

https://github.com/hughperkins/coriander/blob/master/include/cocl/cocl_memory.h#L40

typedef long long CUdeviceptr;

I don't remember what constraints led me to choose this type, but note that there are conceptually two sets of constraints:

In any case, given the void * castable constraint, you could try something like:

class CUDevice {
};

typedef CUDevice *CUdeviceptr;

There might be a few cascading changes required, since the current declarations might assume a non-pointer type, and need tweaking somehow.
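A rough sketch of what those cascading changes could look like, assuming the pointer typedef suggested above. The helper names (`to_device_ptr`, `to_void_ptr`, `offset_bytes`) are hypothetical and are not part of Coriander; the point is only that conversions and byte-offset arithmetic, which an integer typedef gets for free, need explicit casts once `CUdeviceptr` is a pointer.

```cpp
#include <cstddef>

// Empty tag type from the suggestion above; it is never instantiated.
class CUDevice {
};

typedef CUDevice *CUdeviceptr;

// A pointer typedef follows the platform's pointer width, which is what the
// bit_cast size check needs (holds on mainstream 32-bit and 64-bit targets).
static_assert(sizeof(CUdeviceptr) == sizeof(void *),
              "CUdeviceptr must be the same size as void *");

// Hypothetical helper: wrap a raw location as a CUdeviceptr.
inline CUdeviceptr to_device_ptr(void *location) {
  return static_cast<CUdeviceptr>(location);
}

// Hypothetical helper: recover the raw location.
inline void *to_void_ptr(CUdeviceptr pointer) {
  return static_cast<void *>(pointer);
}

// Hypothetical helper: byte-offset arithmetic has to go through char *,
// because CUDevice is only a tag and has no meaningful element size.
inline CUdeviceptr offset_bytes(CUdeviceptr base, std::ptrdiff_t bytes) {
  return reinterpret_cast<CUdeviceptr>(reinterpret_cast<char *>(base) + bytes);
}
```

Any existing code that adds an integer offset directly to a `CUdeviceptr`, or compares it against `0`, would be a candidate for this kind of tweak.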

mihailescu2m commented 7 years ago

I am trying typedef unsigned int CUdeviceptr since that's the one used by CUDA as well, hope it won't break other things (coriander) ....

EDIT: https://devtalk.nvidia.com/default/topic/467742/cudeviceptr-should-be-typdedef-39-d-as-void-instead-of-unsigned-int/

hughperkins commented 7 years ago

Please don't look at things that assume a EULA click-through etc. Any type you choose should be justified by looking at the constraints of the client application code, not by looking at CUDA forums etc.

mihailescu2m commented 7 years ago

@hughperkins I can compile it now successfully, however I get this error when trying to run something:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 18, in swig_import_helper
    return importlib.import_module(mname)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 577, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 914, in create_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.5/dist-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: _ZN10tensorflow7functor12ApplyAdagradIN5Eigen9GpuDeviceEfEclERKS3_NS2_9TensorMapINS2_6TensorIfLi1ELi1EiEELi16ENS2_11MakePointerEEESB_NS7_INS2_15TensorFixedSizeIKfNS2_5SizesIIEEELi1EiEELi16ESA_EENS7_INS8_ISD_Li1ELi1EiEELi16ESA_EE

Any ideas?

Thanks.

EDIT: full errorlog: http://paste.debian.net/985967/

EDIT 2: I am using unsigned int CUdeviceptr since it's a 32bit device, dunno if it's related to that

mihailescu2m commented 7 years ago

@hughperkins another update: successfully compiled and running with gcc 6.3 (I had to add a few patches)

However, it's not working :(

This is the output when running tensorflow/stream_executor/cl/test/test_simple.py http://paste.debian.net/986023/

This is the output of clinfo http://paste.debian.net/986029/

This is the output of tf.device_lib.list_local_devices() (not including the CL device detection printed on stderr) http://paste.debian.net/986030/

hughperkins commented 7 years ago

Seems plausible that it is linked with this line:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] Ignoring gpu device (device: 1, name: Mali-T628, pci bus id: 0000.0000) with Cuda multiprocessor count: 2. The minimum required count is 4. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
cl_driver DeviceAllocate 312985600

Is this the device you are targeting?

mihailescu2m commented 7 years ago

@hughperkins it's not ... The Mali GPU has 6 cores in 2 clusters, and that is a warning that the 2-core cluster is ignored (e.g. there is only gpu0). If I run the same program with TF_MIN_GPU_MULTIPROCESSOR_COUNT=2, it can use both clusters (gpu0 and gpu1 are available)...
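As an illustration of the behaviour described by that log line, here is a minimal sketch of an environment-variable threshold with a default of 4. It is not TensorFlow's implementation: `min_multiprocessor_count` and `device_accepted` are hypothetical names, and the fallback rules for malformed values are an assumption.

```cpp
#include <cstdlib>

// Hypothetical helper: read TF_MIN_GPU_MULTIPROCESSOR_COUNT, falling back to
// the default of 4 mentioned in the log message when the variable is unset,
// empty, or not a clean positive integer (the fallback rule is an assumption).
inline int min_multiprocessor_count(int default_count = 4) {
  const char *value = std::getenv("TF_MIN_GPU_MULTIPROCESSOR_COUNT");
  if (value == nullptr || *value == '\0') {
    return default_count;
  }
  char *end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  if (*end != '\0' || parsed <= 0) {
    return default_count;
  }
  return static_cast<int>(parsed);
}

// Hypothetical helper: a device is kept only if it has at least the minimum
// number of multiprocessors.
inline bool device_accepted(int multiprocessor_count) {
  return multiprocessor_count >= min_multiprocessor_count();
}
```

Under this sketch the 2-core cluster is rejected by default and accepted once the variable is set to 2, which matches what is reported above for gpu1.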

I think the issue is Limit: 0 - it means that the memory is limited to 0... so allocating first chunk fails (You can see there is Total Chunks: 0 for all the bins). Any idea why the limit is 0? What fails to allocate?? (probably in tensorflow/core/common_runtime/bfc_allocator.cc)

EDIT: it seems that tensorflow/core/common_runtime/gpu/process_state.cc creates BFCAllocator with max_size = 1LL << 36 (64GB), where max_size is converted with static_cast<int64> where typedef long long int64 which (on ARM) results in a conversion to a big fat 0.

@hughperkins would you recommend changing int64 from long long to long or unsigned int, or rather just calling the allocator with a value that does not get cast to 0? (since I get this debug message: cl_driver DeviceAllocate 312985600 - then let's say BFCAllocator also gets 312985600?)
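A small sketch of the arithmetic involved. One way `1LL << 36` can end up as 0 on a 32-bit target is by being narrowed to a 32-bit type somewhere along the way (for example `size_t`, which is 32 bits on armhf, whereas `long long` is still 64 bits there). Which conversion is actually responsible in TensorFlow is an assumption here, not something the thread establishes. `narrow_to_32` and `clamp_limit` are hypothetical helpers.

```cpp
#include <cstdint>

// 1LL << 36 is 64 GiB; it fits comfortably in a 64-bit long long.
constexpr long long kRequestedBytes = 1LL << 36;

// Narrowing to a 32-bit unsigned type keeps only the low 32 bits.
// 2^36 is an exact multiple of 2^32, so nothing is left.
constexpr std::uint32_t narrow_to_32(long long bytes) {
  return static_cast<std::uint32_t>(bytes);
}

// Hypothetical helper: cap the requested limit at what the device can offer,
// so the value stays representable whatever type it is later stored in.
constexpr long long clamp_limit(long long requested, long long available) {
  return requested < available ? requested : available;
}
```

Clamping to the figure from the debug message, `clamp_limit(kRequestedBytes, 312985600LL)`, gives a limit that survives narrowing to 32 bits, which is the second option raised in the question.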

mihailescu2m commented 7 years ago

@hughperkins another update :) setting the limit to 312985600 seems to allow some stuff to run, like test_simple.py:

$ python test_simple.py 
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 0 with properties: 
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
W tensorflow/stream_executor/cl/cl_driver.cc:587] creating context when one is currently active; existing: [non-printable bytes]
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 1 with properties: 
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0:   N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 1:   N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Mali-T628, pci bus id: 0000.0000)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] Ignoring gpu device (device: 1, name: Mali-T628, pci bus id: 0000.0000) with Cuda multiprocessor count: 2. The minimum required count is 4. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
cl_driver DeviceAllocate 312985600
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000
I tensorflow/core/common_runtime/direct_session.cc:252] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000

running sess.run a
c: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] c: /job:localhost/replica:0/task:0/gpu:0
b: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] a: /job:localhost/replica:0/task:0/gpu:0
+++ kCudaHostMemoryUseBFC calling BFC allocator for 1LL << 36 memory (replaced with 312985600)
[[  4.   7.   9.]
 [  8.  10.  12.]]
done

HOWEVER, some tests fail:

test_binary_ops.py::test[uint8-mul-a * b] xfail
test_binary_ops.py::test[uint8-div-a / b] xfail
test_binary_ops.py::test[float32-not_equal-np.not_equal(a, b)] PASSED
test_binary_ops.py::test[float32-maximum-np.maximum(a,b)] PASSED
test_binary_ops.py::test[float32-minimum-np.minimum(a,b)] PASSED
test_binary_ops.py::test[float32-pow-np.power(a,b)] PASSED
test_binary_ops.py::test[float32-mul-a * b] PASSED
test_binary_ops.py::test[float32-sub-a - b] PASSED
test_binary_ops.py::test[float32-squared_difference-(a - b) * (a - b)] PASSED
test_binary_ops.py::test[float32-add-a + b] PASSED
test_binary_ops.py::test[float32-div-a / b] PASSED
test_binary_ops.py::test[int32-maximum-np.maximum(a,b)] PASSED
test_binary_ops.py::test[int32-minimum-np.minimum(a,b)] PASSED
test_binary_ops.py::test[int32-mul-a * b] PASSED
test_binary_ops.py::test[int32-sub-a - b] PASSED
test_binary_ops.py::test[int32-squared_difference-(a - b) * (a - b)] PASSED
test_binary_ops.py::test[int32-add-a + b] PASSED
test_binary_ops.py::test[int32-div-a / b] PASSED
test_blas.py::test_blas PASSED
test_gradients.py::test_gradients Aborted

and running linear_regression exits with the error:

F tensorflow/core/framework/tensor.cc:446] Check failed: IsAligned() 
Aborted

Any idea how to fix this? :D
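For context, an alignment check of this kind typically tests whether a buffer's address is a multiple of some required byte boundary. The sketch below shows that idea only. It is not the code from tensor.cc, `is_aligned` is a hypothetical name, and the 16-byte boundary is an assumption, since the thread does not say which alignment TensorFlow expects.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed alignment in bytes; the real requirement is not given in the thread.
constexpr std::size_t kAssumedAlignment = 16;

// Hypothetical helper: true when the address is a multiple of `alignment`.
inline bool is_aligned(const void *ptr,
                       std::size_t alignment = kAssumedAlignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

If device allocations come back at addresses that do not meet the boundary the framework expects, a check like this fails even though the memory itself is usable, so one thing worth looking at is the alignment of the addresses the OpenCL allocation path returns.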

mihailescu2m commented 7 years ago

I removed the IsAligned check completely (no idea what that will do... but I haven't had any segfaults yet).

Linear regression:

average_epoch_times= 6.146 kernel_compile_time 0.319
Training cost= 0.0851199 W= 0.199474 b= 1.16202

Testing... (Mean square loss Comparison)
Testing cost= 0.0981112
Absolute mean square loss difference: 0.0129913