mihailescu2m opened this issue 7 years ago
Just to be sure, I ran the configure script manually, and selected the CUDA installation.
I still have two problems:
1) After compiling and installing coriander (via the build_coriander script) I was able to compile and run the CUDA sample program:
odroid@odroid:~/src/cuda$ ./cuda_sample
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
hostFloats[2] 123
hostFloats[2] 222
hostFloats[2] 444
However, trying it again now, compilation fails with a segmentation fault:
odroid@odroid:~/src/tensorflow$ cocl cuda_sample.cu
cocl args: cuda_sample.cu
+ /home/odroid/src/tensorflow/tf-coriander/soft/llvm-4.0/bin/clang++ -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O0 -S -D__CUDACC__ -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/home/odroid/src/tensorflow/tf-coriander/soft/llvm-4.0/include -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/odroid/src/tensorflow/tf-coriander/soft/llvm-4.0/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -ffunction-sections -fdata-sections -O3 -fexceptions -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/usr/local/include/EasyCL -I/usr/local/include/cocl -I/usr/local/src -I/usr/local/src/EasyCL -I/usr/local/src/EasyCL/thirdparty/clew/include -include /usr/local/include/cocl/cocl.h -include /usr/local/include/cocl/fake_funcs.h -include /usr/local/include/cocl/cocl_deviceside.h -I/usr/local/include ./cuda_sample.cu -o ./cuda_sample-device-noopt.ll
/usr/local/bin/cocl_wrapped: line 393: 12486 Segmentation fault ${CLANG_HOME}/bin/clang++ ${PASSTHRU} -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O0 -S ${ADDFLAGS} -D__CUDACC__ -Wno-gnu-anonymous-struct -Wno-nested-anon-types ${LLVM_COMPILE_FLAGS} -I${COCL_HOME}/include/EasyCL -I${COCL_HOME}/include/cocl -I${COCL_HOME}/src -I${COCL_HOME}/src/EasyCL -I${COCL_HOME}/src/EasyCL/thirdparty/clew/include -include ${COCL_HOME}/include/cocl/cocl.h -include ${COCL_HOME}/include/cocl/fake_funcs.h -include ${COCL_HOME}/include/cocl/cocl_deviceside.h -I${COCL_HOME}/include ${INCLUDES} ${INPUTBASEPATH}${INPUTPOSTFIX} -o ${OUTPUTBASEPATH}-device-noopt.ll
Any ideas why this happens?
2) TF fails to build with:
ERROR: /home/odroid/.cache/bazel/_bazel_odroid/bf26e11cb9ace4c240bd220fa38d9ec4/external/protobuf/BUILD:73:1: undeclared inclusion(s) in rule '@protobuf//:protobuf_lite':
this rule is missing dependency declarations for the following files included by 'external/protobuf/src/google/protobuf/stubs/time.cc':
'/usr/arm-linux-gnueabihf/include/stdc-predef.h'
'/usr/arm-linux-gnueabihf/include/features.h'
'/usr/arm-linux-gnueabihf/include/sys/cdefs.h'
'/usr/arm-linux-gnueabihf/include/bits/wordsize.h'
'/usr/arm-linux-gnueabihf/include/gnu/stubs.h'
'/usr/arm-linux-gnueabihf/include/gnu/stubs-hard.h'
'/usr/arm-linux-gnueabihf/include/wchar.h'
'/usr/arm-linux-gnueabihf/include/stdio.h'
'/usr/lib/gcc/arm-linux-gnueabihf/6/include/stdarg.h'
'/usr/arm-linux-gnueabihf/include/bits/wchar.h'
'/usr/lib/gcc/arm-linux-gnueabihf/6/include/stddef.h'
'/usr/arm-linux-gnueabihf/include/xlocale.h'
'/usr/lib/gcc/arm-linux-gnueabihf/6/include/stdint.h'
'/usr/arm-linux-gnueabihf/include/stdint.h'
'/usr/arm-linux-gnueabihf/include/locale.h'
'/usr/arm-linux-gnueabihf/include/bits/locale.h'
'/usr/arm-linux-gnueabihf/include/ctype.h'
'/usr/arm-linux-gnueabihf/include/bits/types.h'
'/usr/arm-linux-gnueabihf/include/bits/typesizes.h'
'/usr/arm-linux-gnueabihf/include/endian.h'
'/usr/arm-linux-gnueabihf/include/bits/endian.h'
'/usr/arm-linux-gnueabihf/include/bits/byteswap.h'
'/usr/arm-linux-gnueabihf/include/bits/byteswap-16.h'
'/usr/arm-linux-gnueabihf/include/pthread.h'
'/usr/arm-linux-gnueabihf/include/sched.h'
'/usr/arm-linux-gnueabihf/include/time.h'
'/usr/arm-linux-gnueabihf/include/bits/sched.h'
'/usr/arm-linux-gnueabihf/include/bits/time.h'
'/usr/arm-linux-gnueabihf/include/bits/timex.h'
'/usr/arm-linux-gnueabihf/include/bits/pthreadtypes.h'
'/usr/arm-linux-gnueabihf/include/bits/setjmp.h'
'/usr/arm-linux-gnueabihf/include/stdlib.h'
'/usr/arm-linux-gnueabihf/include/bits/waitflags.h'
'/usr/arm-linux-gnueabihf/include/bits/waitstatus.h'
'/usr/arm-linux-gnueabihf/include/sys/types.h'
'/usr/arm-linux-gnueabihf/include/sys/select.h'
'/usr/arm-linux-gnueabihf/include/bits/select.h'
'/usr/arm-linux-gnueabihf/include/bits/sigset.h'
'/usr/arm-linux-gnueabihf/include/sys/sysmacros.h'
'/usr/arm-linux-gnueabihf/include/alloca.h'
'/usr/arm-linux-gnueabihf/include/bits/stdlib-float.h'
'/usr/arm-linux-gnueabihf/include/libio.h'
'/usr/arm-linux-gnueabihf/include/_G_config.h'
'/usr/arm-linux-gnueabihf/include/bits/stdio_lim.h'
'/usr/arm-linux-gnueabihf/include/bits/sys_errlist.h'
'/usr/arm-linux-gnueabihf/include/errno.h'
'/usr/arm-linux-gnueabihf/include/bits/errno.h'
'/usr/arm-linux-gnueabihf/include/linux/errno.h'
'/usr/arm-linux-gnueabihf/include/asm/errno.h'
'/usr/arm-linux-gnueabihf/include/asm-generic/errno.h'
'/usr/arm-linux-gnueabihf/include/asm-generic/errno-base.h'
'/usr/arm-linux-gnueabihf/include/assert.h'
'/usr/arm-linux-gnueabihf/include/string.h'
'/usr/arm-linux-gnueabihf/include/sys/param.h'
'/usr/lib/gcc/arm-linux-gnueabihf/6/include-fixed/limits.h'
'/usr/lib/gcc/arm-linux-gnueabihf/6/include-fixed/syslimits.h'
'/usr/arm-linux-gnueabihf/include/limits.h'
'/usr/arm-linux-gnueabihf/include/bits/posix1_lim.h'
'/usr/arm-linux-gnueabihf/include/bits/local_lim.h'
'/usr/arm-linux-gnueabihf/include/linux/limits.h'
'/usr/arm-linux-gnueabihf/include/bits/posix2_lim.h'
'/usr/arm-linux-gnueabihf/include/bits/xopen_lim.h'
'/usr/arm-linux-gnueabihf/include/signal.h'
'/usr/arm-linux-gnueabihf/include/bits/signum.h'
'/usr/arm-linux-gnueabihf/include/bits/siginfo.h'
'/usr/arm-linux-gnueabihf/include/bits/sigaction.h'
'/usr/arm-linux-gnueabihf/include/bits/sigcontext.h'
'/usr/arm-linux-gnueabihf/include/asm/sigcontext.h'
'/usr/arm-linux-gnueabihf/include/bits/sigstack.h'
'/usr/arm-linux-gnueabihf/include/sys/ucontext.h'
'/usr/arm-linux-gnueabihf/include/bits/sigthread.h'
'/usr/arm-linux-gnueabihf/include/bits/param.h'
'/usr/arm-linux-gnueabihf/include/linux/param.h'
'/usr/arm-linux-gnueabihf/include/asm/param.h'
'/usr/arm-linux-gnueabihf/include/asm-generic/param.h'
'/usr/arm-linux-gnueabihf/include/byteswap.h'.
Target @grpc//:grpc_cpp_plugin failed to build
Use --verbose_failures to see the command lines of failed build steps.
So, does this mean that TF will not get GPU support - that even though coriander translates CUDA to OpenCL, TF will not use it? Is there a way to force TF_NEED_CUDA?
tf-coriander provides an OpenCL implementation of Tensorflow. It doesn't support CUDA devices, except to the extent that such devices also provide an OpenCL-1.2-compatible API.
Just to be sure, I ran the configure script manually, and selected the CUDA installation.
Enabling CUDA is not supported. Please reconfigure from scratch.
Note that you can ignore the message saying it won't use a GPU. Please mentally reinterpret this to mean 'won't use CUDA, but will use OpenCL'.
Thanks, I will reconfigure and recompile. I thought CUDA was required (libs installed and TF support enabled) because I understood you do CUDA-to-OpenCL translation (TF -> CUDA -> coriander OpenCL); glad to be wrong :)
@hughperkins - after re-configuring, I still get similar errors to the ones above.
Somehow, bazel ends up using /usr/arm-linux-gnueabihf/include instead of /usr/include.
Another thing which is not clear: do I need CUDA installed or not? I've seen some COCL files that reference /usr/local/cuda-8.0 -- will this work with CUDA 6.5? (I can just symlink /usr/local/cuda-8.0 to /usr/local/cuda-6.5 if that would make things simpler, as long as CUDA 6.5 is OK.)
Thanks.
The CUDA Toolkit is not only not required, but its presence will actually break the compile process :)
Can you uninstall all CUDA Toolkit versions, and then redo from the configure step please?
(By the way, which files do you see that you feel are referencing /usr/local/cuda-8.0?)
This makes sense - after uninstalling CUDA I could recompile again using cocl without getting the segmentation fault. I also managed to find the correct file to add my include paths to, which got rid of the error I was getting before. The CUDA references I saw were just while looking for include paths; they didn't actually break anything for me. I still have some errors when compiling some 3rd-party packages, but that's for next week :)
Thanks
Interesting to see if this runs on the odroid platform!
@hughperkins I am now getting this error:
In file included from tensorflow/stream_executor/cl/cl_driver.cc:28:0:
./tensorflow/stream_executor/lib/casts.h: In instantiation of 'Dest perftools::gputools::port::bit_cast(const Source&) [with Dest = long long int; Source = void*]':
tensorflow/stream_executor/cl/cl_driver.cc:1022:61: required from here
./tensorflow/stream_executor/lib/casts.h:90:3: error: static assertion failed: src and dst types must have equal sizes
static_assert(sizeof(Dest) == sizeof(Source),
Looks like it's coming from CUdeviceptr pointer = port::bit_cast<CUdeviceptr>(location); where void * and CUdeviceptr have different sizes (on armhf)... any idea which one has to be fixed?
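For reference, here is a minimal reduction of the failing constraint (my own sketch, not the actual casts.h source), showing why the size pair on armhf trips the static_assert:

#include <cstdio>
#include <cstring>

// Sketch of the bit_cast pattern used in casts.h: a bit-copy that insists
// both types occupy the same number of bytes.
template <class Dest, class Source>
Dest bit_cast(const Source &source) {
    static_assert(sizeof(Dest) == sizeof(Source),
                  "src and dst types must have equal sizes");
    Dest dest;
    std::memcpy(&dest, &source, sizeof(dest));
    return dest;
}

typedef long long CUdeviceptr;  // coriander's current definition

int main() {
    std::printf("sizeof(void*) = %zu, sizeof(CUdeviceptr) = %zu\n",
                sizeof(void *), sizeof(CUdeviceptr));
    // On x86-64 both are 8, so bit_cast<CUdeviceptr>(somePointer) compiles.
    // On armhf void* is 4 bytes while long long is 8, so the static_assert
    // fires for exactly the instantiation used in cl_driver.cc.
    return 0;
}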
Interesting. So, the type of location is a constraint which is inviolable, fixed by the client application source code. And so what we need to change is the Coriander definition of CUdeviceptr, which is here:
https://github.com/hughperkins/coriander/blob/master/include/cocl/cocl_memory.h#L40
typedef long long CUdeviceptr;
I don't remember what constraints led me to choose this type, but note that there are conceptually two sets of constraints:
In any case, given the void *-castable constraint, you could try something like:
class CUDevice {
};
typedef CUDevice *CUdeviceptr;
There might be a few cascading changes required, since the current declarations might assume a non-pointer type, and need tweaking somehow.
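To make the possible cascading changes concrete, here is a hypothetical illustration (my sketch, not existing coriander code) of the kind of cast that integer-style offset arithmetic would need once CUdeviceptr is a pointer type:

#include <cstddef>

class CUDevice {
};
typedef CUDevice *CUdeviceptr;  // the pointer-typed definition suggested above

// With this definition sizeof(CUdeviceptr) == sizeof(void*) on every target, so
// port::bit_cast<CUdeviceptr>(location) compiles on 32-bit and 64-bit alike.

// Hypothetical helper: code that previously treated CUdeviceptr as an integer and
// added byte offsets would need explicit casts, for example:
CUdeviceptr offsetBy(CUdeviceptr base, std::size_t offsetBytes) {
    return reinterpret_cast<CUdeviceptr>(
        reinterpret_cast<char *>(base) + offsetBytes);
}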
I am trying unsigned int for CUdeviceptr, since that's the one used by CUDA as well; hope it won't break other things (coriander)...
Please don't look at things that assume an EULA click-through etc. Any type you choose should be justified by looking at the constraints of the client application code, not by looking at CUDA forums etc.
@hughperkins I can compile it now successfully, however I get this error when trying to run something:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 18, in swig_import_helper
return importlib.import_module(mname)
File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 986, in _gcd_import
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
File "<frozen importlib._bootstrap>", line 577, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 914, in create_module
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
ImportError: /usr/local/lib/python3.5/dist-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: _ZN10tensorflow7functor12ApplyAdagradIN5Eigen9GpuDeviceEfEclERKS3_NS2_9TensorMapINS2_6TensorIfLi1ELi1EiEELi16ENS2_11MakePointerEEESB_NS7_INS2_15TensorFixedSizeIKfNS2_5SizesIIEEELi1EiEELi16ESA_EENS7_INS8_ISD_Li1ELi1EiEELi16ESA_EE
Any ideas?
Thanks.
EDIT: full error log: http://paste.debian.net/985967/
EDIT 2: I am using unsigned int for CUdeviceptr since it's a 32-bit device; not sure if it's related to that.
@hughperkins another update: successfully compiled and running with gcc 6.3 (I had to add a few patches)
However, it's not working :(
This is the output when running tensorflow/stream_executor/cl/test/test_simple.py:
http://paste.debian.net/986023/
This is the output of clinfo:
http://paste.debian.net/986029/
This is the output of tf.device_lib.list_local_devices() (not including the CL device detection printed on stderr):
http://paste.debian.net/986030/
Seems plausible that it is linked with this line:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] Ignoring gpu device (device: 1, name: Mali-T628, pci bus id: 0000.0000) with Cuda multiprocessor count: 2. The minimum required count is 4. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT. cl_driver DeviceAllocate 312985600
Is this the device you are targeting?
@hughperkins it's not...
The Mali GPU has 6 cores in 2 clusters, and that is a warning that the 2-core cluster is ignored (i.e. there is only gpu0). If I run the same program with TF_MIN_GPU_MULTIPROCESSOR_COUNT=2, it can use both clusters (gpu0 and gpu1 are available)...
I think the issue is Limit: 0 - it means that the memory is limited to 0... so allocating the first chunk fails (you can see there is Total Chunks: 0 for all the bins).
Any idea why the limit is 0? What fails to allocate? (Probably in tensorflow/core/common_runtime/bfc_allocator.cc.)
EDIT: it seems that tensorflow/core/common_runtime/gpu/process_state.cc creates the BFCAllocator with max_size = 1LL << 36 (64GB), where max_size is converted with static_cast<int64> (where typedef long long int64), which (on ARM) results in a conversion to a big fat 0.
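For what it's worth, long long is still 64 bits on armhf, so the zero presumably appears wherever the value gets squeezed through a 32-bit type (size_t, for instance, is 4 bytes there). A quick illustration of the truncation:

#include <cstdint>
#include <cstdio>

int main() {
    long long max_size = 1LL << 36;  // 64 GiB, as hard-coded for the allocator
    // The low 32 bits of 1LL << 36 are all zero, so any 32-bit destination gets 0.
    std::uint32_t truncated = static_cast<std::uint32_t>(max_size);
    std::printf("max_size = %lld, truncated = %u\n", max_size, truncated);
    return 0;
}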
@hughperkins would you recommend changing int64 from long long to long or unsigned int, or rather just calling the allocator with a value that does not get cast to 0? (Since I get this debug message: cl_driver DeviceAllocate 312985600 - then let's say BFCAllocator also gets 312985600?)
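If the second option is preferable, one way to do it (just a sketch with a hypothetical helper, not an existing TF function) is to clamp the requested limit to both what fits in size_t and what the driver actually reports as allocatable:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: pick an allocator limit that survives a 32-bit size_t
// and never exceeds what the device reports it can allocate.
std::size_t ClampAllocatorLimit(std::uint64_t requested_bytes,
                                std::uint64_t device_free_bytes) {
    std::uint64_t limit = std::min(requested_bytes, device_free_bytes);
    std::uint64_t size_t_max = std::numeric_limits<std::size_t>::max();
    return static_cast<std::size_t>(std::min(limit, size_t_max));
}

// e.g. ClampAllocatorLimit(1ULL << 36, 312985600ULL) == 312985600 on a 32-bit build.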
@hughperkins another update :) setting the limit to 312985600 seems to allow some stuff to run, like test_simple.py:
$ python test_simple.py
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 0 with properties:
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
W tensorflow/stream_executor/cl/cl_driver.cc:587] creating context when one is currently active; existing: �\c��D"�ice:
OpenCL platform: ARM Platform
OpenCL device: Mali-T628
I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Found device 1 with properties:
name: Mali-T628
major: -1 minor: -1 memoryClockRate (GHz) 600
pciBusID 0000.0000
Total memory: 1.95GiB
Free memory: 498.49MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:877] cannot enable peer access from device ordinal 1 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1011] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 0: N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1021] 1: N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1083] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Mali-T628, pci bus id: 0000.0000)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1070] Ignoring gpu device (device: 1, name: Mali-T628, pci bus id: 0000.0000) with Cuda multiprocessor count: 2. The minimum required count is 4. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
cl_driver DeviceAllocate 312985600
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000
I tensorflow/core/common_runtime/direct_session.cc:252] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Mali-T628, pci bus id: 0000.0000
running sess.run a
c: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] c: /job:localhost/replica:0/task:0/gpu:0
b: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:819] a: /job:localhost/replica:0/task:0/gpu:0
+++ kCudaHostMemoryUseBFC calling BFC allocator for 1LL << 36 memory (replaced with 312985600)
[[ 4. 7. 9.]
[ 8. 10. 12.]]
done
HOWEVER, some tests fail:
test_binary_ops.py::test[uint8-mul-a * b] xfail
test_binary_ops.py::test[uint8-div-a / b] xfail
test_binary_ops.py::test[float32-not_equal-np.not_equal(a, b)] PASSED
test_binary_ops.py::test[float32-maximum-np.maximum(a,b)] PASSED
test_binary_ops.py::test[float32-minimum-np.minimum(a,b)] PASSED
test_binary_ops.py::test[float32-pow-np.power(a,b)] PASSED
test_binary_ops.py::test[float32-mul-a * b] PASSED
test_binary_ops.py::test[float32-sub-a - b] PASSED
test_binary_ops.py::test[float32-squared_difference-(a - b) * (a - b)] PASSED
test_binary_ops.py::test[float32-add-a + b] PASSED
test_binary_ops.py::test[float32-div-a / b] PASSED
test_binary_ops.py::test[int32-maximum-np.maximum(a,b)] PASSED
test_binary_ops.py::test[int32-minimum-np.minimum(a,b)] PASSED
test_binary_ops.py::test[int32-mul-a * b] PASSED
test_binary_ops.py::test[int32-sub-a - b] PASSED
test_binary_ops.py::test[int32-squared_difference-(a - b) * (a - b)] PASSED
test_binary_ops.py::test[int32-add-a + b] PASSED
test_binary_ops.py::test[int32-div-a / b] PASSED
test_blas.py::test_blas PASSED
test_gradients.py::test_gradients Aborted
and running linear_regression exits with the error:
F tensorflow/core/framework/tensor.cc:446] Check failed: IsAligned()
Aborted
Any idea to fix this? :D
I removed the IsAligned check completely (no idea what that will do... but I didn't get any segmentation faults yet).
Linear regression: average_epoch_times= 6.146 kernel_compile_time 0.319 Training cost= 0.0851199 W= 0.199474 b= 1.16202
Testing... (Mean square loss Comparison) Testing cost= 0.0981112 Absolute mean square loss difference: 0.0129913
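For context on the check that was removed: Tensor::IsAligned() essentially asserts that the tensor's data pointer is aligned to the boundary Eigen's vectorized kernels expect, so dropping it means misaligned buffers may be handed to kernels that assume alignment - it can work, but it is not guaranteed safe. Roughly (my paraphrase, not the actual TF code):

#include <cstddef>
#include <cstdint>

// Rough paraphrase of the alignment assertion: the buffer's base address must be
// a multiple of the required alignment (EIGEN_MAX_ALIGN_BYTES in TensorFlow).
bool IsAlignedTo(const void *ptr, std::size_t alignment_bytes) {
    return alignment_bytes == 0 ||
           reinterpret_cast<std::uintptr_t>(ptr) % alignment_bytes == 0;
}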
Hi,
I've installed coriander successfully, and am now trying to get TF installed on a non-NVIDIA OpenCL 1.2 GPU (Linux). When running util/run_configure.sh, I see that TF does not select CUDA:
So, does this mean that TF will not get GPU support - that even though coriander translates CUDA to OpenCL, TF will not use it? Is there a way to force TF_NEED_CUDA?