ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

BF_STATUS_DEVICE_ERROR #121

Closed Leeying1235 closed 6 years ago

Leeying1235 commented 6 years ago

Dear authors: I'm a beginner with your framework. I installed bifrost successfully, but when I try to import bifrost I get BF_STATUS_DEVICE_ERROR. The details are below:

Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bifrost
WARNING: Install simplejson for better performance
WARNING: Install simplejson for better performance
cuda.cpp:88 cudaGetErrorString(cuda_ret) = unknown error
cuda.cpp:88 Condition failed: cuda_ret == cudaSuccess
cuda.cpp:88 error 66: BF_STATUS_DEVICE_ERROR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/bifrost-0.8.0-py2.7.egg/bifrost/__init__.py", line 35, in <module>
    import pipeline
  File "/usr/local/lib/python2.7/dist-packages/bifrost-0.8.0-py2.7.egg/bifrost/pipeline.py", line 48, in <module>
    bf.device.set_devices_no_spin_cpu()
  File "/usr/local/lib/python2.7/dist-packages/bifrost-0.8.0-py2.7.egg/bifrost/device.py", line 49, in set_devices_no_spin_cpu
    _check(_bf.bfDevicesSetNoSpinCPU())
  File "/usr/local/lib/python2.7/dist-packages/bifrost-0.8.0-py2.7.egg/bifrost/libbifrost.py", line 103, in _check
    raise RuntimeError(status_str)
RuntimeError: BF_STATUS_DEVICE_ERROR

Could you tell me what's wrong and how I can get this problem solved? Thanks, Leeying

telegraphic commented 6 years ago

Hi @Leeying1235 -- this looks like the bifrost code is having trouble finding your CUDA installation. Do you have CUDA and an NVIDIA GPU installed? If so, I think it's probably an issue with libraries.

You may need to set some environment variables in your ~/.bashrc (or equivalent). Here's a snippet from mine:

# BIFROST
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/lib:$LD_LIBRARY_PATH"
export PATH=/usr/local/cuda-8.0/bin:$PATH
export BIFROST_INCLUDE_PATH=/usr/local/include/bifrost

# CUDA
export CUDA_ROOT="/usr/local/cuda-8.0"
export CUDA_INC_DIR="/usr/local/cuda-8.0/include"
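
After editing ~/.bashrc, the new values only apply to freshly started shells, so re-source the file and confirm the toolkit is actually visible. A quick sanity check, assuming the paths from the snippet above:

source ~/.bashrc
which nvcc             # should print something like /usr/local/cuda-8.0/bin/nvcc
nvcc --version         # toolkit release
echo $LD_LIBRARY_PATH  # should include /usr/local/cuda/lib64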

There are also some common problems listed here: http://ledatelescope.github.io/bifrost/Common-installation-and-execution-problems.html

Please let me know if you can figure it out, and I'll add some text to the common problems page.

Leeying1235 commented 6 years ago

Hi @telegraphic, I have already installed CUDA and there are NVIDIA GPUs in my computer. After running the commands 'nvcc -V' and 'lspci | grep -i vga.*nvidia', I got this:

Command: nvcc -V
Output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

Command: lspci | grep -i vga.*nvidia
Output:
01:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)
02:00.0 VGA compatible controller: NVIDIA Corporation GK110 [GeForce GTX 780] (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GK110 [GeForce GTX 780] (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation GK110 [GeForce GTX 780] (rev a1)
84:00.0 VGA compatible controller: NVIDIA Corporation GK110 [GeForce GTX 780] (rev a1)

The environment variables PATH and LD_LIBRARY_PATH are already set on my computer, the same as yours, but it doesn't work. Following your suggestions I also tried adding these variables to ~/.bashrc, but that doesn't work either. Below are the values of PATH and LD_LIBRARY_PATH on my computer:

PATH=/usr/local/lib:/usr/local/cuda-8.0/bin:/usr/local/cuda-8.0/bin:/home/ly/local/jdk1.8.0_45/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda/bin:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin

LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64

Looking forward to your reply! Thanks

jaycedowell commented 6 years ago

What does nvidia-smi yield?

telegraphic commented 6 years ago

@Leeying1235 Ok, my next guess is architecture -- I have not tested the code on anything older than a Maxwell card (980). Could you recompile with line 12 of user.mk uncommented, commenting out line 14 instead? https://github.com/ledatelescope/bifrost/blob/master/user.mk#L12
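
For reference, the rebuild after editing user.mk would look roughly like this (a sketch; the exact targets come from bifrost's top-level Makefile, and your install prefix may differ):

# run from the top of the bifrost source tree after editing user.mk
make clean
make -j
sudo make install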

Leeying1235 commented 6 years ago

@telegraphic problem still exists.

Leeying1235 commented 6 years ago

@jaycedowell The output is below:

Wed Oct 10 09:01:00 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 780     Off  | 0000:02:00.0     N/A |                  N/A |
| 28%   49C    P0    N/A /  N/A |      0MiB /  3020MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 780     Off  | 0000:03:00.0     N/A |                  N/A |
| 26%   43C    P0    N/A /  N/A |      0MiB /  3020MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 780     Off  | 0000:83:00.0     N/A |                  N/A |
| 26%   43C    P0    N/A /  N/A |      0MiB /  3020MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 780     Off  | 0000:84:00.0     N/A |                  N/A |
|  0%   43C    P0    N/A /  N/A |      0MiB /  3020MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2                  Not Supported                                         |
|    3                  Not Supported                                         |
+-----------------------------------------------------------------------------+
jaycedowell commented 6 years ago

Can you run other CUDA codes, like something from the CUDA code samples library, without error?
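
For example, deviceQuery from the samples is a quick sanity check (a sketch; the samples path assumes a default CUDA 8.0 install):

cd /usr/local/cuda-8.0/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery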

Leeying1235 commented 6 years ago

@jaycedowell It installed fine though. I went into the samples and found 1_Utilities/deviceQuery to use as a test. It compiled fine. Then this happened:

$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

jaycedowell commented 6 years ago

It looks like there is some mismatch between the kernel driver version and what CUDA expects. If you can get this resolved and deviceQuery running without failing, I suspect that bifrost will stop throwing BF_STATUS_DEVICE_ERROR on import.
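
A quick way to compare the two sides (a sketch using standard driver/toolkit queries):

nvidia-smi | head -n 3            # driver version the userspace tools report
cat /proc/driver/nvidia/version   # version of the loaded kernel module
nvcc --version                    # CUDA toolkit / runtime version

If the loaded kernel module is older than what the runtime needs (for example after a driver update without a reboot), you get exactly this "driver version is insufficient" failure; a reboot or a driver reinstall usually brings the two back in sync.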

Leeying1235 commented 6 years ago

@jaycedowell On another computer I reinstalled CUDA and deviceQuery now runs without failing. But when I recompile bifrost to install it, some other problems appear:

lw@lw-HP-ENVY-Notebook:~/desktop/bifrost-master$ sudo make -j
make -C src all
make[1]: Entering directory '/home/lw/desktop/bifrost-master/src'
/bin/sh: 1: nvcc: not found
/bin/sh: 1: cuobjdump: not found
/bin/sh: 1: nvcc: not found
Makefile:163: recipe for target '_cuda_device_link.o' failed
make[1]: *** [_cuda_device_link.o] Error 127
make[1]: *** Waiting for unfinished jobs....
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'guantize.o' failed
make[1]: *** [guantize.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'gunpack.o' failed
make[1]: *** [gunpack.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'transpose.o' failed
make[1]: *** [transpose.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'reduce.o' failed
make[1]: *** [reduce.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'fdmt.o' failed
make[1]: *** [fdmt.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'linalg.o' failed
make[1]: *** [linalg.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'linalg_kernels.o' failed
make[1]: *** [linalg_kernels.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'fir.o' failed
make[1]: *** [fir.o] Error 127
/bin/sh: 1: nvcc: not found
autodep.mk:56: recipe for target 'fft.o' failed
make[1]: *** [fft.o] Error 127
make[1]: Leaving directory '/home/lw/desktop/bifrost-master/src'
Makefile:15: recipe for target 'libbifrost' failed
make: *** [libbifrost] Error 2
jaycedowell commented 6 years ago

This looks like a path problem. You probably need to set CUDA_HOME in user.mk to the installation path for CUDA and check your CUDA_ROOT and CUDA_INC_DIR environment variables.
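
As an illustration (the /usr/local/cuda-8.0 prefix is an assumption; adjust it to wherever CUDA actually lives on that machine), the environment could be set up like this before running make:

# assuming CUDA 8.0 is installed under /usr/local/cuda-8.0
export CUDA_ROOT=/usr/local/cuda-8.0
export CUDA_INC_DIR=$CUDA_ROOT/include
export PATH=$CUDA_ROOT/bin:$PATH
which nvcc   # should now resolve; note that sudo may reset PATH (secure_path), which can hide nvcc from `sudo make`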

Leeying1235 commented 6 years ago

@jaycedowell Yes, you are right. I fixed it by adding the CUDA_ROOT and CUDA_INC_DIR variables.

Leeying1235 commented 6 years ago

@jaycedowell @telegraphic I solved the issue I hit before by reinstalling CUDA and the CUDA drivers. I think the main cause was the driver mismatch, so @jaycedowell you were right. @telegraphic @jaycedowell Thank you for your help! I will close this issue.