pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/

rnn: ./src/cuda.c:36: check_error: Assertion `0' failed. #98

Closed loretoparisi closed 7 years ago

loretoparisi commented 7 years ago

I'm running cuda8.0 on Ubuntu16.04LTS.

root@29c797bc416c:~/darknet# ./darknet rnn train cfg/rnn.train.cfg -file ./t8.shakespeare.txt 
rnn
layer     filters    size              input                output
    0 RNN Layer: 256 inputs, 1024 outputs
        CUDA Error: out of memory
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)
root@29c797bc416c:~/darknet# nvidia-smi
Fri Jul 21 14:11:16 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   62C    P0   104W / 125W |   3554MiB /  4036MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Abduoit commented 7 years ago

I have the same issue. Any solution?

TonyChouZJU commented 7 years ago

You have other applications occupying the GPU memory. You have a 4 GB GPU, but 3.5 GB of it is already in use.
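
For instance, something like this (a rough sketch; the exact nvidia-smi query fields can differ slightly between driver versions) shows which processes are holding GPU memory before darknet is started:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# fallback: list which processes have the NVIDIA device files open
sudo fuser -v /dev/nvidia*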

loretoparisi commented 7 years ago

@TonyChouZJU you are right: 3554MiB / 4036MiB. I guess this OOM is simply due to a lack of free GPU memory at allocation time, since I do not have any other process running on this GPU in this Docker instance. In TensorFlow there is a way to limit GPU memory allocation (it's a well-known behavior/problem that TF tries to allocate as much memory as possible) using per_process_gpu_memory_fraction, etc. I wonder if @pjreddie's darknet could do something similar.

loretoparisi commented 7 years ago

@Abduoit were you using a 4GB GPU as well?

Abduoit commented 7 years ago

@loretoparisi no, I was using an NVIDIA Quadro K600, which has 1 GB of GPU memory. I'll try it with an NVIDIA GeForce 1080 (8 GB GPU) and let you know.

loretoparisi commented 7 years ago

@Abduoit thanks, that would be a great test to understand whether it's an OOM-related issue!

desireB commented 7 years ago

I'm having the same issue. I have a 1060 with 6GB of memory.

desireB commented 7 years ago

@Abduoit, have you solved the issue you had?

Labyrins commented 7 years ago

I have the same issue. I have spent almost 2 days trying to solve it, but I couldn't. :( I'm working on an AWS EC2 P2 instance with Ubuntu 16.04 and a Tesla K80.

loretoparisi commented 7 years ago

@Labyrins what is your K80 memory usage (nvidia-smi output)?

Labyrins commented 7 years ago

@loretoparisi At the start of training, memory usage was not high (maybe 10-20% of available memory). As training goes on, memory usage rises quickly and overflows.
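
One way to watch this happen (a minimal sketch, run in a second terminal while training) is to log GPU memory once per second:

# print used/total GPU memory every second; stop with Ctrl-C
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1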

Labyrins commented 7 years ago

Anyway, I could handle this issue in my case. I'm not sure whether my way works in other people's cases. In my case, I made some modifications to prevent the out-of-memory error.

With the above changes, I could avoid the memory issue. I hope this comment helps you guys.

FYI, I did not use OPENCV.

loretoparisi commented 7 years ago

@Labyrins so it's not only a problem of GPU memory but also of the CUDA architecture, as specified by the ARCH setting in the Makefile.

Labyrins commented 7 years ago

@loretoparisi Yes, check that the ARCH setting matches your GPU, and also check your cfg file.
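
A rough sketch of that check (the compute_cap query field only exists on fairly recent drivers; on older setups the deviceQuery sample shipped with the CUDA toolkit reports the same thing):

# look up the compute capability of the installed GPU
nvidia-smi --query-gpu=name,compute_cap --format=csv
# then put a matching line in the Makefile before rebuilding, e.g. for a
# Tesla K80 (compute capability 3.7):
#   ARCH= -gencode arch=compute_37,code=sm_37
make clean && make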

dmarkov00 commented 7 years ago

Hey guys, I had the same issue; I ran it with optirun and it worked. Example: optirun ./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg

loretoparisi commented 7 years ago

@dmarkov00 did you have OpenCV installed? What does optirun do?

dmarkov00 commented 7 years ago

@loretoparisi I ran it without OpenCV. As far as I understand, optirun calls/points to the NVIDIA driver/interface, or makes sure the CUDA part is executed on the NVIDIA GPU (not exactly sure).
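
For context (a hedged note): optirun is part of Bumblebee, which on Optimus laptops powers up the discrete NVIDIA GPU and runs the given command on it, so darknet actually sees the CUDA device. A quick check that this is happening:

optirun nvidia-smi   # should list the discrete NVIDIA GPU rather than erroring out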

loretoparisi commented 7 years ago

@Labyrins thank you! I have followed your hints:

I have set the arch for the K80 now (I have moved to this GPU) and activated CUDNN as well:

GPU=1
CUDNN=1
OPENCV=0
OPENMP=0
DEBUG=0

# choose arch here: https://developer.nvidia.com/cuda-gpus
#ARCH= -gencode arch=compute_20,code=[sm_20,sm_21] \
      -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52]

# Tesla K80
ARCH= -gencode arch=compute_37,code=[sm_37]

This is the RNN config, where I have changed batch from 128 to 64:

[net]
subdivisions=1
inputs=256
batch = 64
momentum=0.9
decay=0.001
max_batches = 2000
time_steps=576
learning_rate=0.1
policy=steps
steps=1000,1500
scales=.1,.1

[rnn]
batch_normalize=1
output = 1024
hidden=1024
activation=leaky

[rnn]
batch_normalize=1
output = 1024
hidden=1024
activation=leaky

[rnn]
batch_normalize=1
output = 1024
hidden=1024
activation=leaky

[connected]
output=256
activation=leaky

[softmax]

[cost]
type=sse

where I have

# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   81C    P0   122W / 149W |   7642MiB / 11439MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      9640    C   ./darknet                                     5667MiB |
|    0     27214    C   python3                                       1962MiB |

and


# ./darknet rnn train cfg/rnn.train.cfg -file ./t8.shakespeare.txt 
rnn
layer     filters    size              input                output
    0 RNN Layer: 256 inputs, 1024 outputs
        connected                             256  ->  1024
        connected                            1024  ->  1024
        connected                            1024  ->  1024
Unused field: 'hidden = 1024'
    1 RNN Layer: 1024 inputs, 1024 outputs
        connected                            1024  ->  1024
        connected                            1024  ->  1024
        connected                            1024  ->  1024
Unused field: 'hidden = 1024'
    2 RNN Layer: 1024 inputs, 1024 outputs
        connected                            1024  ->  1024
        connected                            1024  ->  1024
        connected                            1024  ->  1024
Unused field: 'hidden = 1024'
    3 connected                            1024  ->   256
    4 softmax                                         256
    5 cost                                            256
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.001, Inputs: 256 36864 576
1: 0.993232, 0.993232 avg, 0.100000 rate, 5.930198 seconds, 0.006754 epochs
2: 0.982678, 0.992177 avg, 0.100000 rate, 5.314698 seconds, 0.013508 epochs
3: 0.956013, 0.988560 avg, 0.100000 rate, 5.327953 seconds, 0.020262 epochs
4: 1.024190, 0.992123 avg, 0.100000 rate, 5.323785 seconds, 0.027016 epochs

So it seems to work now!

loretoparisi commented 7 years ago

@Labyrins hey I have a strange output. Basically the rnn generate command:

./darknet rnn generate cfg/rnn.cfg ./my_model.weights -srand 0 -len 1000 -seed love

returns part of the training text verbatim (i.e. not newly generated text) starting from the seed.

My training command was like

./darknet rnn train cfg/rnn.train.cfg -file /root/my_text_dataset.txt

I have used the rnn.train.cfg configuration above, with batch=64 and steps=1000,1500.

Btw, I opened a new ticket here

peterbyzhang commented 6 years ago

I just encountered this problem and solved it. My solution was

sudo rm -rf ~/.nv

and then reboot.

loretoparisi commented 6 years ago

@peterbyzhang do you mean the assertion failed with the rnn train? I'm not sure you need to reboot; maybe you can just reset the GPU with sudo nvidia-smi --gpu-reset -i 0 instead of doing a reboot.
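
A minimal sketch of that reset, assuming device index 0 and a GPU/driver that supports it (the reset only succeeds when no process is still using the device):

# check that nothing is holding the GPU, then reset it
sudo fuser -v /dev/nvidia*
sudo nvidia-smi --gpu-reset -i 0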

peterbyzhang commented 6 years ago

As I played with darknet, from time to time this .nv folder would be created in /home/, and not only would darknet fail, but PyTorch would report some kind of CUDA internal failure as well. I was running 'detector test', the detection command for darknet. This problem is not restricted to rnn train; it seems to be a general problem with the GPU. Let me know if removing .nv works for you.

icyhearts commented 6 years ago

Maybe this will help you: run the executable as root, i.e. sudo ./darknet [your parameters]. I was inspired by this article: https://devtalk.nvidia.com/default/topic/760872/ubuntu-12-04-error-cudagetdevicecount-returned-30/ The reason may be that the GPU is a device which can only be used by root.

harrysimply commented 6 years ago

@peterbyzhang

GTX1050(2G)+ubuntu16.04

I just solved this problem with your answer! sudo rm -rf ~/.nv works. Thanks!

affian commented 6 years ago

A restart and an adjustment of batch (=64) did it for me. However, my model is still training; I hope it's working.

pranavssgit commented 6 years ago

I am trying to train my own dataset and CUDA 8.0 is giving an error:

CUDA Error: out of memory
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)

and the nvidia-smi details are:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 4000         Off  | 00000000:09:00.0  On |                  N/A |
| 40%   51C   P12    N/A /  N/A |    242MiB /  1977MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:22:00.0 Off |                    0 |
| 23%   42C    P8    24W / 235W |      1MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

pmusta commented 6 years ago

EDIT: Found the solution: I had an old version of the CUDA driver (384.130); after updating to 390 this started working. :)

I'm having the same problem. I'm getting darknet: ./src/cuda.c:36: check_error: Assertion `0' failed. when using a batch greater than 4. When using batch=4 (or lower) and subdivisions=1, the model loads but then immediately saves the final weights, so it actually does not do anything (which is weird). The log in that case is:

Learning Rate: 0.004, Momentum: 0.9, Decay: 0.0005

... and then nothing (when it's working it should continue with Resizing, 608, Loaded: 0.000053 seconds, etc.).

I tried both 608 and 416 for the width and height in the cfg. I tried changing the ARCH in the Makefile to ARCH= -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61 (this was in the Makefile of the other darknet at https://github.com/AlexeyAB/darknet/blob/master/Makefile). I tried sudo rm -rf ~/.nv. I tried with CUDNN and OPENCV and without them; no difference. I also tried running with sudo.

Setup: CUDA 9.0 with my own dataset, GeForce GTX 1080, Ubuntu 16.04, yolov3. I think I have managed to run this on another server with a Tesla GPU (using the same data)? That was about 4 months ago, so it's either some change in darknet, the different GPU model, or I'm screwing something up this time without knowing :) Anyway, any help is much appreciated! nvidia-smi gives:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   42C    P0    36W / 180W |      0MiB /  8113MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

automata commented 5 years ago

Changing the batch size (from 64 to 32) in the yolov3.cfg file solved the issue on a GTX 1060 (Ubuntu 16.04).

SonaHarutyunyan commented 5 years ago

I found this note at https://github.com/AlexeyAB/darknet: "Note: if error Out of memory occurs then in .cfg-file you should increase subdivisions=16, 32 or 64".
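
A rough sketch of that change, assuming a hypothetical cfg at cfg/yolov3.cfg with the usual batch=64 / subdivisions=16 lines; raising subdivisions shrinks the mini-batch that is actually resident on the GPU (batch/subdivisions images) without changing the effective batch size:

sed -i 's/^subdivisions=.*/subdivisions=32/' cfg/yolov3.cfg
grep -E '^(batch|subdivisions)=' cfg/yolov3.cfg   # verify the edit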

Shivesh4680 commented 5 years ago

Hi, I am trying to train a model on AWS EC2.

I am able to compile and build darknet code with GPU=1 and CUDNN=1.

But when I start training the model with the "./darknet detector train <.data file> <.cfg file>" command, I get the following error.

CUDA Error: no CUDA-capable device is detected
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)

Though the above error says that it doesn't have CUDA, when I check the CUDA version by typing "nvcc --version", it gives me the following:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Could someone help me here? I would be very grateful for your help.

AlexeyAB commented 5 years ago

@Shivesh4680

It seems you are using an Amazon instance without a GPU. Can you show the output of the nvidia-smi command? Try to choose the p3.2xlarge instance (Tesla V100).

Follow this manual on how to train Yolo on Amazon EC2: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412333942
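
A quick sanity check before training (a minimal sketch) is to confirm the instance actually exposes an NVIDIA GPU and a working driver:

lspci | grep -i nvidia   # should list an NVIDIA device; empty output means no GPU on the instance
nvidia-smi               # should print the driver/GPU table, not an error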

Shivesh4680 commented 5 years ago

Hi @AlexeyAB ,

When I type the command "nvidia-smi", I get the following output:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

AlexeyAB commented 5 years ago

@Shivesh4680 So there is no NVIDIA GPU driver, and most likely there is no GPU. Choose another instance: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412333942

Shivesh4680 commented 5 years ago

Thanks @AlexeyAB . It worked.

Could you help me? How do I train a model on YUV images? I think darknet trains on RGB images by default. I have RGB images and I want to convert them to YUV and train a model on them.

sachinruk commented 5 years ago

For anyone training on AWS, see @loretoparisi's answer. You will most likely need to change both the Makefile and the batch/subdivisions parameters in the yolov3.cfg file. The images will not fit on the GPU otherwise.

dbersan commented 5 years ago

export CUDA_CACHE_PATH=/tmp/cuda_mem

Solved for me.

kavianhabib commented 5 years ago

Reduce your batch size. I used yolov3.cfg with batch = 24 instead of batch = 64 and it solved my problem.

Mostly you run out of memory because your batch size is too high.

I had the same issue while training SyntaxNet using Tensorflow and the solution for that problem was the same as this one.

AlanZxx commented 5 years ago

Maybe your batch is set to exceed the maximum batch that your GPU can accept. The first time, I used batch=128 and subdivision=64 and got an "out of memory" error. My nvidia-smi shows 689/4096 (my GPU is a GTX 1050 Ti). When I changed the batch setting in cfg/yolov3-voc.cfg to 64, it worked fine.

ghost commented 4 years ago

Maybe your batch is set to exceed the maximum batch that your GPU can accept. The first time, I used batch=128 and subdivision=64 and got an "out of memory" error. My nvidia-smi shows 689/4096 (my GPU is a GTX 1050 Ti). When I changed the batch setting in cfg/yolov3-voc.cfg to 64, it worked fine.

What does your setting look like? I also have a GTX 1050 Ti; NVIDIA is installed and nothing seems to work.

Dikshantbisht13 commented 4 years ago

I'm getting the following error. It was running perfectly fine a few days back; after I updated my Ubuntu packages, it started giving the following error.

cuDNN status Error in: file: ./src/convolutional_layer.c : () : line: 301 : build time: Dec 3 2019 - 15:20:25
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: File exists
darknet: ./src/utils.c:293: error: Assertion `0' failed.
Aborted (core dumped)

Suggestions?

DynamicCodes commented 4 years ago

I'm getting the following error. It was running perfectly fine a few days back; after I updated my Ubuntu packages, it started giving the following error.

cuDNN status Error in: file: ./src/convolutional_layer.c : () : line: 301 : build time: Dec 3 2019 - 15:20:25
cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: File exists
darknet: ./src/utils.c:293: error: Assertion `0' failed.
Aborted (core dumped)

Suggestions?

Do you have the solution now? I'm facing the same problem.

guptavasu1213 commented 4 years ago

I had the same issue and I decreased the batch size from 128 to 16. The higher the batch size, the more memory required.

I'd suggest trying an iterative approach and seeing what number works for you.
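
A rough sketch of that iterative approach, assuming a hypothetical cfg at cfg/yolov3.cfg with a plain batch=... line and the usual detector training command; in practice you would also set max_batches low so each attempt is only a short test:

# try progressively smaller batch sizes until training starts without the OOM abort
for b in 64 32 16 8; do
    sed -i "s/^batch=.*/batch=${b}/" cfg/yolov3.cfg
    echo "trying batch=${b}"
    ./darknet detector train data/obj.data cfg/yolov3.cfg darknet53.conv.74 && break
done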

nisarggandhewar commented 4 years ago

While running darknet I am getting the following error; can anyone help me get past this?

C:\darknet-master\build\darknet\x64>darknet.exe detect cfg/yolov3.cfg yolov3.weights data/dog.jpg
 CUDA-version: 10010 (10010), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 1
 CUDNN_HALF=1
 OpenCV version: 4.2.0
 0 : compute_capability = 750, cudnn_half = 1, GPU: GeForce GTX 1650
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
   layer   filters  size/strd(dil)      input                output
   0 cuDNN status Error in: file: C:\darknet-master\src\dark_cuda.c : cudnn_handle() : line: 171 : build time: May 18 2020 - 15:11:05

cuDNN Error: CUDNN_STATUS_BAD_PARAM

zpengc commented 3 years ago

Anyway, I could handle this issue in my case. I'm not sure whether my way works in other people's cases. In my case, I made some modifications to prevent the out-of-memory error.

With the above changes, I could avoid the memory issue. I hope this comment helps you guys.

FYI, I did not use OPENCV.

Thank you, I modified the batch field to 16 and it succeeded on a GTX 1050 Ti with 4 GB.

nikhilkhandelwal0702 commented 3 years ago

darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
384
Loaded: 0.000037 seconds
CUDA Error: invalid device symbol

I am trying to run the training in a Docker image with Ubuntu 16.04 and CUDA 9.2; my base machine has Ubuntu 20.04 and CUDA 11.4. The GPU is an NVIDIA RTX A6000.

Any suggestion on how I can resolve this error?
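
A likely cause (a hedged guess): "invalid device symbol" usually means the darknet binary was not built for this GPU's compute capability. The RTX A6000 is compute capability 8.6, and sm_86 is only supported from CUDA 11.1 onwards, so the CUDA 9.2 toolchain inside the container cannot target it. A rough sketch of the fix is to rebuild inside a CUDA 11.x image with a matching ARCH:

# inside a CUDA 11.x container: list the architectures this nvcc can target
nvcc --list-gpu-arch
# then set the Makefile ARCH for the A6000, e.g.:
#   ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]
make clean && make GPU=1 CUDNN=1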

Shashi630 commented 9 months ago

@AlexeyAB

shashivardhan@Shashivardhan:~/yolo-9000/darknet$ ./darknet detector test cfg/combine9k.data cfg/yolo9000.cfg ../yolo9000-weights/yolo9000.weights data/horses.jpg
layer     filters    size              input                output
    0 CUDA Error: the provided PTX was compiled with an unsupported toolchain.
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted

How do I solve this?
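
"the provided PTX was compiled with an unsupported toolchain" usually means the installed driver is older than the CUDA toolkit used to build the binary. A minimal sketch of the check: compare the two versions, then either upgrade the driver or rebuild darknet against an older toolkit so they match.

nvidia-smi | head -n 4   # driver version (recent drivers also show the highest CUDA version they support)
nvcc --version           # version of the CUDA toolkit darknet was built with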