I have the same issue, any solution?
You have other applications occupying the GPU memory. You have a 4 GB GPU, but 3.5 GB of it is already in use.
@TonyChouZJU you are right: 3554MiB / 4036MiB. I guess this OOM is just due to a lack of GPU memory at allocation time, since I have no other process running on this GPU inside the Docker instance. In TensorFlow there is a way to limit GPU memory allocation (it's a well-known behavior/problem that TF tries to allocate as much memory as possible) using per_process_gpu_memory_fraction, etc. I wonder if @pjreddie's darknet could do something similar.
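For reference, limiting TensorFlow's allocation looks roughly like the following. This is a minimal TensorFlow 1.x sketch, and the 0.4 fraction is only an illustrative value, not a recommendation:
# Minimal TF 1.x sketch: cap the fraction of GPU memory this process may allocate.
# The 0.4 value is an arbitrary example.
import tensorflow as tf

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=config)
As far as this thread goes, darknet's memory use is controlled instead through the batch and subdivisions settings in the .cfg file, discussed below.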
@Abduoit were you using a 4GB GPU as well?
@loretoparisi no, I was using an NVIDIA Quadro K600, which has 1 GB of GPU memory. I'll try it with an NVIDIA GeForce 1080 (8 GB GPU) and will let you know.
@Abduoit thanks, that would be a great test to understand whether it's an OOM-related issue!
I'm having the same issue. I have a 1060 with 6GB of memory.
@Abduoit, have you solved the issue you had?
I have the same issue. I have spent almost 2 days trying to solve it, but I couldn't. :( I'm working on an AWS EC2 P2 instance - Ubuntu 16.04 and a Tesla K80.
@Labyrins what is your K80 memory usage (nvidia-smi output)?
@loretoparisi At the start of training, memory usage was not high (maybe 10~20% of available memory). As training goes on, memory usage rises quickly and overflows.
Anyway, I was able to handle this issue in my case. I'm not sure whether my approach works for other setups, but these are the changes I made to prevent the out-of-memory error:
Change ARCH in the Makefile to match your GPU:
https://github.com/pjreddie/darknet/blob/1e729804f61c8627eb257fba8b83f74e04945db7/Makefile#L7
I used a Tesla K80, so I changed the arch setting accordingly:
ARCH= -gencode arch=compute_37,code=sm_37
# for K80
Adjust batch and subdivisions in your cfg file (see the sketch below). These options have a great effect on memory overflow: generally, a lower batch means lower memory. Find your optimal value. https://github.com/pjreddie/darknet/blob/1e729804f61c8627eb257fba8b83f74e04945db7/cfg/tiny.cfg#L2
With the above changes I could avoid the memory issue. I hope this comment helps you.
FYI, I did not use OPENCV.
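As a rough illustration of the batch/subdivisions advice above, the relevant lines of a darknet .cfg file look like the following; the values 64 and 16 are only example numbers, not a recommendation for any particular GPU:
[net]
# batch images are loaded per iteration and processed in
# batch/subdivisions chunks; lowering batch (or raising subdivisions)
# reduces how much has to fit on the GPU at once.
batch=64
subdivisions=16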
@Labyrins so it's not only a matter of GPU memory, but of the architecture as well, as specified by the ARCH setting in the Makefile.
@loretoparisi Yes, check that the architecture setting matches your GPU, and also check your cfg file.
Hey guys, I had the same issue; I ran it with optirun and it worked. Example: optirun ./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg
@dmarkov00 do you have opencv installed? What does optirun do?
@loretoparisi I ran it without opencv. As far as I understood, optirun calls/points to the NVIDIA driver/interface, or makes sure the CUDA part is executed (not exactly sure).
@Labyrins thank you! I have followed your hints:
I have set the arch for the K80 (I have moved to this GPU now) and activated CUDNN as well:
GPU=1
CUDNN=1
OPENCV=0
OPENMP=0
DEBUG=0
# choose arch here: https://developer.nvidia.com/cuda-gpus
#ARCH= -gencode arch=compute_20,code=[sm_20,sm_21] \
-gencode arch=compute_30,code=sm_30 \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_50,code=[sm_50,compute_50] \
-gencode arch=compute_52,code=[sm_52,compute_52]
# Tesla K80
ARCH= -gencode arch=compute_37,code=[sm_37]
This is the RNN config, where I have changed batch from 128 to 64:
[net]
subdivisions=1
inputs=256
batch = 64
momentum=0.9
decay=0.001
max_batches = 2000
time_steps=576
learning_rate=0.1
policy=steps
steps=1000,1500
scales=.1,.1
[rnn]
batch_normalize=1
output = 1024
hidden=1024
activation=leaky
[rnn]
batch_normalize=1
output = 1024
hidden=1024
activation=leaky
[rnn]
batch_normalize=1
output = 1024
hidden=1024
activation=leaky
[connected]
output=256
activation=leaky
[softmax]
[cost]
type=sse
where I have:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66 Driver Version: 384.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 81C P0 122W / 149W | 7642MiB / 11439MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9640 C ./darknet 5667MiB |
| 0 27214 C python3 1962MiB |
and
# ./darknet rnn train cfg/rnn.train.cfg -file ./t8.shakespeare.txt
rnn
layer filters size input output
0 RNN Layer: 256 inputs, 1024 outputs
connected 256 -> 1024
connected 1024 -> 1024
connected 1024 -> 1024
Unused field: 'hidden = 1024'
1 RNN Layer: 1024 inputs, 1024 outputs
connected 1024 -> 1024
connected 1024 -> 1024
connected 1024 -> 1024
Unused field: 'hidden = 1024'
2 RNN Layer: 1024 inputs, 1024 outputs
connected 1024 -> 1024
connected 1024 -> 1024
connected 1024 -> 1024
Unused field: 'hidden = 1024'
3 connected 1024 -> 256
4 softmax 256
5 cost 256
Learning Rate: 0.1, Momentum: 0.9, Decay: 0.001, Inputs: 256 36864 576
1: 0.993232, 0.993232 avg, 0.100000 rate, 5.930198 seconds, 0.006754 epochs
2: 0.982678, 0.992177 avg, 0.100000 rate, 5.314698 seconds, 0.013508 epochs
3: 0.956013, 0.988560 avg, 0.100000 rate, 5.327953 seconds, 0.020262 epochs
4: 1.024190, 0.992123 avg, 0.100000 rate, 5.323785 seconds, 0.027016 epochs
So it seems to work now!
@Labyrins hey I have a strange output. Basically the rnn generate command:
./darknet rnn generate cfg/rnn.cfg ./my_model.weights -srand 0 -len 1000 -seed love
returns part of the text (the exact text, i.e. not newly generated text) starting from the seed.
My training command was like
./darknet rnn train cfg/rnn.train.cfg -file /root/my_text_dataset.txt
I have used the rnn.train.cfg configuration above, with batch=64 and steps=1000,1500.
Btw, I opened a new ticket here.
I just encountered this problem and solved it. My solution was
sudo rm -rf ~/.nv
and then reboot.
@peterbyzhang do you mean the assertion failed with rnn train? I'm not sure you need to reboot; maybe you can just reset the GPU with sudo nvidia-smi --gpu-reset -i 0 instead of doing a reboot.
As I played with darknet, from time to time this .nv folder would be created in /home/, and not only would darknet fail, but PyTorch would report some kind of CUDA internal failure as well. I was running 'detector test', the detection command for darknet. This problem is not restricted to rnn train; it seems to be a general problem with the GPU. Let me know if removing .nv works for you.
Maybe this will help you: run the executable as root, i.e. sudo ./darknet [your parameters]. I was inspired by this article: https://devtalk.nvidia.com/default/topic/760872/ubuntu-12-04-error-cudagetdevicecount-returned-30/ The reason may be that the GPU is a device which can only be used by root.
@peterbyzhang GTX 1050 (2 GB) + Ubuntu 16.04: I just solved this problem with your answer! sudo rm -rf ~/.nv works. Thanks.
A restart and an adjustment of batch (=64) did it for me. However, my model is still training; I hope it's working.
I am trying to train my own dataset and CUDA 8.0 is giving this error:
CUDA Error: out of memory darknet: ./src/cuda.c:36: check_error: Assertion `0' failed. Aborted (core dumped)
and the nvidia-smi details are:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro 4000 Off | 00000000:09:00.0 On | N/A |
| 40% 51C P12 N/A / N/A | 242MiB / 1977MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c Off | 00000000:22:00.0 Off | 0 |
| 23% 42C P8 24W / 235W | 1MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I'm having the same problem. I'm getting darknet: ./src/cuda.c:36: check_error: Assertion `0' failed. when using a batch greater than 4. When using batch=4 (or lower) and subdivisions=1, the model loads but then immediately saves the final weights, so it actually does not do anything (which is weird). The log in that case is:
Changing the batch size (from 64 to 32) solved the issue on a GTX 1060 (Ubuntu 16.04), in the yolov3.cfg file.
I found this note here: https://github.com/AlexeyAB/darknet - "Note: if error Out of memory occurs then in .cfg-file you should increase subdivisions=16, 32 or 64".
Hi, I am trying to train a model on AWS EC2.
I am able to compile and build darknet code with GPU=1 and CUDNN=1.
But when I start training the model with "./darknet detector train <.data file> <.cfg file>", I get:
CUDA Error: no CUDA-capable device is detected darknet: ./src/cuda.c:36: check_error: Assertion `0' failed. Aborted (core dumped)
Though the above error says there is no CUDA-capable device, when I check the CUDA version with the command "nvcc --version", it gives me the following:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Could someone help me here? I would be very grateful for your help.
@Shivesh4680
It seems you are using an Amazon instance without a GPU. Can you show the output of the nvidia-smi command?
Try choosing the p3.2xlarge instance (Tesla V100).
Follow this manual on how to train Yolo on Amazon EC2: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412333942
Hi @AlexeyAB,
When I type the command "nvidia-smi", I get the following output:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
@Shivesh4680 So there is no NVIDIA GPU driver, and most likely there is no GPU. Choose one of the other instances: https://github.com/AlexeyAB/darknet/issues/1380#issuecomment-412333942
Thanks @AlexeyAB . It worked.
Could you help me? How can I train a model on YUV images? I think darknet by default trains on RGB images. I have RGB images and I want to convert them to YUV and train a model on them.
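If it is only the RGB-to-YUV conversion step that is unclear, a minimal sketch with OpenCV could look like the following; the folder names are placeholders, and whether darknet then interprets the stored channels the way you intend is a separate question:
# Hedged sketch: convert a folder of images to YUV with OpenCV.
# "images_rgb" and "images_yuv" are placeholder paths.
import glob
import os
import cv2

os.makedirs("images_yuv", exist_ok=True)
for path in glob.glob("images_rgb/*.jpg"):
    img = cv2.imread(path)                      # OpenCV loads images as BGR
    yuv = cv2.cvtColor(img, cv2.COLOR_BGR2YUV)  # BGR -> YUV
    cv2.imwrite(os.path.join("images_yuv", os.path.basename(path)), yuv)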
For anyone training on AWS, see @loretoparisi's answer. You will most likely need to change both the Makefile and the batch/subdivisions parameters in the yolov3.cfg file. The number of images per batch will not fit on the GPU otherwise.
export CUDA_CACHE_PATH=/tmp/cuda_mem
Solved for me.
Reduce your batch size. I used yolov3.cfg with batch=24 instead of batch=64 and it solved my problem.
Mostly you run out of memory because your batch size is too high.
I had the same issue while training SyntaxNet using TensorFlow, and the solution for that problem was the same as this one.
Maybe your batch is set to exceed the maximum your GPU can handle. The first time I used batch=128 and subdivisions=64, I got an "out of memory" error. My nvidia-smi showed 689/4096 MiB (my GPU is a GTX 1050 Ti). When I changed the batch setting in cfg/yolov3-voc.cfg to 64, it worked fine.
What does your setting look like? I also have a GTX 1050 Ti; the NVIDIA driver is installed and nothing seems to work.
It was running perfectly fine a few days back, but after I updated my Ubuntu packages it started giving the following error:
cuDNN status Error in: file: ./src/convolutional_layer.c : () : line: 301 : build time: Dec 3 2019 - 15:20:25 cuDNN Error: CUDNN_STATUS_BAD_PARAM cuDNN Error: CUDNN_STATUS_BAD_PARAM: File exists darknet: ./src/utils.c:293: error: Assertion `0' failed. Aborted (core dumped)
Suggestions?
Do you have a solution now? I'm facing the same problem.
I had the same issue and decreased the batch size from 128 to 16. The higher the batch size, the more memory is required.
I'd suggest trying an iterative approach and seeing what number works for you.
While running darknet I am getting the following error; can anyone help me get past this?
C:\darknet-master\build\darknet\x64>darknet.exe detect cfg/yolov3.cfg yolov3.weights data/dog.jpg
CUDA-version: 10010 (10010), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 1
CUDNN_HALF=1
OpenCV version: 4.2.0
0 : compute_capability = 750, cudnn_half = 1, GPU: GeForce GTX 1650
net.optimized_memory = 0
mini_batch = 1, batch = 1, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 cuDNN status Error in: file: C:\darknet-master\src\dark_cuda.c : cudnn_handle() : line: 171 : build time: May 18 2020 - 15:11:05
cuDNN Error: CUDNN_STATUS_BAD_PARAM
Thank you, following the advice above I modified the batch field to 16 and it succeeded on a GTX 1050 Ti (4 GB).
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing 384
Loaded: 0.000037 seconds
CUDA Error: invalid device symbol
I am trying to run the training in a Docker image with Ubuntu 16.04 and CUDA 9.2; my base machine has Ubuntu 20.04 and CUDA 11.4 available, and the GPU is an NVIDIA RTX A6000.
Any suggestion on how I can resolve this error?
@AlexeyAB
shashivardhan@Shashivardhan:~/yolo-9000/darknet$ ./darknet detector test cfg/combine9k.data cfg/yolo9000.cfg ../yolo9000-weights/yolo9000.weights data/horses.jpg
layer filters size input output
0 CUDA Error: the provided PTX was compiled with an unsupported toolchain.
darknet: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted
How can I solve this?
I'm running CUDA 8.0 on Ubuntu 16.04 LTS.