tadam98 commented 2 years ago

The attached is not working even after corrections. (See my email)

sanyam83 commented 2 years ago

Hello, the issue is fixed now, and the other errors are fixed as well; please refer to the notebooks again. In your case, the folder name "tools" clashes with another folder of the same name, hence the error. Try renaming the folder, and the import should then succeed.

For example, after renaming the folder:
from tools_deepsort import generate_detections as gdet
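
One way to confirm this kind of name clash is to ask Python where a package name resolves to before importing it. The sketch below uses the standard `importlib.util.find_spec`; the helper name `resolve_origin` is made up for illustration, and "tools" / "tools_deepsort" are simply the folder names mentioned in this thread.

```python
import importlib.util


def resolve_origin(name):
    """Return where Python would load top-level module `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Regular modules and packages expose a file path in `origin`.
    if spec.origin is not None:
        return spec.origin
    # Namespace packages (folders without __init__.py) only list search locations.
    return list(spec.submodule_search_locations or [])


if __name__ == "__main__":
    # If "tools" resolves to an unexpected location, another folder is shadowing it.
    for name in ("tools", "tools_deepsort"):
        print(name, "->", resolve_origin(name))
```

If the path printed for "tools" is not the folder shipped with the notebook, the import is picking up the other folder, which would match the error described above.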

tadam98 commented 2 years ago

Hi,

You have removed:

from google.colab import drive
drive.mount('/content/gdrive')

So I guess it is meant to run locally, in my case on a GPU machine with Ubuntu 18.04. I downloaded the complete folder https://github.com/spmallick/learnopencv/tree/master/ALPR to that machine and started my conda environment, which has everything needed (including requirements.txt). With the new ALPR_inference.ipynb the first 7 steps work as documented. Step [8], the first step of the Detector section, fails.

%cd ./darknet fails. There is a darknet folder under ./License-plate-detection but I do not think it is the right one.

(Just to be sure, I also ran all the steps on Colab, with the same failure.) I am also wondering about the first OCR step, %cd ../, which on Colab changes the cwd to "/content". That is very unusual.

sanyam83 commented 2 years ago

If you download the code from here, you need to set the paths accordingly, e.g. ./License-plate-detection/darknet/; the darknet folder under it is the right one. If you instead clone darknet and the other code as shown in the notebook or the blog post, you will not face any errors.
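
A small sketch of how a notebook cell could pick whichever darknet folder is present, instead of hard-coding `%cd ./darknet`. The function name `find_darknet_dir` and the candidate list are assumptions for illustration; the two relative paths are the ones discussed in this thread.

```python
from pathlib import Path


def find_darknet_dir(root, candidates=("darknet", "License-plate-detection/darknet")):
    """Return the first candidate directory that exists under `root`, else None."""
    root = Path(root)
    for relative in candidates:
        path = root / relative
        if path.is_dir():
            return path
    return None


if __name__ == "__main__":
    # Search relative to the current working directory.
    print(find_darknet_dir("."))
```

In a notebook, the returned path could then be passed to `os.chdir` (or `%cd`) before running the Detector cells.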

tadam98 commented 2 years ago

Hi,

OK, I followed https://learnopencv.com/automatic-license-plate-recognition-using-deep-learning/?ck_subscriber_id=452195442

And decided to try training. Initial steps are fine.

Then, under Dataset, after "import math" three more imports are needed.

import math
import os
import matplotlib.image as image 
import matplotlib.pyplot as plt

Make sure the cwd is darknet (%cd darknet). Now the images display nicely in the plot.

Under "Training" you skipped the location of obj.data (the data file passed to the training command). Its contents are also wrong:

place in darknet/data

classes = 1
train = ./darknet/data/obj/train.txt
valid = ./darknet/data/obj/test.txt
names = /content/gdrive/MyDrive/yolov4-darknet/darknet/data/obj.names
backup = ./checkpoint

Change to:

classes = 1
train = ./data/obj/train.txt
valid = ./data/obj/test.txt
names = ./data/obj.names
backup = ./checkpoint

The file still had leftover Colab paths, and the paths should not contain "darknet".
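
To avoid retyping the file by hand, the corrected contents can be generated. This is a minimal sketch that assumes training is launched from inside the darknet folder, so every path is relative to it; `render_obj_data` is a hypothetical helper name.

```python
def render_obj_data(classes=1, data_dir="./data", backup="./checkpoint"):
    """Build the contents of a darknet .data file using paths relative to the darknet folder."""
    lines = [
        f"classes = {classes}",
        f"train = {data_dir}/obj/train.txt",
        f"valid = {data_dir}/obj/test.txt",
        f"names = {data_dir}/obj.names",
        f"backup = {backup}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the file body; redirect or write it to data/obj.data as needed.
    print(render_obj_data(), end="")
```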

correct train.txt

The downloaded train.txt is for Colab and has to be updated from: /content/gdrive/My Drive/yolov4-darknet/darknet/data/obj/train/3fe012d7a03f9927.jpg to: ./data/obj/train/3fe012d7a03f9927.jpg
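
Since train.txt has one image path per line, the prefix change can be scripted. The sketch below is an illustration only: `rewrite_image_paths` is a made-up name, and the old prefix is the Colab path quoted above.

```python
def rewrite_image_paths(lines, old_prefix, new_prefix):
    """Swap a leading `old_prefix` for `new_prefix` on each path; other paths pass through unchanged."""
    rewritten = []
    for line in lines:
        path = line.strip()
        if not path:
            continue  # drop blank lines
        if path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        rewritten.append(path)
    return rewritten


if __name__ == "__main__":
    colab_prefix = "/content/gdrive/My Drive/yolov4-darknet/darknet/"
    sample = [colab_prefix + "data/obj/train/3fe012d7a03f9927.jpg"]
    print(rewrite_image_paths(sample, colab_prefix, "./"))
```

The same call would apply to test.txt if it carries the same Colab prefix.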

checkpoint

You forgot to mention that the "checkpoint" directory should be created under darknet: !mkdir checkpoint

yolov4.conv.137

You forgot to mention that yolov4.conv.137 should be under ./darknet

Now the training command works: !./darknet detector train data/obj.data cfg/yolov4-obj.cfg yolov4.conv.137 -dont_show -map
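
The missing pieces listed above can be checked in one go before launching training. This is a sketch under the assumption that the layout is exactly the one described in this comment (paths relative to the darknet folder); `missing_training_inputs` is a hypothetical helper.

```python
from pathlib import Path

# Paths relative to the darknet folder, taken from the steps in this thread.
REQUIRED_FILES = ("data/obj.data", "cfg/yolov4-obj.cfg", "yolov4.conv.137")
REQUIRED_DIRS = ("checkpoint",)


def missing_training_inputs(darknet_dir):
    """Return the required files and directories that are absent under `darknet_dir`."""
    darknet_dir = Path(darknet_dir)
    missing = [name for name in REQUIRED_FILES if not (darknet_dir / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (darknet_dir / name).is_dir()]
    return missing


if __name__ == "__main__":
    # An empty list means everything the training command expects is in place.
    print(missing_training_inputs("."))
```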

It executed fine at first and then gave the message below: "Error: cuDNN isn't found FWD algo for convolution"

 Tensor Cores are disabled until the first 3000 iterations are reached.
 (next mAP calculation at 1000 iterations) 
 10: -nan, -nan avg loss, 0.000000 rate, 9.552273 seconds, 640 images, 6.814552 hours left
Resizing, random_coef = 1.40 

 512 x 512 
 Error: cuDNN isn't found FWD algo for convolution.

ALPR_inference_my2.ipynb.txt

I have checked cudnn8 with the nvidia procedure:

cd cudnn_samples_v8
cd mnistCUDNN
make clean && make
./mnistCUDNN

Executing: mnistCUDNN cudnnGetVersion() : 8303 , CUDNN_VERSION from cudnn.h : 8303 (8.3.3) Host compiler version : GCC 9.4.0

There are 1 CUDA capable devices on your machine : device 0 : sms 68 Capabilities 7.5, SmClock 1650.0 Mhz, MemSize (Mb) 11263, MemClock 7000.0 Mhz, Ecc=0, boardGroupID=0 Using device 0 Resulting weights from Softmax: 0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006 Result of classification: 1 3 5 Test passed!

summary

I have followed the procedure for training and made some corrections to get it to start, but something is still not working. I will wait a few hours, as it appears to be running, but GPU load is at 1-2%.

Sun Apr 3 23:22:57 2022

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 512.15       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:0A:00.0  On |                  N/A |
| 25%   33C    P8     7W / 260W |  10141MiB / 11264MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4258      C   /darknet                        N/A      |
|    0   N/A  N/A     32718      C   /python3.7                      N/A      |
+-----------------------------------------------------------------------------+

Will any of these help:

https://stackoverflow.com/questions/53698035/failed-to-get-convolution-algorithm-this-is-probably-because-cudnn-failed-to-in

sanyam83 commented 2 years ago

For this error :

Check if CUDA and cuDNN are of the required versions. According to darknet requirements, CUDA >= 10.2 and cuDNN >= 8.0.2 should be installed.
Try increasing subdivisions value in yolov4-obj.cfg file to 32 or 64.
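
Some background on why the second suggestion helps: darknet splits each batch into `subdivisions` chunks, so only batch / subdivisions images are processed on the GPU at once, and a larger `subdivisions` lowers peak memory. The sketch below assumes batch=64, which is consistent with the log above (640 images after 10 iterations); `set_subdivisions` and `images_per_step` are hypothetical helper names, not part of darknet.

```python
import re


def images_per_step(batch, subdivisions):
    """Images processed on the GPU at once for a given batch and subdivisions."""
    return batch // subdivisions


def set_subdivisions(cfg_text, value):
    """Return `cfg_text` with every uncommented `subdivisions=` line set to `value`."""
    pattern = re.compile(r"^([ \t]*subdivisions[ \t]*=[ \t]*)\d+", flags=re.MULTILINE)
    return pattern.sub(lambda match: match.group(1) + str(value), cfg_text)


if __name__ == "__main__":
    cfg = "[net]\n#subdivisions=1\nbatch=64\nsubdivisions=16\n"
    print(set_subdivisions(cfg, 32))
    print(images_per_step(64, 16), "->", images_per_step(64, 32))
```

Commented-out lines such as `#subdivisions=1` are left alone because the pattern only matches lines that start with the key itself.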

tadam98 commented 2 years ago

I have CUDA 10.0 and 11.0. I will try changing subdivisions as suggested.

tadam98 commented 2 years ago

Increasing the value in the ./darknet/cfg/yolov4-obj.cfg file to subdivisions=32 works! It is running now. GPU memory use is now down to 6GB (from 10.5GB with subdivisions=16). GPU utilization is now 50-80%.

Every 2.0s: nvidia-smi               MICKEY-2080TI-wsl: Mon Apr  4 14:04:41 2022
Mon Apr  4 14:04:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02    Driver Version: 512.15       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:0A:00.0  On |                  N/A |
| 55%   68C    P2   250W / 260W |   6104MiB / 11264MiB |     78%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11842      C   /darknet                        N/A      |
|    0   N/A  N/A     32718      C   /python3.7                      N/A      |
+-----------------------------------------------------------------------------+
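
For tracking these numbers over a long run, nvidia-smi can also emit them as CSV through its `--query-gpu` and `--format` options, which is easier to log than the table. A minimal sketch follows; the function names are made up for illustration, and it assumes every queried field is reported as a plain number (some setups print "[N/A]" instead, which this parser does not handle).

```python
import subprocess

# Query utilization and memory as bare numbers, one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse rows of 'utilization, used MiB, total MiB' into one dict per GPU."""
    stats = []
    for row in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in row.split(","))
        stats.append(
            {"util_percent": util, "memory_used_mib": used, "memory_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi and return the parsed statistics (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` periodically and appending the result to a file would show whether utilization drops to near zero at the moment training stalls.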

tadam98 commented 2 years ago

After 6h it is still at:

 Tensor Cores are disabled until the first 3000 iterations are reached.
 (next mAP calculation at 2900 iterations) 
 2820: 0.278743, 0.474088 avg loss, 0.001000 rate, 2.213002 seconds, 180480 images, 2.936276 hours left
Resizing, random_coef = 1.40 

 416 x 416 
 try to allocate additional workspace_size = 82.58 MB 
 CUDA allocate done!

No further progress can be seen.

I rebooted, set subdivisions=32, and it ran for some time and then got stuck again:

Tensor Cores are disabled until the first 3000 iterations are reached.

 544: 1.236107, 1.296011 avg loss, 0.000088 rate, 2.285857 seconds, 34816 images, 3.919907 hours left

I rebooted, cleaned ./darknet/checkpoint. It got stuck again at 3608 with loss=0.4.

sanyam83 commented 2 years ago

These errors all occur because memory keeps running out. Try increasing subdivisions further.

tadam98 commented 2 years ago

Training still "dies" in the middle. I will try it on Colab with the larger GPU just to see it train all the way to the end. I do not have a memory issue after changing to subdivisions=32/64. It could be overheating, but I saw no indication of this (and I have water cooling on the RTX 2080TI; I see 225W/265W in nvidia-smi).

tadam98 commented 2 years ago

Well, it also died in the middle on Colab with a K80 GPU. I changed subdivisions to 32. Forget Colab: there is no chance of getting a GPU for more than 30 minutes.

I am also recompiling darknet for the RTX 2080 TI, by uncommenting the matching line in the Makefile, to check again on my 2080.

Got stuck here: (next mAP calculation at 3200 iterations)

Tensor Cores are used. Last accuracy mAP@0.50 = 67.72 %, best = 68.84 % 3177: 0.635285, 0.473156 avg loss, 0.001000 rate, 3.053653 seconds, 203328 images, 2.409412 hours left

I can't get the training to complete. The software is "hard stuck" and does not respond to Ctrl+C. Any idea what could cause this? The machine has 64GB of RAM and 32 cores, and the RTX 2080 TI was at 50% memory use.

Question: is there any log that can show anything?
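
One option is to save the console output of the training command to a text file and extract the progress lines from it, so the last iteration reached before a hang is recorded. The sketch below assumes such a file exists and that progress lines look like the ones pasted in this thread; `parse_progress` is a hypothetical helper, not a darknet feature.

```python
import re

# Matches lines such as:
#  2820: 0.278743, 0.474088 avg loss, 0.001000 rate, 2.213002 seconds, 180480 images, ...
PROGRESS = re.compile(r"(\d+):\s+(\S+),\s+(\S+) avg loss,.*?(\d+) images")


def parse_progress(log_text):
    """Extract iteration, loss, average loss and image count from darknet training output."""
    records = []
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match is None:
            continue
        records.append(
            {
                "iteration": int(match.group(1)),
                "loss": float(match.group(2)),      # float() also accepts "-nan"
                "avg_loss": float(match.group(3)),
                "images": int(match.group(4)),
            }
        )
    return records
```

Comparing the last parsed iteration a few minutes apart indicates whether training is still advancing or has stalled.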

tadam98 commented 2 years ago

I have successfully downloaded and compiled darknet on Windows 11 with CUDA 11.6 and cuDNN 8.4, using the simple instructions in the darknet README.

CUDA-version: 11060 (11060), cuDNN: 8.4.0, GPU count: 1
OpenCV version: 4.5.5
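
The integer forms in these version lines can be decoded and compared against the minimums quoted earlier in this thread (CUDA >= 10.2, cuDNN >= 8.0.2). The sketch assumes the usual encodings: CUDA as major * 1000 + minor * 10, and cuDNN 8.x as major * 1000 + minor * 100 + patch, which matches the "8303 (8.3.3)" printed by mnistCUDNN above. The function names are made up for illustration.

```python
MIN_CUDA = (10, 2)      # minimum quoted in this thread
MIN_CUDNN = (8, 0, 2)   # minimum quoted in this thread


def decode_cuda_version(value):
    """Decode CUDA's integer form, e.g. 11060 -> (11, 6)."""
    return value // 1000, (value % 1000) // 10


def decode_cudnn8_version(value):
    """Decode the cuDNN 8.x integer form, e.g. 8303 -> (8, 3, 3)."""
    return value // 1000, (value % 1000) // 100, value % 100


def meets_requirements(cuda, cudnn):
    """Tuple comparison against the minimum versions above."""
    return cuda >= MIN_CUDA and cudnn >= MIN_CUDNN


if __name__ == "__main__":
    print(decode_cuda_version(11060), decode_cudnn8_version(8303))
    print(meets_requirements((11, 6), (8, 4, 0)))
```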

Using the settings described above, based on the guidance in the notebook, training completed successfully:

Set -points flag:
 `-points 101` for MS COCO
 `-points 11` for PascalVOC 2007 (uncomment `difficult` in voc.data)
 `-points 0` (AUC) for ImageNet, PascalVOC 2010-2012, your custom dataset

 mean_average_precision (mAP@0.50) = 0.897842
Saving weights to ./checkpoint/yolov4-obj_6000.weights
Saving weights to ./checkpoint/yolov4-obj_last.weights
Saving weights to ./checkpoint/yolov4-obj_final.weights
If you want to train from the beginning, then use flag in the end of training command: -clear

It could be that my WSL2/Ubuntu 18.04 setup with CUDA 10.2 has some issue with darknet that caused it to get stuck. Training natively on the underlying Windows 11 works well. (I am keeping the WSL2/Ubuntu setup with CUDA 10.2 and cuDNN 7.6.5 for Tensorflow 14.)

So for now, all is good.

spmallick / learnopencv

learnopencv/ALPR/ is not working #658
