Hello, the issue is rectified now and the other errors are also fixed; please refer to the notebooks again. In your case, the folder name "tools" is clashing with another folder of the same name, hence the error. Try renaming the folder and you will be able to import successfully.
For example, after renaming the folder:
from tools_deepsort import generate_detections as gdet
Hi,
You have removed:
from google.colab import drive
drive.mount('/content/gdrive')
So I guess it is planned to run on my local GPU Ubuntu 18.04 machine. I downloaded the complete folder of https://github.com/spmallick/learnopencv/tree/master/ALPR to my local GPU Ubuntu 18.04 machine and started my conda environment, which has everything that is needed (including requirements.txt). With the new ALPR_inference.ipynb the first 7 steps work as documented. Step [8], the first step of the Detector section, fails.
%cd ./darknet fails. There is a darknet folder under ./License-plate-detection but I do not think it is the right one.
(Just to be sure, I did all the steps on Colab. Same failure.)
I am also wondering about the first OCR step of:
%cd ../
which on Colab changes the cwd to "/content", which is very unusual.
If you are downloading the code from here, you need to set the paths accordingly, like ./License-plate-detection/darknet/, and the darknet folder under this is the right one. Otherwise, if you clone darknet and the other code as shown in the notebook or the blog post, you will not face any errors.
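In that case, the detector cell would be (a sketch under that assumption, replacing `%cd ./darknet`):

```
%cd ./License-plate-detection/darknet
```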
Hi,
OK, I followed https://learnopencv.com/automatic-license-plate-recognition-using-deep-learning/?ck_subscriber_id=452195442 and decided to try training. The initial steps are fine.
Then, under Dataset, after "import math", three more imports are needed:
import math
import os
import matplotlib.image as image
import matplotlib.pyplot as plt
Make sure the cwd is darknet:
%cd darknet
Now the images display nicely with plt.
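For reference, a minimal sketch of the display step (the file name is one of the training images from train.txt; any image under data/obj/ works):

```python
import matplotlib.image as image
import matplotlib.pyplot as plt

# Read one of the training images and show it inline in the notebook.
img = image.imread("data/obj/train/3fe012d7a03f9927.jpg")
plt.imshow(img)
plt.axis("off")
plt.show()
```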
Under "Training" you skipped the location for data.names. Also the contents are wrong:
classes = 1
train = ./darknet/data/obj/train.txt
valid = ./darknet/data/obj/test.txt
names = /content/gdrive/MyDrive/yolov4-darknet/darknet/data/obj.names
backup = ./checkpoint
classes = 1
train = ./data/obj/train.txt
valid = ./data/obj/test.txt
names = ./data/obj.names
backup = ./checkpoint
These are remains of the Colab setup; the paths should not have "darknet" in them.
The downloaded train.txt is for Colab and has to be updated from:

/content/gdrive/My Drive/yolov4-darknet/darknet/data/obj/train/3fe012d7a03f9927.jpg

to:

./data/obj/train/3fe012d7a03f9927.jpg
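A quick way to do that rewrite in one notebook cell (a minimal sketch, assuming the cwd is ./darknet and test.txt needs the same fix):

```python
# Replace the Colab-specific prefix with a local relative path in both list files.
old_prefix = "/content/gdrive/My Drive/yolov4-darknet/darknet/"
for list_file in ("data/obj/train.txt", "data/obj/test.txt"):
    with open(list_file) as f:
        text = f.read()
    with open(list_file, "w") as f:
        f.write(text.replace(old_prefix, "./"))
```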
You forgot to mention that the "checkpoint" directory should be created under darknet:
!mkdir checkpoint
You forgot to mention that yolov4.conv.137 should be under ./darknet
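If the file is not there yet, it can be fetched directly into ./darknet (the URL below is the usual AlexeyAB release asset; this is an assumption, so verify it against the blog post):

```
# Run from inside ./darknet; URL is assumed from the standard AlexeyAB darknet release.
!wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137
```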
Now the training command works:
!./darknet detector train data/obj.data cfg/yolov4-obj.cfg yolov4.conv.137 -dont_show -map
It executed fine and then gave the message "Error: cuDNN isn't found FWD algo for convolution":

Tensor Cores are disabled until the first 3000 iterations are reached.
(next mAP calculation at 1000 iterations)
10: -nan, -nan avg loss, 0.000000 rate, 9.552273 seconds, 640 images, 6.814552 hours left
Resizing, random_coef = 1.40
512 x 512
Error: cuDNN isn't found FWD algo for convolution.

ALPR_inference_my2.ipynb.txt
I have checked cuDNN 8 with the NVIDIA verification procedure:
cd cudnn_samples_v8
cd mnistCUDNN
make clean && make
./mnistCUDNN
Executing: mnistCUDNN
cudnnGetVersion() : 8303, CUDNN_VERSION from cudnn.h : 8303 (8.3.3)
Host compiler version : GCC 9.4.0
There are 1 CUDA capable devices on your machine:
device 0 : sms 68, Capabilities 7.5, SmClock 1650.0 Mhz, MemSize (Mb) 11263, MemClock 7000.0 Mhz, Ecc=0, boardGroupID=0
Using device 0
Resulting weights from Softmax: 0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
I have followed the procedure for training and made some corrections to make it start. Something is not working: I will wait a few hours as it appears to be running, but GPU load is at 1-2%.

Sun Apr 3 23:22:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 512.15 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:0A:00.0 On | N/A |
| 25% 33C P8 7W / 260W | 10141MiB / 11264MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4258 C /darknet N/A |
| 0 N/A N/A 32718 C /python3.7 N/A |
+-----------------------------------------------------------------------------+
For this error: increase the subdivisions value in the yolov4-obj.cfg file to 32 or 64.

I have CUDA 10.0 and 11.0. I will check increasing the subdivisions as suggested.
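A minimal sketch of making that change programmatically (assuming the cfg lives at ./darknet/cfg/yolov4-obj.cfg; editing the file by hand works just as well):

```python
import re

cfg_path = "./darknet/cfg/yolov4-obj.cfg"  # assumed location
with open(cfg_path) as f:
    cfg = f.read()
# Raise subdivisions so each mini-batch slice needs less GPU memory.
cfg = re.sub(r"^subdivisions=\d+", "subdivisions=32", cfg, flags=re.M)
with open(cfg_path, "w") as f:
    f.write(cfg)
```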
Increasing the value in the ./darknet/cfg/yolov4-obj.cfg file to subdivisions=32 works! It is running now. GPU memory use is down to 6 GB (from 10.5 GB with subdivisions=16), and GPU utilization is now 50-80%.
Every 2.0s: nvidia-smi MICKEY-2080TI-wsl: Mon Apr 4 14:04:41 2022
Mon Apr 4 14:04:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.60.02 Driver Version: 512.15 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:0A:00.0 On | N/A |
| 55% 68C P2 250W / 260W | 6104MiB / 11264MiB | 78% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 11842 C /darknet N/A |
| 0 N/A N/A 32718 C /python3.7 N/A |
+-----------------------------------------------------------------------------+
After 6h it is still at:
Tensor Cores are disabled until the first 3000 iterations are reached.
(next mAP calculation at 2900 iterations)
2820: 0.278743, 0.474088 avg loss, 0.001000 rate, 2.213002 seconds, 180480 images, 2.936276 hours left
Resizing, random_coef = 1.40
416 x 416
try to allocate additional workspace_size = 82.58 MB
CUDA allocate done!
No further progress can be seen.
I rebooted, changed to subdivisions=32, and it ran for some time and got stuck again:
Tensor Cores are disabled until the first 3000 iterations are reached.
544: 1.236107, 1.296011 avg loss, 0.000088 rate, 2.285857 seconds, 34816 images, 3.919907 hours left
I rebooted and cleaned ./darknet/checkpoint. It got stuck again at iteration 3608 with loss=0.4.
These errors all happen because memory keeps running out; try increasing the subdivisions further.
Training still "dies" in the middle. I will try it with colab on the larger GPU just to see it training all the way to the end. I do not have a memory issue when changing to subdivisions=32/64. could be heating, but I saw no indication of this (an I have water cooling on the RTX 2080TI. I see 225W/265W in nvidia-smi).
Well, it died in the middle on Colab with a K80 GPU after I changed subdivisions to 32. Forget Colab - no chance to get a GPU for more than 30 minutes.
I am also compiling darknet for the 2080 Ti by uncommenting the correct ARCH row in the Makefile, to check again on my 2080.
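For reference, the row in question in the darknet Makefile looks roughly like this (an excerpt from memory; the surrounding comment wording varies between darknet versions). The RTX 2080 Ti is a compute-capability 7.5 (Turing) card:

```
# Uncomment the ARCH line matching the GPU (RTX 2080 Ti = Turing, compute 7.5):
ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]
```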
Got stuck here:

(next mAP calculation at 3200 iterations)
Tensor Cores are used.
Last accuracy mAP@0.50 = 67.72 %, best = 68.84 %
3177: 0.635285, 0.473156 avg loss, 0.001000 rate, 3.053653 seconds, 203328 images, 2.409412 hours left
Can't get the training completed. The software is "hard stuck" and does not respond to Ctrl+C. Any idea what can cause this? The machine has 64 GB RAM, 32 cores, and an RTX 2080 Ti that was at 50% memory use.
Question: is there any log that can show anything?
I have successfully downloaded and compiled darknet on Windows 11 with CUDA 11.6 and cuDNN 8.4, using the simple instructions in the darknet README.
CUDA-version: 11060 (11060), cuDNN: 8.4.0, GPU count: 1
OpenCV version: 4.5.5
Using the settings described above, based on the guidance in the notebook, training completed successfully:
Set -points flag:
`-points 101` for MS COCO
`-points 11` for PascalVOC 2007 (uncomment `difficult` in voc.data)
`-points 0` (AUC) for ImageNet, PascalVOC 2010-2012, your custom dataset
mean_average_precision (mAP@0.50) = 0.897842
Saving weights to ./checkpoint/yolov4-obj_6000.weights
Saving weights to ./checkpoint/yolov4-obj_last.weights
Saving weights to ./checkpoint/yolov4-obj_final.weights
If you want to train from the beginning, then use flag in the end of training command: -clear
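To sanity-check the finished weights, darknet's evaluation subcommand can be run against the final checkpoint (a sketch; the weights file name is taken from the save messages above):

```
# Hypothetical check: evaluate the trained weights on the validation set listed in obj.data.
!./darknet detector map data/obj.data cfg/yolov4-obj.cfg checkpoint/yolov4-obj_final.weights
```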
It could be that my WSL2/Ubuntu 18.04 setup with CUDA 10.2 has some issue with darknet that made it get stuck. The underlying Windows 11 installation is very suitable for training. (I am keeping the WSL2/Ubuntu with CUDA 10.2 and cuDNN 7.6.5 for TensorFlow 1.14.)
So for now, all is good.
The attached is still not working, even after the corrections (see my email).