@fatihbaltaci train longer. Darknet trains to 270 epochs. All the tutorials use detect.py with latest.pt to plot the results.
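For reference, a typical invocation along those lines might look like this (a sketch only; the exact flag names are an assumption about the repo's CLI at the time and may differ):

```bash
python3 detect.py --cfg cfg/yolov3.cfg --weights weights/latest.pt
```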
@glenn-jocher So, is it normal that mAP, P, and R are exactly 0 after 13 epochs? Before multi-GPU support, with a single GPU I got 0.361 mAP after 3 epochs.
@fatihbaltaci you need to use the latest commit. Your screenshot is from a very out of date version of this repo.
@fatihbaltaci this appears connected to the new dataloader. See #141. The current workaround is to set `num_workers` to 0 until we can debug further.
@fatihbaltaci this may be fixed now, though I've set `num_workers=0` as the default out of caution.
@fatihbaltaci I'm facing a similar situation: while training on my own dataset, recall and mAP are still zero after 100 epochs. However, when I tested another YOLOv3 project implemented in TensorFlow, 20 epochs were enough to get decent results. I suspect the optimizer may be the reason for the slow convergence, so I am now experimenting with the optimizer hyperparameters, such as switching to Adam and adding `weight_decay`. I will post my results if I find a way to converge faster.
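As a rough sketch of the optimizer change being described (the stand-in model and the hyperparameter values below are illustrative, not the repo's actual defaults):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the YOLOv3 model

# Swap the default SGD optimizer for Adam with weight decay;
# lr and weight_decay values here are illustrative guesses.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
```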
@glenn-jocher I tried with the latest commit (943db40f1ad600cdcdc40ff06588cd9c9bf2523f) with `num_workers=4`, and it again gives zero mAP. I also tried with `num_workers=0`, and it gives zero mAP as well.
@perry0418 I am training YOLOv3 with the COCO2014 dataset. I don't think slow convergence is the cause of the zero mAP. Did you try with `num_workers=0`?
@glenn-jocher I tried on a single GPU and there is no problem; mAP is not zero. I switched from multi-GPU to single-GPU with `export CUDA_VISIBLE_DEVICES=1`.
@fatihbaltaci that's so strange. Yes, that is the latest commit. I ran two epochs on a GCP instance to test the commit right before this one and got the results below. I used `num_workers=4` for train.py but `num_workers=0` in test.py. I saw normal results with `num_workers=4` using test.py with yolov3.weights, so I assumed it was working fine independently as well, hence the switch to set both defaults to 4 workers. My test was single-GPU.
@perry0418 The expected mAP should be around 0.15 to 0.25 for the first couple of epochs. It is this high because we start from the darknet53 backbone, not from scratch. See the README for training plots.
```
0/269  7328/7328  0.217  0.744  8.78  2.02  11.8  109  0.391  0.187  0.205  0.186
1/269  7328/7328  0.191  0.489  2.67  1.27  4.61  103  0.393  0.267  0.281  0.26
...
```
@glenn-jocher With a single GPU there is no problem. Did you try training with multi-GPU?
@fatihbaltaci everything works as intended as long as you limit yourself to 1 GPU. I'm traveling today, so I won't be able to debug further until tomorrow. The issue probably has to do with the padding/collating of the targets (the DataLoader requires all 16 target tensors in a batch to have the same shape). I suspect it's possible the targets are not arriving in the correct order under multi-GPU with multiple workers.
For now 1 GPU + 4 workers is the fastest correct implementation (the current commit default).
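For context, here is a minimal sketch of the target padding/collation described above, assuming each sample yields an image tensor and a variable-length [n_boxes, 5] target tensor; the function and target layout are illustrative, not the repo's actual code:

```python
import torch

def collate_fn(batch):
    # Pad variable-length target tensors to a common length so the
    # DataLoader's batching constraint (equal shapes) is satisfied.
    imgs, targets = zip(*batch)  # each target: [n_boxes, 5]
    max_n = max(t.shape[0] for t in targets)
    padded = torch.zeros(len(targets), max_n, 5)
    for i, t in enumerate(targets):
        padded[i, :t.shape[0]] = t
    return torch.stack(imgs, 0), padded
```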
I'm debugging this, but it's pretty difficult because we don't have multi-GPU machines on premise; we use GCP in this case. I created a https://github.com/ultralytics/yolov3/tree/multi_gpu branch with updates and tried adding a custom collate function to the dataloader, which works great but did not solve the problem. I also saved the images and targets for the first few batches, and the multi-GPU overlays appear identical to the single-GPU overlays. Additionally, the multi- and single-GPU losses seem to track each other well, though multi-GPU `--resume` shows drastically different losses than single-GPU resume. test.py continues to operate correctly with yolov3.weights under all GPU/worker combinations, so the issue appears isolated to training rather than testing.
Here are the first 3 batches of overlays using 2 GPUs on GCP. The targets appear to line up correctly on the images, which rules out my first idea that there was a target-ordering mismatch. I remain unsure where the problem stems from, though it is isolated to multi-GPU training. (Overlay images: BATCH0, BATCH1, BATCH2.)
@glenn-jocher @fatihbaltaci I'm glad to tell you that my latest debugging has solved this problem. It was caused by the wrong input setup for the dataloader: when using multiple GPUs and processes, the official PyTorch example recommends torch.nn.parallel.DistributedDataParallel together with torch.utils.data.distributed.DistributedSampler (https://github.com/pytorch/examples/blob/master/imagenet/main.py). Below are my test results with `num_workers=4`, 2 GPUs, and resume training. I will submit a pull request later.
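For readers, a minimal sketch of the setup being described, following the linked ImageNet example; the dataset, model, and address below are placeholders, and in practice one process per GPU is launched with its own rank:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Join the process group (placeholder address; rank and world_size
# normally come from the launcher, one process per GPU).
dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                        world_size=1, rank=0)

dataset = TensorDataset(torch.randn(64, 3, 416, 416))  # stand-in for COCO
sampler = DistributedSampler(dataset)  # shards samples across processes
loader = DataLoader(dataset, batch_size=16, num_workers=4, sampler=sampler)

model = nn.Conv2d(3, 16, 3).cuda()  # stand-in for Darknet
model = nn.parallel.DistributedDataParallel(model, device_ids=[0])
```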
@fatihbaltaci @glenn-jocher here is my modified repository, which solves the problem mentioned above: https://github.com/perry0418/yolov3
@perry0418 thanks for the PR! I merged it, but I get an error on the merge, and on your local repo as well:
```
rm -rf yolov3 && git clone https://github.com/perry0418/yolov3
cd yolov3 && python3 train.py

Namespace(accumulate=1, batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='cfg/coco.data', dist_backend='nccl', dist_url='tcp://224.66.41.62:23456', epochs=270, img_size=416, multi_scale=False, num_workers=4, rank=-1, resume=False, world_size=-1)
Found 2 GPUs
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)

Traceback (most recent call last):
  File "train.py", line 233, in <module>
    num_workers=opt.num_workers
  File "train.py", line 65, in train
    dist.init_process_group(backend=opt.dist_backend, init_method=opt.dist_url, world_size=opt.world_size, rank=opt.rank)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 88, in _tcp_rendezvous_handler
    raise _error("rank parameter missing")
ValueError: Error initializing torch.distributed using tcp:// rendezvous: rank parameter missing
```
Please set the TCP IP to your machine's own IP and set rank=0.
Thanks. What should `world_size` be set to? Also, do you know how `world_size` relates to `num_workers`?
@perry0418 Thanks for your support. The problem is solved.
```
0/269  3658/3664  0.225  0.847  12.1  2.26  15.4  202  0.659
0/269  3659/3664  0.225  0.847  12.1  2.26  15.4  164  0.628
0/269  3660/3664  0.225  0.847  12.1  2.26  15.4  259  0.504
0/269  3661/3664  0.225  0.847  12.1  2.26  15.4  226  0.646
0/269  3662/3664  0.225  0.847  12    2.26  15.4  231  0.666
0/269  3663/3664  0.225  0.846  12    2.26  15.4  266  0.654
0/269  3664/3664  0.225  0.846  12    2.26  15.4  110  4.03
```

```
Found 2 GPUs
Using cuda _CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11178MB, multi_processor_count=28)
```

```
Image  Total      P      R    mAP
   32   5000  0.149  0.131  0.112  1.73s
   64   5000  0.163  0.157  0.143  0.358s
   96   5000  0.169  0.181  0.168  0.378s
  128   5000  0.162  0.174  0.161  0.345s
  160   5000  0.158  0.169  0.157  0.367s
  192   5000  0.152  0.161  0.148  0.355s
  224   5000  0.152  0.159  0.146  0.382s
  256   5000  0.149  0.157  0.143  0.383s
  288   5000  0.153  0.163  0.148  0.329s
  320   5000  0.148  0.157  0.143  0.356s
  352   5000  0.144  0.152  0.138  0.386s
  384   5000  0.146  0.154  0.14   0.349s
  416   5000  0.145  0.156  0.142  0.401s
```
@glenn-jocher here you can find all you need. https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
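For what it's worth: per those docs, `world_size` is the total number of processes participating in the job (e.g. 2 for one process per GPU on a 2-GPU machine) and `rank` is each process's index; `num_workers` is unrelated, as it only controls each process's dataloader subprocesses. A hypothetical invocation, assuming the flag spellings match the `Namespace` fields in the traceback above:

```bash
python3 train.py --dist-url tcp://<your-machine-ip>:23456 --rank 0 --world-size 1
```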
Why, even though I have updated to the newest version of the code, is the mAP still 0 after several epochs? Even when I use the yolov3.weights model to test, the situation doesn't improve. I am using multi-GPU and the newest code.
I am sorry, this is my fault. Thanks a lot :->
@perry0418 thanks a lot for the PR!! It looks like the issue is resolved now, so I am happily closing the issue.
@ygliang2009 Getting the newest version of this repository may work for you now. It works on my machine.
Hi! How did you resolve this issue? I am having the same issue as well.
Many thanks in advance.
@oykusahin hi there! It's great to hear that you found a resolution to the issue! The YOLO community and the Ultralytics team have put in a lot of effort to address such challenges. If you encounter any further issues, don't hesitate to reach out. Keep up the great work!
@glenn-jocher Hi, when I run test.py with latest.pt or best.pt, mAP, precision, and recall become zero. But there is no problem with the official YOLO weights (yolov3.weights).
My training command is `python train.py --img-size=608`.
Is there any problem with model saving?