jkjung-avt / yolov4_crowdhuman

A tutorial on training a DarkNet YOLOv4 model for the CrowdHuman dataset
MIT License

Loss not converging #8

Closed bobbilichandu closed 3 years ago

bobbilichandu commented 3 years ago

[Image: training loss chart]

This is after 2848 iterations. I am training on the CrowdHuman dataset. What could be the possible reason for this type of behaviour?

jkjung-avt commented 3 years ago

Have you made any changes to the model architecture (cfg file) or to how the training data is processed/prepared?

bobbilichandu commented 3 years ago

Nope, I followed the steps exactly as stated. I haven't edited any cfg file. What might have caused this issue?

jkjung-avt commented 3 years ago

In general, there could be a lot of possible causes. But I don't know why it does not work for you if you have followed all the steps in this tutorial...

bobbilichandu commented 3 years ago

I rechecked everything: all the cfg files and the dataset are the same as in the repo. Might there be an issue with the chart?

Sample output:

    MJPEG-stream sent.
    Loaded: 0.000037 seconds
    v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 139 Avg (IOU: 0.753777), count: 10, total_loss = 81.239525
    v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 150 Avg (IOU: 0.750873), count: 27, total_loss = 48.345337
    v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 161 Avg (IOU: 0.786037), count: 17, total_loss = 14.176614
    total_bbox = 34617431, rewritten_bbox = 4.239789 %
    v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 139 Avg (IOU: 0.752733), count: 39, total_loss = 693.538025
    v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 150 Avg (IOU: 0.762998), count: 29, total_loss = 61.059933
    v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 161 Avg (IOU: 0.791270), count: 20, total_loss = 19.786587
    total_bbox = 34617519, rewritten_bbox = 4.239787 %

Could you please have a look at this and let me know if there is any obvious issue?

jkjung-avt commented 3 years ago

Sorry, I don't have any idea at this moment...

Have you tried training the model on Google Colab? Maybe you could make a comparison and try to identify the problem.

bobbilichandu commented 3 years ago

Training is completed and I got an mAP of 82.67% (tested as mentioned in the README). I think there is some issue with the chart. Is that so? Or was the training not done properly?
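(For reference, "tested as mentioned in the README" presumably refers to darknet's built-in mAP evaluation. A minimal sketch of that command follows; the .data/.cfg/.weights paths are assumptions based on this repo's layout, not quoted from the thread:)

```sh
# Hedged sketch: compute mAP@0.50 with darknet's built-in evaluator.
# The three file paths below are assumed from the repo layout.
./darknet detector map data/crowdhuman-608x608.data \
                       cfg/yolov4-crowdhuman-608x608.cfg \
                       backup/yolov4-crowdhuman-608x608_best.weights
```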

jkjung-avt commented 3 years ago

Right... I think some recent changes in AlexeyAB/darknet cause the reported training loss to be much higher than before.

As you've got a trained model with mAP@0.50 of 82.67%, I think the training has been done successfully. If you really want to observe the loss in the chart, you could modify 'max_chart_loss' in the cfg file:

https://github.com/jkjung-avt/yolov4_crowdhuman/blob/e4e7df7624adac38371647d5ba96a6e6512f66cc/cfg/yolov4-crowdhuman-608x608.cfg#L25
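Here is a minimal sketch of what that edit could look like in the [net] section of the cfg; the 1000.0 value is only illustrative (pick something above the loss values you are actually seeing so the curve fits on the chart):

```ini
[net]
# ... keep the existing [net] parameters (batch, subdivisions, width, height, ...) ...

# Upper bound of the y-axis in darknet's chart.png.
# This only affects plotting, not training.
max_chart_loss=1000.0
```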

bobbilichandu commented 3 years ago

Oh, OK, thanks. This might solve the issue.

PiyalGeorge commented 3 years ago

@chandu1263 Were you able to resolve this?

@jkjung-avt I'm also getting this graph. Following your repo, I trained a YOLOv4-tiny 608x608 model on the CrowdHuman dataset (2 classes). I trained for 30,000 iterations and the curve still hasn't gone down a bit (the blue line at the top of the picture is the loss curve). Were you also getting this kind of graph? (training size ~ 12k, test size ~ 2k)

[Image: chart_yolov4-tiny-custom]

bobbilichandu commented 3 years ago

I got the same graph, but the training is happening anyway; the plotting was just not done properly. Try reducing the parameter 'max_chart_loss=40.0' to 10.0 or a lower value if you want the graph. I didn't want to train the model again just for this graph.

PiyalGeorge commented 3 years ago

@chandu1263 Thanks, I'll try that. In the yolov4-tiny cfg, 'max_chart_loss' is not there. @jkjung-avt @chandu1263 I'm asking because I'm curious about these weird graphs. With another dataset, I trained a YOLOv4-tiny 608x608 for one class and got the graph below, yet the model was really accurate. I don't understand why the graphs are being plotted like this.

[train size ~ 12k and test size ~ 2k]

[Image: chart_yolov4-tiny-custom]
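(A note on the missing key: AlexeyAB/darknet reads 'max_chart_loss' as an optional entry in the [net] section and falls back to a built-in default when it is absent, so the line can simply be added to a yolov4-tiny cfg even though the stock file omits it. A hedged sketch:)

```ini
# Hedged sketch: darknet accepts this optional key under [net] even when
# the stock yolov4-tiny cfg omits it; add it to adjust the chart's y-axis.
max_chart_loss=1000.0
```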

bobbilichandu commented 3 years ago

darknet changed some of its parameter values, and this repo was developed against a somewhat older version (not too old). If you want to know more about the plotting, have a look at the training code and the cfg parsing in the darknet repo.

jkjung-avt commented 3 years ago

> Right... I think some recent changes in AlexeyAB/darknet cause the reported training loss to be much higher than before.

@PiyalGeorge The loss value (275.8425) on your chart looks similar to what I've been seeing recently. The solution would be to set, say, 'max_chart_loss=1000.0' in the cfg file.

PiyalGeorge commented 3 years ago

@jkjung-avt @chandu1263 Thanks a lot. So the conclusion I can draw from you guys is: the model is still training properly, and the issue is only with the graph plotting.