the role of compare_yolo_class

david0100101 commented 2 months ago

Hi guys, apologies if the title is misleading. I'm trying to understand the YOLO loss calculation in yolo_layer.cpp, and one detail in particular looks like a bug. On line 603: int class_id_match = compare_yolo_class(l.output, l.classes, class_index, l.w * l.h, objectness, class_id, 0.25f); Judging by the variable name, this function is supposed to return a boolean telling us whether the truth class we are comparing the box against shows up in the output, yet inside the function we simply return true if any class at all is above the threshold. Am I wrong? Moreover, it takes class_id as an argument but never uses it. This seems really counterintuitive. Again, sorry if this was already brought up or if it is meant to be this way for some reason I don't know. If the behavior is correct, maybe the function could be renamed to any_class_response or something similar, to better reflect its purpose and return value?
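To make the two behaviors concrete, here is a minimal sketch. It assumes the class scores are stored with a fixed stride, as the l.w * l.h argument in the call above suggests. The names any_class_above and given_class_above are hypothetical, picked for illustration; this is not the repository's code.

```cpp
// Hypothetical helper: true if ANY class score exceeds the threshold.
// This is the behavior described above (class_id plays no role).
bool any_class_above(const float* output, int classes, int class_index,
                     int stride, float thresh)
{
    for (int j = 0; j < classes; ++j) {
        if (output[class_index + stride * j] > thresh) return true;
    }
    return false;
}

// Hypothetical helper: true only if the score of the given truth class
// exceeds the threshold (what the variable name class_id_match suggests).
bool given_class_above(const float* output, int class_index, int stride,
                       int class_id, float thresh)
{
    return output[class_index + stride * class_id] > thresh;
}
```

With scores {0.1, 0.6, 0.2} and threshold 0.25, the first helper reports a match regardless of the truth class, while the second reports a match only for class 1.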

Tony904 commented 1 month ago

I looked into this recently, as I too am trying to understand YOLO's loss function. I went back through the change history, and the function originally seemed to work the way you think it should, by checking whether the passed class_id had a predicted score over 0.25, but it was later changed to what it is now. That leads me to believe the change was intentional and somehow gives better results, but ultimately I do not know.

david0100101 commented 1 month ago

Hi @Tony904! I did some research on YOLO and this specific issue, and I want to share some of my insights in case they are helpful to someone. First of all, to understand the difference that compare_yolo_class makes, we need to understand the structure of the YOLO output tensor. It is a square grid of a given size, and each grid cell has depth. One "depth slice" looks like this:
[p, x, y, w, h, class1...classN]
and the slices are stacked one after another, so the full depth looks like this:
[p, x, y, w, h, class1...classN][p, x, y, w, h, class1...classN][p, x, y, w, h, class1...classN]...[p, x, y, w, h, class1...classN]
Here p is the probability that there is something of interest within the cell, [x, y, w, h] are the box localization parameters, and they are followed by an array of per-class scores (each ranging from 0 to 1). (By the way, I have also seen alternative layouts in other YOLO-related repos.) So if we index into this YOLO "sausage" (haha) with, for example, [12,10,5,2], this 4D index hits the cell at (x12, y10) within the grid and, going into the depth, reaches the first class of the first slice, in the second batch (if you are not training with batches, you can drop the last entry). Hope this makes sense.
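The indexing above can be sketched as a flat-offset calculation. This is a minimal sketch under assumptions: a planar layout in which each depth entry occupies a contiguous grid_w * grid_h plane (suggested by the l.w * l.h stride in the call from the original question), and the entry order [p, x, y, w, h, classes] as written above. As far as I can tell, Darknet itself stores the box coordinates first and the objectness after them, so treat the ordering as illustrative. YoloShape, entries_per_slice and flat_index are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical description of one YOLO output tensor.
struct YoloShape {
    int grid_w;   // grid width in cells
    int grid_h;   // grid height in cells
    int boxes;    // number of stacked slices (boxes) per cell
    int classes;  // number of classes
};

// Entries in one slice: p, x, y, w, h plus one score per class.
inline int entries_per_slice(const YoloShape& s) { return 5 + s.classes; }

// Flat offset of the 4D index (x, y, depth, batch).
// depth counts entries across all stacked slices of a cell;
// batch is zero-based here.
inline std::size_t flat_index(const YoloShape& s, int x, int y,
                              int depth, int batch)
{
    const std::size_t plane = static_cast<std::size_t>(s.grid_w) * s.grid_h;
    const std::size_t per_image = plane * entries_per_slice(s) * s.boxes;
    return static_cast<std::size_t>(batch) * per_image
         + static_cast<std::size_t>(depth) * plane
         + static_cast<std::size_t>(y) * s.grid_w
         + static_cast<std::size_t>(x);
}
```

With depth 5 the offset lands on class1 of the first slice, because entries 0 to 4 are p, x, y, w, h.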

Now, here is a sort of pseudocode for what this repo does during training (gradient derivation), at least as I understood it:

- for each cell prediction (going through the depth of our YOLO output tensor cell)
    - look through the truth tensor and find the truth entry with the highest response from our network output
    - in other words, there can be multiple objects to predict, and we pick the one the network is already inclined toward from its initial randomized parameters
    - we perform the update (gradient backpropagation) ONLY relative to this specific truth, the one we found the network to be already inclined toward
    - if the network does not "sense" that anything is there, or the cell is empty, we do NOTHING
    - next comes the gradient derivation, of the form (truth[index] - output[index]) with certain adjustments (such as a special coordinate encoding, focal loss (a function that scales the class-related gradient), and intersection over union in the objectness evaluation)
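
The steps above can be sketched roughly as follows. This is a simplified illustration under assumptions, not the repository's implementation: boxes are given as center and size, "highest response" is taken to mean the highest plain IoU with the predicted box, and only the class part of the gradient (truth - output) is shown, without focal loss or the coordinate encoding. The names Box, iou, best_truth and class_delta, and the min_iou parameter, are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Box { float x, y, w, h; };  // center x, center y, width, height

// Overlap length of two 1-D intervals given by center and size.
static float overlap(float c1, float s1, float c2, float s2)
{
    const float left  = std::max(c1 - s1 / 2, c2 - s2 / 2);
    const float right = std::min(c1 + s1 / 2, c2 + s2 / 2);
    return right - left;
}

// Intersection over union of two boxes; 0 if they do not overlap.
float iou(const Box& a, const Box& b)
{
    const float w = overlap(a.x, a.w, b.x, b.w);
    const float h = overlap(a.y, a.h, b.y, b.h);
    if (w <= 0 || h <= 0) return 0.0f;
    const float inter = w * h;
    return inter / (a.w * a.h + b.w * b.h - inter);
}

// Index of the truth box that best matches pred, or -1 if no truth
// reaches min_iou (the "do NOTHING" case).
int best_truth(const Box& pred, const std::vector<Box>& truths, float min_iou)
{
    int best = -1;
    float best_iou = min_iou;
    for (std::size_t t = 0; t < truths.size(); ++t) {
        const float v = iou(pred, truths[t]);
        if (v >= best_iou) { best_iou = v; best = static_cast<int>(t); }
    }
    return best;
}

// Plain class gradient: truth - output, with a one-hot truth at class_id.
std::vector<float> class_delta(const std::vector<float>& scores, int class_id)
{
    std::vector<float> delta(scores.size());
    for (std::size_t j = 0; j < scores.size(); ++j) {
        const float truth = (static_cast<int>(j) == class_id) ? 1.0f : 0.0f;
        delta[j] = truth - scores[j];
    }
    return delta;
}
```

A caller would run best_truth for each prediction, skip the update when it returns -1, and otherwise apply class_delta using the class of the matched truth.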

It is important to emphasize that these "improvements" are there to help the network train better than the plain truth[index] - output[index] approach would. The network will still train and arrive at some local minimum of the error whether or not they are used.
Looking at this code you might scratch your head and ask what is going on. The reasoning behind this logic, weird at first glance, might be the following. The ratio of truth boxes to empty boxes tends toward zero; in other words, there are only a few items in each image, surrounded by nothingness. With a lot of images in the training set, the gradient would consist mostly of "nothingness-related" impulses, which would overpower everything else and push the parameter matrices to where they should not be. The structure of the "filled" cells would get lost, and the network would mostly learn how NOT to classify everything rather than how to classify something (again, I hope this makes sense).
And this brings us back to my original question. I think this is why the modification was made (removing the restriction to a specific class in compare_yolo_class): with the original check the matching space was too small and training was initially too slow. A freshly randomized network could only rarely localize anything well enough, so the output was rarely above the threshold and the backpropagated gradient was zero most of the time. So someone removed this restriction, enlarging the set of points from which we can backpropagate a gradient, which should lead to faster "catching up" to the objective function.
One last thing: as I said, whatever we actually do there, as long as we simply use truth - output it should "work"; all these complications arise from the unusual shape of the function we have to learn.
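To put a rough number on the imbalance argument, here is a small illustration. The grid size, the number of boxes per cell and the object count are made-up values for the example, not taken from the repository, and it assumes at most one matched prediction per object. positive_fraction is a hypothetical name.

```cpp
// Fraction of predictions that correspond to a real object, assuming
// each object is matched by at most one prediction.
double positive_fraction(int grid_w, int grid_h, int boxes_per_cell,
                         int objects)
{
    const double total =
        static_cast<double>(grid_w) * grid_h * boxes_per_cell;
    return total > 0 ? objects / total : 0.0;
}
```

For a 13 x 13 grid with 3 boxes per cell and 3 objects in the image, fewer than 1% of the 507 predictions are positives; everything else is "nothingness".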

hank-ai / darknet

the role of compare_yolo_class #73