Closed yunshangyue71 closed 4 months ago
👋 Hello @yunshangyue71, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.
If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.
Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.
Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.
pip install ultralytics
YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
Hello! 😊 Great question about the Focal Loss implementation.
The current formulation of the Focal Loss in our code:
alpha_factor = label * alpha + (1 - label) * (1 - alpha)
loss *= alpha_factor
is indeed designed to address class imbalance by adjusting the weight given to positive vs. negative samples. Here, alpha is the hyperparameter that controls this weighting: positive samples are weighted by alpha and negative samples by 1 - alpha, so with a value below 0.5 (such as the default 0.25) the negative class receives the larger weight (there are usually many more negatives, especially in object detection tasks).
The alternative suggestion you made:
alpha_factor = label * (1-alpha) + (1 - label) * alpha
would actually reverse the emphasis, giving more weight to the positive samples (which could be useful in cases where positives are very rare or you want to focus more on them).
Both formulations can be acceptable depending on your specific task, dataset, and what you aim to achieve (e.g., balancing classes or emphasizing detection of rare objects). The key is to experiment with alpha and observe its impact on your model's performance, adjusting it according to your needs.
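For illustration, here is a minimal toy sketch (not the Ultralytics source) showing how the two conventions weight a batch of one positive and three negative samples with alpha = 0.25:

import torch

label = torch.tensor([1.0, 0.0, 0.0, 0.0])   # 1 positive, 3 negatives
alpha = 0.25

# Convention used in the code: positives weighted by alpha, negatives by 1 - alpha
alpha_factor_code = label * alpha + (1 - label) * (1 - alpha)
print(alpha_factor_code)   # tensor([0.2500, 0.7500, 0.7500, 0.7500])

# Suggested alternative: positives weighted by 1 - alpha, negatives by alpha
alpha_factor_alt = label * (1 - alpha) + (1 - label) * alpha
print(alpha_factor_alt)    # tensor([0.7500, 0.2500, 0.2500, 0.2500])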
Hope this helps! If there's anything else we can assist with, feel free to ask.
Hello! Thank you for your patient reply, but I still doubt the rationality of your approach. Let us reason about it.
alpha_factor = label * 0.25 + (1 - label) * (1 - 0.25)
alpha_factor = label * 0.25 + (1 - label) * 0.75
alpha_factor = label * 1 + (1 - label) * 3 (after scaling both terms by 4)
For comparison, plain cross entropy corresponds to alpha_factor = label * 1 + (1 - label) * 1.
Suppose we have 10,000 negative samples and 1,000 positive samples. Compared with cross entropy, YOLOv8's method is equivalent to duplicating the negative samples twice (taking images as an example, naming the copies copy1 and copy2) and adding them to the dataset. I think doing this makes the positive and negative samples even more imbalanced.
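To make the duplication analogy concrete, here is a small hypothetical check, assuming every sample contributes the same per-sample loss:

n_pos, n_neg = 1_000, 10_000

# alpha = 0.25 weighting: positives weighted 0.25, negatives 0.75, i.e. a 1:3 ratio
weighted_neg_share = (n_neg * 0.75) / (n_pos * 0.25 + n_neg * 0.75)

# Plain cross entropy on a dataset where the negatives are duplicated twice
# (original + copy1 + copy2)
duplicated_neg_share = (n_neg * 3) / (n_pos + n_neg * 3)

print(weighted_neg_share, duplicated_neg_share)   # both 0.9677..., the negative share is identical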
Hello! 👋 It's great to see you're diving deep into understanding the intricacies of the Focal Loss implementation for YOLOv8, and your questions are indeed thought-provoking!
You're correct in highlighting how the alpha factor contributes to balancing the model's learning from positive vs. negative samples, especially given the class imbalance typical in detection tasks. The choice of alpha=0.25 is borrowed from earlier work (the original Focal Loss paper), where it is tuned together with the modulating factor to balance the contribution of the two classes.
The transformation you suggested, multiplying the alpha_factor by a coefficient (e.g., 4 in your example) to make the positive and negative terms whole numbers, conceptually doesn't change the essence of the loss; it simply scales the loss value for both classes. The relative weight (or emphasis) placed on the positive vs. negative samples remains the same.
Regarding the analogy with copying negative samples, it's a bit more nuanced. The Focal Loss doesn't exactly duplicate negative samples but rather scales the contribution of each sample to the loss based on its classification difficulty, as determined by the model's output probability for the true class. This way, "easy" negatives (with high model confidence) contribute less to the loss, allowing the model to focus more on "hard" examples and positives.
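For instance, a rough illustration of the modulating factor (1 - p_t) ** gamma (values chosen only for demonstration):

import torch

gamma = 2.0
p_t_easy_neg = torch.tensor(0.95)   # model is already confident this is background
p_t_hard_neg = torch.tensor(0.30)   # model struggles with this sample

print((1 - p_t_easy_neg) ** gamma)  # tensor(0.0025) -> easy negative is almost ignored
print((1 - p_t_hard_neg) ** gamma)  # tensor(0.4900) -> hard negative still contributes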
Your curiosity and willingness to question and understand deeply are much appreciated! 🙌 Remember, the choice of loss functions and their parameters often comes down to empirical findings and the specific nature of the dataset and task at hand.
Keep the great questions coming! If there's anything more we can help clarify, we're all ears. Happy coding!
Sorry, I disagree with what you said ("it's a bit more nuanced"). Let me elaborate on the above example in a little more detail.
2k positive, 6k negative, batch size = 120, so a batch consists of 30 positive and 90 negative samples. We assume that the CNN encounters a batch as described below.
The cross-entropy alpha factor is
alpha_factor = label * 1 + (1 - label) * 1
The first part is the contribution of the 30 positive samples, and the second part is the contribution of the 90 negative samples. Although the whole formula evaluates to 1, nothing prevents us from regarding cross entropy as a special case of focal loss.
The negatives make up 90/(30 + 90) = 90/120 of the loss.
alpha = 0.25
alpha_factor = label * 0.25 + (1 - label) * (1-0.25)
==> label * 1 + (1 - label) * 3 (after scaling by 4)
The negatives make up (90 * 3)/(30 + 90 * 3) = 270/300 = 90/100 of the loss.
The contribution of the first part still comes from the 30 positive samples and the second part from the 90 negative samples, but the weight of each negative sample is now 3 times that of a positive sample.
2k positive, 6k negative, 6k negative copy1, 6k negative copy2
batch size = 120
So a batch consists of 120 * 2/(2 + 6 + 6 + 6) = 12 positive samples and 120 * 6/(2 + 6 + 6 + 6) = 36 negative, 36 negative copy1, and 36 negative copy2 samples.
The cross-entropy alpha factor is
alpha_factor = label * 1 + (1 - label) * 1
The first part is the contribution of the 12 positive samples, and the second part is the contribution of 36 * 3 = 108 negative samples.
The negatives make up (36 * 3)/120 = 108/120 = 90/100 of the loss.
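The fractions above can be verified with a quick hypothetical calculation (assuming every sample contributes the same per-sample loss):

# Case 1: plain cross entropy, 30 positives and 90 negatives per batch
print(90 / (30 + 90))            # 0.75 -> 90/120

# Case 2: alpha = 0.25, i.e. negatives weighted 3x relative to positives
print((90 * 3) / (30 + 90 * 3))  # 0.9  -> 270/300

# Case 3: plain cross entropy after duplicating the negatives twice:
# 12 positives and 36 * 3 = 108 negatives per batch of 120
print((36 * 3) / 120)            # 0.9  -> 108/120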
Because alpha is a fixed value, its effect can simply be viewed as duplicating the samples of one class of the imbalanced dataset.
For a classification task this means copying images; for an object detection task it means simply copying the negatives or positives.
This is the reason why I object to your statement that "it's a bit more nuanced."
If the losses of the positive and negative samples above are not equal, the computed final values will not be exactly equal either, but the proportion of the loss contributed by the negative samples will definitely increase. This weighting is only suitable when the loss caused by the small number of positive samples is greater than the loss caused by the large number of negative samples. I took a look at YOLOv8's anchor assignment; maybe I didn't look carefully enough, but I found that all negative samples are used when calculating the loss, so the condition above does not hold. I think YOLO is an excellent project, so even with this small glitch the overall project still works great.
If there are many negative samples, the network will tend to learn negative samples, so the recall will be low. If there are many positive samples, the network will tend to learn positive samples, so the precision will be low.
Hey there! 👋 Thank you for diving deep into the Focal Loss discussion and sharing your detailed analysis. It's always great to see engaged and insightful reasoning!
You make an interesting point comparing the effect of the alpha factor in Focal Loss to simply copying samples. It's true that the alpha factor can disproportionately increase the weight of negative samples, effectively simulating an increase in their presence. This mechanism aims to address class imbalance by adjusting how much each class contributes to the loss.
However, it differs from simply copying negative samples. The goal of Focal Loss, with its scaling factors (alpha and the modulating factor (1 - p_t)^gamma), is to dynamically adjust the emphasis on hard-to-classify examples rather than statically increasing the dataset's size. This adjustment is designed to make the network focus more on the examples it currently struggles with, whether positive or negative.
Regarding the choice of alpha=0.25, it's more of a starting point based on empirical evidence from the original Focal Loss paper and is indeed task-dependent. Our goal at Ultralytics is to provide a strong baseline, which users can tune according to their specific datasets and objectives.
Your point about the potential shift towards learning more from negative samples due to the chosen alpha is valid. It's one of those areas where the art of model tuning comes into play 🎨. Finding that sweet spot between recall and precision, influenced by the positive and negative sample ratios, is key to tailoring YOLOv8 to specific tasks.
Thanks again for your analysis and kind words about YOLO! Your contributions help us all think more critically about these mechanisms. Let's keep the dialogue open and continue learning from each other. Happy experimenting! 🚀
Thanks for the outstanding work on YOLOv8 and your patient replies. I hope YOLO gets better and better.
Thank you so much for your kind words and support! 😊 We're thrilled to have such an engaged and positive community. If you have any more questions or need further assistance, feel free to reach out. Here's to making YOLO even better, together! 🚀
If you managed to use the focal loss function in Yolov8, can you tell me how you did it? For my project, I need to assign different alpha values according to the class number.
Thank you for your kind words! We're glad you're exploring YOLOv8. 😊
To use the Focal Loss function in YOLOv8 with different alpha values for each class, you would typically modify the loss function within the codebase to accept a list of alpha values corresponding to each class.
Here’s a brief example of how you might implement this:
Define a list of alpha values:
alpha_list = [0.25, 0.5, 0.75] # Example alpha values for three classes
Modify the focal loss calculation in your model’s loss function:
# assuming labels is a one-hot target tensor of shape (N, num_classes)
alpha = torch.tensor(alpha_list, device=labels.device)       # shape (num_classes,)
alpha_factor = labels * alpha + (1 - labels) * (1 - alpha)   # broadcasts over the batch
loss *= alpha_factor
Please make sure the length of alpha_list matches the number of classes in your dataset.
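As a rough usage sketch (the tensor names and shapes here are illustrative, not the actual loss.py variables; labels are assumed to be one-hot with shape (N, num_classes)):

import torch

alpha_list = [0.25, 0.5, 0.75]                    # one alpha per class
labels = torch.tensor([[1., 0., 0.],
                       [0., 0., 1.]])             # one-hot targets, shape (2, 3)
loss = torch.rand(2, 3)                           # placeholder per-element loss

alpha = torch.tensor(alpha_list)                  # shape (3,), broadcasts over the batch
alpha_factor = labels * alpha + (1 - labels) * (1 - alpha)
loss = loss * alpha_factor
print(alpha_factor)   # per-element weights, e.g. [[0.25, 0.50, 0.25], [0.75, 0.50, 0.75]]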
Let us know if you have more questions or run into any issues! Happy coding! 🚀
But first, can you give me an example of how to use the focal loss function instead of BCE?
Generally, when the network predicts 3 categories it outputs 3 class scores, usually followed by softmax, and the target is one-hot encoded. So each output can use BCE, and Focal Loss can then be applied on top of it. The multiple alpha parameters mentioned above are also possible.
Can you give me example code? And which lines do you modify in the loss.py file?
@ridvanozdemir absolutely! For using Focal Loss with YOLOv8 when you have 3 class scores that are onehot encoded, you'd modify the loss calculation to incorporate Focal Loss instead of BCE. Here’s a quick example:
Assuming you have a loss.py file, find the section where the classification loss is computed (often using BCE), and you can replace it with something like this:
import torch
import torch.nn.functional as F

def focal_loss(inputs, targets, alpha, gamma=2):
    # inputs: raw logits; targets: one-hot encoded labels of the same shape
    BCE_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
    pt = torch.exp(-BCE_loss)  # probability of the true class; prevents nans when probability is 0
    F_loss = alpha * (1 - pt) ** gamma * BCE_loss
    return F_loss.mean()
alpha = torch.tensor([0.25, 0.5, 0.75])  # per-class alpha values, broadcast over the class dimension
gamma = 2  # focusing parameter
loss = focal_loss(predictions, targets, alpha, gamma)
In your loss.py, just swap out where BCE is called with a call to focal_loss(). Make sure predictions are your model's logits and targets are your one-hot encoded classes.
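As a quick self-contained smoke test of the sketch above, with dummy logits and one-hot targets (shapes and values are illustrative only):

import torch

predictions = torch.randn(4, 3)                     # raw logits for 4 samples, 3 classes
targets = torch.eye(3)[torch.tensor([0, 2, 1, 0])]  # one-hot encoded classes
alpha = torch.tensor([0.25, 0.5, 0.75])             # per-class alpha, broadcasts over the batch

print(focal_loss(predictions, targets, alpha, gamma=2))  # scalar loss value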
Let me know if you need more details on this or anything else! 🚀
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
Search before asking
Question
In your code, Focal Loss is written like this:
alpha_factor = label * alpha + (1 - label) * (1 - alpha)
loss *= alpha_factor
Usually there are many more negative samples than positive ones, such as the candidate boxes in object detection tasks. Is the code doing this wrong? Should the proportion of positive samples be enlarged instead, i.e.:
alpha_factor = label * (1 - alpha) + (1 - label) * alpha
loss *= alpha_factor
Additional
No response