👋 Hello @xingguang12, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.
If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.
Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.
Pip install the `ultralytics` package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

```bash
pip install ultralytics
```
YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@xingguang12 hello! Addressing class imbalance is indeed crucial for improving mAP on your dataset. Here are some concise tips:
- Aim for a minimum of a few hundred instances per class after augmentation to give the model a fair chance to learn. There's no strict rule, but more variety generally helps.
- Split your dataset before augmentation to maintain a valid representation of the original distribution in your validation set. Having very few instances in the test set can lead to high variance in mAP for those classes, so try to keep a reasonable number in the test set as well.
- Data augmentation is a common strategy for dealing with class imbalance. Consider using a mix of geometric and photometric augmentations to introduce diversity (see the sketch after this list). Also, explore techniques like weighted loss functions to give more importance to underrepresented classes during training.
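If it helps, here is a minimal offline-augmentation sketch using Albumentations; the library choice, the parameter values, and the file layout are illustrative assumptions, not an official Ultralytics recipe:

```python
# Minimal offline-augmentation sketch for minority classes; parameter
# values and file layout here are illustrative assumptions.
import cv2
import albumentations as A

# Flip + photometric transforms; BboxParams keeps YOLO-format boxes in sync.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20, val_shift_limit=10, p=0.5),
        A.GaussNoise(p=0.3),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

def augment_once(image_path: str, label_path: str, out_stem: str) -> None:
    """Apply the pipeline to one YOLO-labelled image and write an augmented copy."""
    image = cv2.imread(image_path)
    boxes, labels = [], []
    with open(label_path) as f:
        for line in f:
            cls, *xywh = line.split()
            boxes.append([float(v) for v in xywh])
            labels.append(int(cls))
    out = transform(image=image, bboxes=boxes, class_labels=labels)
    cv2.imwrite(f"{out_stem}.jpg", out["image"])
    with open(f"{out_stem}.txt", "w") as f:
        for (x, y, w, h), cls in zip(out["bboxes"], out["class_labels"]):
            f.write(f"{cls} {x:.6f} {y:.6f} {w:.6f} {h:.6f}\n")
```

Keep in mind that YOLOv8 already applies online augmentations (mosaic, HSV shifts, flips) during training, so offline copies are mainly useful for rebalancing per-class counts rather than adding photometric variety.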
For more detailed guidance on these topics, you might find our documentation on Train and Val modes helpful. Good luck with your training! 🚀
I appreciate your thanks! Now, let's address your questions regarding data augmentation in the context of your imbalanced dataset:
Background: My dataset has an extremely imbalanced class distribution. I plan to split the dataset (train:val:test = 7:2:1, totaling 10,000 images) and then perform data augmentation.

1. For classes with very few instances, should I simply delete them? For example, class "p" has 30 instances; after splitting the dataset, the validation set would contain only 6 and the test set only 3, so evaluating the training effectiveness for class "p" would be subject to significant randomness. For classes with fewer than 10 instances, the test set might contain only one image with that class, or none at all. Should I therefore delete classes below a threshold (tentatively 70 instances) and focus on training the classes with relatively more instances?
2. Following the principle of splitting the dataset before augmentation, should I augment the minority classes in the training set so that each has at least 100 instances? Is it unnecessary to balance all classes to a similar quantity, as long as no class has fewer than 100 training instances? My dataset's distribution is shown in the graph, where most classes have few instances.
3. Common data augmentation techniques include random flipping, salt-and-pepper noise, brightness adjustment, color-space distortion, and combinations of these. Do you have any recommended augmentation methods and specific parameter values?
4. For classes with an awkward instance count, such as 40, can I randomly allocate 30 instances to the validation and test sets in a 2:1 ratio (to avoid large mAP variance for that class), and then augment the remaining 10 training instances up to a total of 100? I understand this may disrupt the original distribution of the validation and test sets, but it is done to better detect this class. Is this method feasible?
5. For certain special classes, can I use a replacement method (i.e., replacing the object in an image with icons from other classes, as shown in the image below) to augment instances of these classes? For such classes, can I place all real instances in the validation and test sets, while the training instances are composed entirely of replacements pasted into images of other classes?
6. Some classes exhibit strong similarities, as shown in the image below. Does labeling these classes collectively as one category have any impact on training?

I'm especially looking forward to your response. Thank you very much in advance.
Hello @xingguang12, let's tackle your concerns.
Remember, the key is to maintain the integrity of your validation and test sets while using augmentation to improve the training set. Good luck! π
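To make the split-before-augment workflow concrete, here is a minimal sketch; the directory layout, the fixed seed, and the 100-instance floor are illustrative assumptions taken from the discussion above:

```python
# Hypothetical sketch of the split-before-augment workflow for a YOLO dataset.
# Directory layout, seed, and the 100-instance floor are assumptions.
import random
from collections import Counter
from pathlib import Path

random.seed(0)
labels = sorted(Path("labels").glob("*.txt"))  # one YOLO .txt per image
random.shuffle(labels)

# 1) Split BEFORE any augmentation so val/test keep the original distribution.
n = len(labels)
train = labels[: int(0.7 * n)]
val = labels[int(0.7 * n) : int(0.9 * n)]
test = labels[int(0.9 * n) :]
print(f"train={len(train)} val={len(val)} test={len(test)}")

# 2) Count per-class instances in the training split only.
counts = Counter()
for lbl in train:
    for line in lbl.read_text().splitlines():
        counts[int(line.split()[0])] += 1

# 3) Classes below the floor are candidates for offline augmentation
#    (e.g. with the Albumentations pipeline sketched earlier in the thread).
minority = sorted(c for c, k in counts.items() if k < 100)
print("classes below the 100-instance floor:", minority)
```

For classes this rare, a stratified split that guarantees at least a few instances of every class in each subset is usually preferable to a purely random shuffle.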
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
Search before asking
Question
When training my own dataset with YOLOv8, I encountered some issues and would like to seek your advice. My dataset has an extremely imbalanced class distribution: some classes have fewer than 10 instances, while others have a few thousand. Currently, the classes with fewer instances yield poor training results, and I am considering offline data augmentation to expand my dataset (it is challenging to find more images of those underrepresented classes online). However, I have a few questions:
Additional
No response