ultralytics / ultralytics

Ultralytics YOLO11 🚀
https://docs.ultralytics.com
GNU Affero General Public License v3.0

yolov8-pose predict #8771

Closed jwee1369 closed 3 months ago

jwee1369 commented 7 months ago

Search before asking

Question

Hi,

When I fine-tune a yolov8_pose model with the default imgsz=640 and predict at imgsz=640, everything works fine.

However, when I use the same trained model (imgsz=640) to predict at other sizes, such as a rectangular (1080, 1920) input, the predictions are thrown off: the predicted bounding boxes are really small.

This doesn't happen with the detect models like yolov8, where the training size is square but inference at a rectangular size such as (1080, 1920) still detects reasonably well.

Is there a normalization post-processing step that could be throwing this off for the pose models?

Thanks

Additional

No response

github-actions[bot] commented 7 months ago

👋 Hello @jwee1369, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 7 months ago

@jwee1369 hi there! 👋

It sounds like you're encountering an issue with predictions at different image sizes, especially with rectangular dimensions, using the YOLOv8 pose model. When training with a square image size and then predicting at a different aspect ratio, the model might not perform as expected due to the change in aspect ratio and the way features are scaled.

For pose models, maintaining the aspect ratio during inference is crucial since the model's accuracy is highly dependent on the spatial relationships between keypoints. If you're using rectangular images for inference, consider padding them to maintain the aspect ratio used during training. This way, the model's predictions should remain consistent.

Here's a quick example of how you might pad an image to maintain a square aspect ratio:

from PIL import Image

def pad_to_square(image_path):
    image = Image.open(image_path)
    max_dim = max(image.size)
    new_image = Image.new("RGB", (max_dim, max_dim), (128, 128, 128))
    new_image.paste(image, ((max_dim - image.size[0]) // 2,
                            (max_dim - image.size[1]) // 2))
    return new_image

# Example usage
image_path = 'path/to/your/image.jpg'
padded_image = pad_to_square(image_path)
padded_image.save('path/to/save/padded_image.jpg')

This code snippet creates a new square image with the original image centered and padded with a specific color (in this case, gray). You can then use this padded image for inference.
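
If it helps, here's a minimal sketch of running pose prediction on the padded image (the weights file and paths are placeholders):

from ultralytics import YOLO

# Load a pose model (placeholder weights path) and run prediction on the padded image
model = YOLO("yolov8n-pose.pt")
results = model.predict("path/to/save/padded_image.jpg", imgsz=640)  # keep imgsz consistent with training
annotated = results[0].plot()  # annotated image (boxes + keypoints) as a NumPy array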

Hope this helps! Let us know if you have any more questions. 😊

jwee1369 commented 7 months ago

Thanks for the info. Does this apply only to keypoint detection when training square and inferring rectangular, or is it also recommended to pad rectangular images during detect prediction (if the model is trained square)?

Correct me if I'm wrong, but I thought that when you train 'square' at 640x640 and your training images are (w, h) = (640, 360), they are automatically padded, so the model trains on padded rectangular images that end up 640x640. If that's the case, shouldn't predicting on a 640x640 input and a 640x360 input give the same results? Is it necessary to pad the 640x360 input?

glenn-jocher commented 7 months ago

@jwee1369 hey there! 😊

You're absolutely right! During training, if your dataset contains images with different aspect ratios, YOLOv8 automatically pads them to maintain the square shape (e.g., 640x640) you've specified. This ensures the model learns from consistently shaped inputs, even if the original images are rectangular.

For detection tasks, YOLOv8 is quite robust to changes in aspect ratio during inference. This means you can often get away without padding rectangular images to square, especially if the model was trained on a variety of aspect ratios. However, for keypoint detection tasks, maintaining the aspect ratio becomes more critical due to the precise nature of keypoints. Padding to preserve the training aspect ratio can help maintain accuracy.

In short, for detect predictions, padding isn't strictly necessary but can sometimes help, especially if your training data was very uniform in aspect ratio. For pose estimation tasks, padding to maintain the aspect ratio used during training is more often beneficial.

Here's a quick example for padding during inference, similar to the training phase:

from PIL import ImageOps

def pad_to_aspect_ratio(image, target_size=(640, 640)):
    # Fit the PIL image inside target_size (aspect ratio preserved), then pad the rest
    # with gray (114, 114, 114), similar to the letterbox padding used during training
    return ImageOps.pad(image, target_size, color=(114, 114, 114))

# Use this function to pad your PIL images before inference

Hope this clarifies things! Let me know if you have any more questions.

jwee1369 commented 6 months ago

@glenn-jocher thanks for the explanation; just to make sure I understand better:

Assume (h,w) = (1080, 1920)

  1. If rect=False, i.e. square images during training (since inputs are automatically padded up to 1920x1920), it's recommended to pad up to 1920x1920 during inference as well, as in training. Would this model also perform well on, say, 3840x2160 padded to 3840x3840, or on smaller cases like a padded 640x640?

  2. With rect=False, looking at the train_batch*.jpg images, I observe that the mosaic images stitched together are scaled down (aspect ratio maintained). a. Could this somehow affect performance? b. Is there a way to turn off the mosaic pre-processing? Say I just want a single 1920x1080 image padded with gray to 1920x1920 (instead of the multiple/partial stitching of the mosaic data).

Thanks for the support!

glenn-jocher commented 6 months ago

@jwee1369 hey there! 😊

  1. Yes, if you trained with rect=False (square images) and padded your images to a square size (e.g., 1920x1920), it's a good idea to pad your inference images in the same way. The model should still perform well on different sizes like 3840x3840 or 640x640, as long as you maintain the square shape by padding. The key is consistency between training and inference.

  2. Regarding your observations: a. The mosaic augmentation helps improve model robustness by exposing it to a variety of aspect ratios and scales during training. While it might seem like it could hurt performance due to the scaling, it generally enhances the model's ability to generalize. b. To disable mosaic augmentation, set mosaic=0 in your training arguments (or in your training cfg); it's a training hyperparameter rather than a data.yaml field. This way, you'll train with single, non-stitched images, padded to maintain the square shape. See the sketch below.
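
Here's a minimal sketch of a training call with mosaic disabled (the dataset YAML name and settings are placeholders):

from ultralytics import YOLO

# Train a pose model with mosaic augmentation disabled (mosaic=0)
model = YOLO("yolov8n-pose.pt")
model.train(data="my-pose-dataset.yaml", imgsz=1920, epochs=100, mosaic=0)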

Hope this helps clarify things! Let me know if you have any more questions.

jwee1369 commented 6 months ago

Hi @glenn-jocher ,

I've tried padding the input and performance seems to be about the same. Digging deeper into the issue I'm facing, the real problem seems to be poor box performance.

The Pose mAP50-95 is good (~0.995) while the Box mAP50-95 is bad (~0.35) after 100 epochs. Visualizing the predictions, the boxes are a lot smaller than the object and most of the keypoints fall outside the box (although the keypoints themselves are predicted accurately, close to the ground truth, which gives the high Pose mAP50-95).

What is odd is that I'm only having this issue with the yolov8n-pose (1920x1920) model; the yolov8n-pose (640x640) model performs well on both Pose and Box mAP50-95. Both train for 100 epochs from the Ultralytics pretrained weights. The other thing I noticed is that the Box mAP on the 8n 640p model converges quickly and starts at ~0.58, while the 8n 1920p model converges slowly and starts at ~0.15 (this is the opposite of the detect models, where training on larger images like 1920x1920 generally performs better).

Would appreciate if you have any experience with these observations

Thanks!

glenn-jocher commented 6 months ago

Hi @jwee1369,

Thanks for sharing your observations! 🤔 It's interesting that you're seeing such a discrepancy in box performance between the two image sizes. Since the keypoints are predicted accurately but the bounding boxes are not, it might suggest an imbalance in how the model is learning to prioritize keypoints over bounding boxes during training.

One approach to consider is adjusting the loss weights for the bounding boxes in the training configuration. This can help the model to pay more attention to improving box predictions. Another angle could be to experiment with different augmentation strategies that might help the model better understand the scale and positioning of objects, especially for the larger 1920x1920 model.
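
For example, here's a minimal sketch of raising the box loss gain (box defaults to 7.5 and pose to 12.0; the dataset YAML name is a placeholder):

from ultralytics import YOLO

# Increase the box loss gain so bbox regression gets more weight relative to the keypoint loss
model = YOLO("yolov8n-pose.pt")
model.train(data="my-pose-dataset.yaml", imgsz=1920, epochs=100, box=10.0)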

It's also worth noting that training at larger image sizes might require more epochs to converge fully, especially if the initial mAP is significantly lower. Increasing the number of epochs or adjusting the learning rate schedule could potentially help.

Lastly, ensure that your dataset is consistently annotated, as discrepancies in bounding box annotations could also lead to such issues.

Hope this gives you a few avenues to explore! Let me know how it goes. 😊

jwee1369 commented 6 months ago

Thanks for the suggestion @glenn-jocher. Regarding your point about being consistent with the annotation: when annotating keypoints, say I have an object with 2 keypoints and the frame only contains 1 keypoint (the other is out of frame), with 95% of the object showing (the remaining 5% contains the 2nd keypoint). Should the bounding box only cover the 1 visible keypoint (just enough, with a little padding; I'm using CVAT, which automatically adds a little padding around the keypoints)? Or should it cover the majority of the object (i.e., the visible 95%), even though that 1 keypoint only covers a small part of it?

Another example: say I want to detect the head and feet of a person. If the frame shows most of the body (with the head as the 1st keypoint) but not the feet, should my bounding box cover the whole visible body or just the head (with the first keypoint)? If the 2nd keypoint (feet) is visible, the bounding box naturally covers from head to feet (again, this is what the CVAT annotation tool does).

I have only been annotating my bounding boxes around the visible keypoints, not including other parts of the object (which would only be included if a keypoint were visible in that area), so this could be the issue.

Thanks!

glenn-jocher commented 6 months ago

@jwee1369 hi there! 😊

Great question! For keypoint-based tasks, it's generally best to have the bounding box cover the entire object as visible in the frame, even if not all keypoints are visible or within that bounding box. This applies to both of your examples: for an object with partially visible keypoints, the bounding box should ideally encompass the whole visible part of the object, not just the area around the visible keypoint(s).

So, for a person where only the head and upper body are visible (with the head being a keypoint), your bounding box should cover the entire visible portion of the body, not just the head. This approach helps the model learn more about the context of each keypoint, potentially improving both detection and pose estimation performance.

It sounds like the annotation strategy could indeed be a contributing factor to the issue you're experiencing. Adjusting your bounding boxes to span the entire visible part of the object might help improve the performance.

# Example: Ideal bounding box covering the entire visible object
bbox = [x_min, y_min, x_max, y_max]  # Covering whole visible area
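
For reference, a hypothetical YOLO pose label line illustrating the same idea (the numbers are made up; the format is class index, normalized xywh box, then x, y, visibility per keypoint):

# class  xc    yc    w     h     kp1x  kp1y  v1  kp2x  kp2y  v2
label_line = "0 0.50 0.55 0.90 0.80 0.42 0.20 2 0.00 0.00 0"  # 2nd keypoint out of frame: coords 0, visibility 0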

Hope this helps clear things up! Let me know if you have any more questions.

jwee1369 commented 6 months ago

Thanks @glenn-jocher, makes sense. The CVAT annotation tool was indeed throwing my annotations off, especially the bboxes. A side question: does the model first predict the bbox before the keypoints, so that if the bbox prediction is off it would consequently affect the keypoint predictions? Or does it predict the keypoints and bbox independently?

Thanks!

glenn-jocher commented 6 months ago

Hi @jwee1369!

Good question! YOLOv8-pose is a single-stage, bottom-up model, so it doesn't first crop a region from a predicted box: the bounding box and the keypoints are regressed in parallel from the same detection head at each location, sharing the same features. The two outputs are still coupled, though, since they share features and the box confidence and NMS decide which keypoint sets are kept. So inaccurate box predictions can degrade or suppress otherwise good keypoint predictions, even though the keypoint regression itself isn't conditioned on a box crop.
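
Here's a small sketch showing that each detection carries its box and keypoints side by side in the results object (weights and image path are placeholders):

from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")
result = model.predict("path/to/image.jpg", imgsz=640)[0]
print(result.boxes.xyxy)    # per-detection boxes as (x1, y1, x2, y2)
print(result.keypoints.xy)  # per-detection keypoints, shape (num_det, num_kpts, 2)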

Hope this answers your question! 😊

jwee1369 commented 6 months ago

Hi @glenn-jocher ,

Makes sense to have the bbox around the whole object even if some keypoints are not in frame.

After fixing the bboxes, my results appear to be about the same, with a similar anomaly: the 640x640 model has good pose and box mAP, while the larger 1920x1920 and rect 1920x1080 models have good pose mAP but bad box mAP.

Could there be a hyperparameter other than the box weight that could be modified to help with this? It seems like something changes in the background with the larger and square images.

[Attached screenshots: prediction visualizations at 640x640 and at 1920x1080]

glenn-jocher commented 6 months ago

@jwee1369 hi there! 😊

Glad to hear you've adjusted the bounding box annotations. Regarding the different performance between image sizes, it's a bit tricky. Besides adjusting the box weights, consider tuning the learning rate or experimenting with different augmentation strategies. Sometimes, higher-resolution training benefits from a slightly lower learning rate or different augmentations to grasp the finer details.

Another aspect to check is the distribution of your dataset: make sure the higher-resolution run is exposed to a similar variety of object scales and aspect ratios during training as the 640 run.

Keep exploring, and let me know how it goes!

jwee1369 commented 6 months ago

@glenn-jocher it turns out the problem was the reg_max value in the Detect module. Since my object spans the width of the image, at 640x640 the predicted bounding box can cover the object, but not at 1920.

  1. Currently I'm having to change self.reg_max in the code. I've looked at setting it in the YAML config file on the Detect line (#4693), but I'm getting the following error. How should I define reg_max in the model config.yaml file?

- [[15, 18, 21], 1, Detect, [nc, reg_max=20]] # Detect(P3, P4, P5)

m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
TypeError: __init__() takes from 1 to 3 positional arguments but 5 were given

  2. Changing reg_max also seems to increase the number of parameters significantly. Increasing it from 16 to 50 on the head of yolov8n-pose increased the parameters for that layer from 3M to 4.7M. Is there any way to increase reg_max without increasing the parameter count?

  3. Also, there is a comment next to self.reg_max = 16. Is self.reg_max modified for different model sizes? If so, how is that done? Or is reg_max supposed to be modified manually in the code?

DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)

Thanks

github-actions[bot] commented 5 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

glenn-jocher commented 5 months ago

Hi there! 😊

Great to hear you've pinpointed the issue to the reg_max value in the Detect module. Let's address your questions one by one:

  1. Defining reg_max in the YAML: To define reg_max in your model config.yaml file, you can directly include it within the Detect layer arguments like so:

    - [[15, 18, 21], 1, Detect, [nc, 20]] # the args must stay a single list; the last item (20) is the reg_max value

    However, note that the stock Detect.__init__ only accepts nc and ch, so the module code needs a small modification to accept and use a reg_max argument for this to work.

  2. Increasing reg_max without parameters blowup: Increasing reg_max indeed impacts the number of parameters due to the design of the detection mechanism. To mitigate this, consider whether all your objects necessitate a high reg_max, or if only specific instances do. Another strategy could be enhancing data preprocessing or augmentation rather than tweaking reg_max.

  3. reg_max across different models: reg_max is fixed at 16 for all of the standard n/s/m/l/x models; the comment just notes that ch[0] // 16 would give 4/8/12/16/20 if one were to scale it with model width. In practice, changing reg_max is a manual, experimental edit based on your specific needs and observed behavior; the quick check below shows why it matters for very wide objects.
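
For intuition on why reg_max matters for objects spanning the image: each box edge offset is represented by a DFL distribution over reg_max bins in units of the stride, so the largest offset a level can express is roughly (reg_max - 1) * stride pixels, and the largest box width about twice that. A quick back-of-the-envelope check (a sketch, assuming the stock strides of 8/16/32):

def max_box_width_px(reg_max, strides=(8, 16, 32)):
    # Largest representable box width per stride level: 2 * (reg_max - 1) * stride
    return {s: 2 * (reg_max - 1) * s for s in strides}

print(max_box_width_px(16))  # {8: 240, 16: 480, 32: 960} -> 960 px is less than 1920, but more than 640
print(max_box_width_px(50))  # {8: 784, 16: 1568, 32: 3136} -> wide enough for an object spanning 1920 px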

I hope these points steer you in the right direction! 🚀 If you have any further questions or need clarification, feel free to ask.

github-actions[bot] commented 4 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐