ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv8 Keypoint 3D #14883

Open sheyteo opened 1 month ago

sheyteo commented 1 month ago

Search before asking

Question

Hey there, I was wondering whether it is possible to use the third coordinate of a keypoint as depth instead of visibility. I have generated data in the following layout: <class-index> <x> <y> <width> <height> <px1> <py1> <pz1> <px2> <py2> <pz2> ..., where pz is the depth scaled from 0 to 1, with 0 being the "closest" a keypoint can be to the observer and 1 the "furthest". The observer's environment and position are always the same, so the model should be able to learn the 3D aspect. I would like to know whether this is possible with YOLOv8-Pose or in some other way using YOLOv8.
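For illustration, a single label line in this layout (with made-up values and only two keypoints) would look like:

0 0.512 0.430 0.210 0.340 0.498 0.402 0.150 0.530 0.455 0.870

where 0.150 and 0.870 are the per-keypoint depth values in place of the usual visibility flag.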

I already tried to change the loss myself, but until now I've had no success.

Here is what I tried:

Edited the KeypointLoss in ultralytics/utils/loss.py (to add a loss term for the z-coordinate):

def forward(self, pred_kpts, gt_kpts, kpt_mask, area):
    d = (pred_kpts[..., 0] - gt_kpts[..., 0]) ** 2 + (pred_kpts[..., 1] - gt_kpts[..., 1]) ** 2 + (pred_kpts[..., 2] - gt_kpts[..., 2]) ** 2

and removed the clipping of the keypoints to 0 in ultralytics/engine/results.py:

#if keypoints.shape[2] == 3:  # x, y, conf
#    mask = keypoints[..., 2] < 0.5  # points with conf < 0.5 (not visible)
#    keypoints[..., :2][mask] = 0 

In ultralytics/utils/loss.py I additionally edited the keypoint selection mask from

#kpt_mask = gt_kpt[..., 2] != 0 if gt_kpt.shape[-1] == 3 else torch.full_like(gt_kpt[..., 0], True)

to

kpt_mask = torch.full_like(gt_kpt[..., 0], True)

Lastly, I tried to remove the kpts_obj_loss in ultralytics/utils/loss.py, as it is only used when keypoints have a third value:

if pred_kpt.shape[-1] == 3:
    kpts_obj_loss = self.bce_pose(pred_kpt[..., 2], kpt_mask.float())  # keypoint obj loss

But after training with all these modifications, the z-coordinate is still incorrect and always 0.5, even when predicting an image from the training dataset. Is it possible to generate 3D coordinates?

Additional

My objects are colored based on how far they are from the observer: closest to the observer = (green channel = 0) and furthest from the observer = (green channel = 255). The rest of the environment has (green channel = 0).
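So in principle the depth should be recoverable from the image alone; for a pixel on an object, roughly (assuming an 8-bit green channel):

z = green / 255.0  # green = green-channel value (0-255) of a pixel on the object; 0.0 = closest, 1.0 = furthest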


github-actions[bot] commented 1 month ago

👋 Hello @sheyteo, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

sheyteo commented 1 month ago

Thank you for your suggestions. I would like to understand the loss for the keypoints.

kpt_mask = torch.full_like(gt_kpt[..., 0], True)  # modified so keypoints with z = 0 (a valid depth) are not masked out
kpts_loss = self.keypoint_loss(pred_kpt, gt_kpt, kpt_mask, area)  # pose loss

if pred_kpt.shape[-1] == 3:
    kpts_obj_loss = self.bce_pose(pred_kpt[..., 2], kpt_mask.float())  # keypoint obj loss

Currently the x and y losses are calculated in the following function:

def forward(self, pred_kpts, gt_kpts, kpt_mask, area):
    """Calculates keypoint loss factor and Euclidean distance loss for predicted and actual keypoints."""
    d = (pred_kpts[..., 0] - gt_kpts[..., 0]).pow(2) + (pred_kpts[..., 1] - gt_kpts[..., 1]).pow(2)
    kpt_loss_factor = kpt_mask.shape[1] / (torch.sum(kpt_mask != 0, dim=1) + 1e-9)
    # e = d / (2 * (area * self.sigmas) ** 2 + 1e-9)  # from formula
    e = d / ((2 * self.sigmas).pow(2) * (area + 1e-9) * 2)  # from cocoeval
    return (kpt_loss_factor.view(-1, 1) * ((1 - torch.exp(-e)) * kpt_mask)).mean()

And the visibility/confidence of a keypoint is learned using the bce_pose loss:

kpts_obj_loss = self.bce_pose(pred_kpt[..., 2], kpt_mask.float())

Does calculating the loss for x, y, z keypoints require a fourth value for confidence?

Or is it fine to just omit the kpts_obj_loss?
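For example, if a fourth value were needed, I imagine the keypoint obj loss would just move to the last channel, something like this (purely hypothetical, not existing code):

if pred_kpt.shape[-1] == 4:  # x, y, z, conf
    kpts_obj_loss = self.bce_pose(pred_kpt[..., 3], kpt_mask.float())  # confidence now in channel 3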

Furthermore, I am curious whether the keypoint_loss can simply be extended to 3D keypoints, from d = (pred_kpts[..., 0] - gt_kpts[..., 0]).pow(2) + (pred_kpts[..., 1] - gt_kpts[..., 1]).pow(2) to d = (pred_kpts[..., 0] - gt_kpts[..., 0]).pow(2) + (pred_kpts[..., 1] - gt_kpts[..., 1]).pow(2) + (pred_kpts[..., 2] - gt_kpts[..., 2]).pow(2).
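Concretely, the extended version I have in mind would be the following sketch (everything except d is unchanged from the current function, and it reuses the same sigmas for the z term, which may not be appropriate):

def forward(self, pred_kpts, gt_kpts, kpt_mask, area):
    # squared distance over x, y and z (the z term is the only change)
    d = (
        (pred_kpts[..., 0] - gt_kpts[..., 0]).pow(2)
        + (pred_kpts[..., 1] - gt_kpts[..., 1]).pow(2)
        + (pred_kpts[..., 2] - gt_kpts[..., 2]).pow(2)
    )
    kpt_loss_factor = kpt_mask.shape[1] / (torch.sum(kpt_mask != 0, dim=1) + 1e-9)
    e = d / ((2 * self.sigmas).pow(2) * (area + 1e-9) * 2)  # from cocoeval
    return (kpt_loss_factor.view(-1, 1) * ((1 - torch.exp(-e)) * kpt_mask)).mean()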

I also want to mention that there are changes to the trainer and the augmentation process, to ensure valid images that contain enough information and do not cut out keypoints where visibility/z < 0.5.

sheyteo commented 1 month ago

I have now changed the loss function and tried to train the network multiple times. After training, the x and y coordinates are correct, but the z-value is always around 0.6. This is usually a sign that the model is not able to learn the z-aspect, so I investigated whether there was any problem (information loss) in preprocessing. But the information that is passed to the model should be sufficient to learn the z-coordinate.

The pose_loss is going down as usual, even when I log the loss only for the z-value:

d = (pred_kpts[..., 2] - gt_kpts[..., 2]).pow(2)  # z-only squared distance
kpt_loss_factor = kpt_mask.shape[1] / (torch.sum(kpt_mask != 0, dim=1) + 1e-9)
e = d / ((2 * self.sigmas).pow(2) * (area + 1e-9) * 2)  # from cocoeval
print((kpt_loss_factor.view(-1, 1) * ((1 - torch.exp(-e)) * kpt_mask)).mean().item())

Over the course of the training this value also converged to zero, so I thought the model was learning the z-value. But then I also logged (pred_kpts[..., 2] - gt_kpts[..., 2]).abs().mean(), which was around 0.3, so the model still doesn't learn it. So my questions here are: is the model architecture unable to propagate the color information, or could the loss still be flawed?

Additionally, I want to know what the line selected_keypoints /= stride_tensor.view(1, -1, 1, 1) in ultralytics/utils/loss.py (v8PoseLoss.calculate_keypoints_loss) does regarding the third coordinate. Before the division, the z-values of the selected keypoints are between 0 and 1, as in the dataset. Do I have to skip this division for the z-value, e.g. selected_keypoints[:, :, :, :2] /= stride_tensor.view(1, -1, 1, 1), or what does this line do?
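For clarity, the two variants I am comparing are (the comments reflect my current understanding and may be wrong):

selected_keypoints /= stride_tensor.view(1, -1, 1, 1)          # current: divides every channel, including z
selected_keypoints[..., :2] /= stride_tensor.view(1, -1, 1, 1)  # alternative: only scale x and y, keep z in its original 0-1 range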

Thank you so much for taking the time to respond to my long questions.