ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
24.46k stars 4.86k forks source link

YOLOv8 for RGBD data #3432

Open ZuzannaChalupka opened 11 months ago

ZuzannaChalupka commented 11 months ago

Search before asking

Question

Hello Joche, I have a question. I would like to use Yolov8 for RGBD data, but don't know how to exactly do that. I added a line: "ch: 4" in my yaml file, but when I wanted to use my model for RGBD photo I have this error: "RuntimeError: Given groups=1, weight of size [16, 3, 3, 3], expected input[1, 4, 480, 640] to have 3 channels, but got 4 channels instead.". Can you help me?

Additional

No response

github-actions[bot] commented 11 months ago

👋 Hello @zuza123, thank you for your interest in YOLOv8 🚀! We recommend a visit to the YOLOv8 Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.7 environment with PyTorch>=1.7.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 11 months ago

@zuza123 hi there! Thank you for reaching out with your question about using YOLOv8 for RGBD data. While YOLOv8 is primarily designed for RGB data, it is possible to modify it to work with RGBD data.

The error you encountered is because the YOLOv8 model is expecting input with 3 channels (for RGB data) but received input with 4 channels (for RGBD data). To use YOLOv8 with RGBD data, you will need to modify the model architecture by adjusting the number of input channels in the model's first layer.

To do this, you can open the YOLOv8 model architecture file and modify the number of input channels from 3 to 4. Once you have made this change, you should be able to use the modified YOLOv8 model for RGBD data.

I hope this explanation helps! If you have any further questions, please feel free to ask.

ZuzannaChalupka commented 10 months ago

Hi, thank you for replying. I tried to find YOLOv8 model architecture file but unfortunately, I couldn't. May I ask to link to this file? Thank you in advance

glenn-jocher commented 10 months ago

@zuza123 hi there! Thank you for reaching out. The YOLOv8 model architecture file can be found in the YOLOv8 repository. You can navigate to the repository and locate the file named yolov8.py or a similar file that contains the YOLOv8 model architecture.

Once you have found the file, you can open it and make the necessary modifications to the number of input channels in the first layer to accommodate RGBD data.

If you have any further questions or need any additional assistance, please feel free to ask. I'm here to help!

ammar-deep commented 10 months ago

Considering the below model architecture file, where exactly should we change the input channels from 3 to 4?

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]  # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPs
  s: [0.33, 0.50, 1024]  # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPs
  m: [0.67, 0.75, 768]   # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPs
  l: [1.00, 1.00, 512]   # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512]   # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]]  # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]]  # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]]  # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]]  # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]]  # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]]  # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 6], 1, Concat, [1]]  # cat backbone P4
  - [-1, 3, C2f, [512]]  # 12

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 4], 1, Concat, [1]]  # cat backbone P3
  - [-1, 3, C2f, [256]]  # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]]  # cat head P4
  - [-1, 3, C2f, [512]]  # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]]  # cat head P5
  - [-1, 3, C2f, [1024]]  # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]]  # Detect(P3, P4, P5)
glenn-jocher commented 10 months ago

@ammar-deep to change the number of input channels from 3 to 4 in this YOLOv8 model architecture, you need to make modifications in two places:

  1. In the backbone section:

    • Find the line [-1, 1, Conv, [64, 3, 2]] which represents the first convolutional layer in the backbone.
    • Change 3 to 4 in the arguments of this line. The modified line would be [-1, 1, Conv, [64, 4, 2]].
  2. In the head section:

    • Find the line [-1, 1, Conv, [256, 3, 2]].
    • Change 3 to 4 in the arguments of this line. The modified line would be [-1, 1, Conv, [256, 4, 2]].

After making these changes, the YOLOv8 model will be able to accept input with 4 channels, which can be used for RGBD data.

Please note that these modifications assume that the rest of the model and the associated code have been appropriately modified to handle RGBD data.

ammar-deep commented 10 months ago

@glenn-jocher Aint 3 in [-1, 1, Conv, [64, 3, 2]] represent the kernel size and not the input channels?

panchasan commented 10 months ago

@glenn-jocher hi, I modified the yaml file, as you proposed and got an errror: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 17 but got size 16 for tensor number 1 in the list. You also wrote about changes in the model architecture, is there another file except yolov8n.yaml that users can modify?

panchasan commented 10 months ago

After a few tests, I found out, that there is another problem. I have 4 channel images, but it looks like only three channels go to the model. When I read 4 channel image with OpenCV, I got 3 channels. I have to use cv2.IMREAD_UNCHANGED to get 4 channel image

glenn-jocher commented 10 months ago

@panchasan, you're correct. By default, OpenCV reads images in the RGB format, which includes three channels. If you want to read a 4 channel image (such as RGBD or RGBA), you must indeed use the cv2.IMREAD_UNCHANGED flag. This tells OpenCV to keep the original color format of the image, including any alpha or depth channel(s).

Remember that, in addition to using this flag when reading the image, you must ensure that your model is configured to accept 4 channel input, which includes modifying the YOLOv8 model architecture to receive a 4 channel input instead of the usual 3.

Please ensure that these changes are reflected not only in the reading of the image file, but also in the corresponding pre-processing, data augmentation, and model training parts. Otherwise, the model may still malfunction or not perform as expected.

Let me know how that works for you, and don't hesitate to follow up with more details or questions!

panchasan commented 10 months ago

Hi, I think I made changes that let YOLOv8 work with 4 channels. Firstly, changes in yaml file that @glenn-jocher proposed give me an error that I have mentioned before. Changes I made:

glenn-jocher commented 10 months ago

Hi @panchasan, great work on customizing YOLOv8 for 4 channels! Your approach of updating the number of channels in the YAML file, reading the image with the cv2.IMREAD_UNCHANGED flag, modifying the transforms and adjusting the image plot seams on point.

There are a couple points you need to keep in mind though:

  1. Changing the input channels to 4 in the YAML configuration file is only part of the work. As well as this, you need to make sure that the model architecture is updated to actually accept 4 channel inputs. Otherwise, the model would still expect 3-channel images, which would produce errors during training or inference.
  2. When modifying parts of the code, it is critical not to break other functionalities. Hence, please ensure that the data transformations and image saving still work as expected, not only for 4-channel images but also for standard 3-channel ones.
  3. Since updating the model could affect its overall performance, I would like to suggest re-checking all of the model performance measurements to ensure they remain accurate.

I like your approach and it seems you're on the right track. Keep in mind testing on various datasets and tweaking your changes as necessary to ensure accurate and robust results. Good luck with your future work with 4-channel data on YOLOv8!

Guydada commented 10 months ago

@glenn-jocher hi, I modified the yaml file, as you proposed and got an errror: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 17 but got size 16 for tensor number 1 in the list. You also wrote about changes in the model architecture, is there another file except yolov8n.yaml that users can modify?

Did you manage to solve this error ? I am getting it too

panchasan commented 10 months ago

@Guydada no, I did not change these elements. In the yaml file I just added the line ch: 4.

panchasan commented 10 months ago

Hi, I think I made changes that let YOLOv8 work with 4 channels. Firstly, changes in yaml file that @glenn-jocher proposed give me an error that I have mentioned before. Changes I made:

  • .yaml I added -> ch: 4
  • in every file in ultralytics/data, I changed cv2.imread('photo.png') to cv2.imread('photo.png', cv2.IMREAD_UNCHANGED)
  • in file in ultralytics/data/base.py i changed function def build_transforms(self, hyp=None) by adding return Compose([ToTensor()])
  • I changed file plotting.py in the same folder as before. The section I changed was #Build Image where I modified lines below (G, B, D, R) = cv2.split(im) merged = cv2.merge([R, G, B]) mosaic[y:y + h, x:x + w, :] = merged In this section, the split is specified in my case. I am not sure if it was all changes I made because it was a little chaotic. The biggest change was ch: 4, adding cv2.IMREAD_UNCHANGED. Then I blocked all transforms and in the end, I fixed saving images to visualize prediction.

I made those changes one more time and collect all changes on the pictures below: Zrzut ekranu z 2023-07-19 17-40-19 Zrzut ekranu z 2023-07-19 17-32-51 Zrzut ekranu z 2023-07-19 17-28-05 Zrzut ekranu z 2023-07-19 17-27-57 Zrzut ekranu z 2023-07-19 17-23-19 Zrzut ekranu z 2023-07-19 17-19-39 Zrzut ekranu z 2023-07-19 17-06-55

glenn-jocher commented 10 months ago

@panchasan hello, thanks for documenting your changes here for us. The modifications you've made so far seem to be on track for handling 4-channel images with YOLOv8. Incorporating the depth channel (or any other additional channel) into a model traditionally designed for 3 channels (RGB) can appear chaotic, but your steps appear to be well-thought out.

  1. You've updated the number of channels in the YAML file.
  2. You've changed the method of reading images in the dataset to include the 4th channel.
  3. You've redefined the build_transforms() function to account for 4-channels in the transformations that'll be applied to your input data.
  4. You adapted the image plotting function to accommodate the 4th channel in the image splits

It seems like the key changes for supporting 4-channel input have been addressed. You're right in saying the major alteration boils down to changing the 'ch' parameter and reading in the image using 'IMREAD_UNCHANGED'. After this there are just alterations in handling the image data (like in transforms and plotting) that cater to this 4-channel input.

It's great that you've also checked the transformations and the visualization of the input. These steps are often overlooked, but they play a crucial part in effectively training and evaluating your model.

One final word - it's crucial to revaluate your model's performance as changing the number of input channels could affect it. Also be ready to fine-tune your model accordingly.

Keep it up!

lesreaper commented 10 months ago

Does anyone have a code commit with the full changes to access 4-channels?

@panchasan @Guydada @glenn-jocher

glenn-jocher commented 10 months ago

@lesreaper hello! I'm afraid we don't currently have a specific code commit that shows the full implementation for 4-channel input. However, I can outline the steps that were discussed for reference:

  1. Update the 'ch' parameter in the YAML configuration file to 4, to signify the additional channel.
  2. Use the 'cv2.IMREAD_UNCHANGED' flag when reading the images. This ensures that OpenCV retains all channels in the image.
  3. Modify the 'build_transforms()' function so it correctly handles 4-channel inputs.
  4. Adjust the method of splitting and merging the image channels in the plotting feature to cater for the 4-channel input.

It's important to stress that you should carefully test each of these steps to ensure that they work correctly with your specific dataset. Remember also to monitor and potentially adjust your model's performance, since adding an extra channel can have an impact on this.

Note that these steps were only discussed as a potential solution for handling a 4-channel input, and haven't been thoroughly tested in the context of YOLOv8. They should provide a starting point, but further adaptations may be needed to get the model working optimally with 4-channel data.

Good luck with your modifications!

lesreaper commented 10 months ago

Thank you!

We're in the middle of testing this out. Since our 4th channel with alter the semantic meaning of previous images, are there good instructions on training a system from scratch for something like pose if we're using the 4 channels?

[EDIT]: It looks like we have our training done (perhaps). I attempt to run the predict model using the new weights from the run folder, either using Python or the CLI, and I get this error:

RuntimeError: Given groups=1, weight of size [16, 4, 3, 3], expected input[1, 3, 640, 640] to have 4 channels, but got 3 channels instead

I'm not sure where I'm missing a channel or something in the predictors. Is there a place we missed if we hit all the issues above @glenn-jocher ?

glenn-jocher commented 10 months ago

@lesreaper hello,

Glad to hear that you're making progress on implementing 4-channel support in YOLOv8.

  1. For training a system like pose using 4-channels, the process wouldn't be much different from your standard training process. The only difference is going to be your input data, which now has an extra channel. Make sure your dataset's images are of the form (width, height, 4) and your labels are correctly generated for your pose estimation task.

  2. Regarding the runtime error: It seems like when you're running prediction, you're still using 3-channel images. The error indicates that the model's convolutional weights were trained expecting 4-channels (which is correct based on your modifications), but the input being supplied during prediction time has only 3 channels. You need to ensure that your prediction images also have 4 channels (similar to the training set).

Check where you're loading your data for prediction and ensure that those images also have an extra (fourth) channel, much like your training images.

I hope this helps, and best of luck with your project!

lesreaper commented 10 months ago

Hi @glenn-jocher , I'm still not able to to get the images to work properly.

  1. All the labeling and pre-checks look good. Used a Jupyter Notebook to confirm each image with the keypoints on both a standard and custom dataset.
  2. I am testing one of the images I trained on, so I know it's 4-channel.

I'm wondering how the prediction class differs from the training that there is a discrepancy.

I added the cv2.IMREAD_UNCHANGED to every imread I can find. Is there anywhere else in the prediction flow there might be another need to modify the code than what is above?

panchasan commented 10 months ago

Hi @glenn-jocher , I'm still not able to to get the images to work properly.

  1. All the labeling and pre-checks look good. Used a Jupyter Notebook to confirm each image with the keypoints on both a standard and custom dataset.
  2. I am testing one of the images I trained on, so I know it's 4-channel.

I'm wondering how the prediction class differs from the training that there is a discrepancy.

I added the cv2.IMREAD_UNCHANGED to every imread I can find. Is there anywhere else in the prediction flow there might be another need to modify the code than what is above?

RuntimeError: Given groups=1, weight of size [16, 4, 3, 3], expected input[1, 3, 640, 640] to have 4 channels, but got 3 channels instead

I also had that error. The solution was adding cv2.IMREAD_UNCHANGED to every imread. After that there was no error about three channels. Have you changed the model config file by adding ch: 4?

glenn-jocher commented 10 months ago

Hi @lesreaper,

It appears you're on the right track in modifying the process to accommodate 4 channels. Though you have modified all instances of cv2.imread with the cv2.IMREAD_UNCHANGED flag and it has successfully worked for the training phase, there could be an additional location in the prediction phase where an image is read in or manipulated that needs to be adapted to the 4-channel format.

The error message indicates that the prediction phase is still receiving a 3-channel image while it is expecting a 4-channel one due to the modified model architecture.

Remember that you also need to handle 4-channel images where the model is inferencing, this includes not just image loading but also any image processing operation that might be reducing the channel number.

Check all image handling operations in the prediction pipeline, including any resizing, normalizing, or transforming operations, to ensure they are also equipped to handle and maintain the 4 channels in the images.

This could be a tedious process but ensures that the model is receiving images in the format it expects at all stages, not just during training.

Good luck with your debugging!

lesreaper commented 9 months ago

Thanks for all the input @glenn-jocher @panchasan

I found part of the culprit.

So when I leave this line in ultralytics/engine/predictor.py Line 237 to 3, it works to train:

        if not self.done_warmup:
            self.model.warmup(imgsz=(1 if self.model.pt or self.model.triton else self.dataset.bs, 3, *self.imgsz)) # CHANGE HERE
            self.done_warmup = True

But I get this error when running inference:

RuntimeError: Given groups=1, weight of size [16, 4, 3, 3], expected input[1, 3, 640, 640] to have 4 channels, but got 3 channels instead

So, to run inference, I have to change it to 4 channels like this:

        if not self.done_warmup:
            self.model.warmup(imgsz=(1 if self.model.pt or self.model.triton else self.dataset.bs, 4, *self.imgsz)) # CHANGE HERE
            self.done_warmup = True

However, if I run training with the value as 4 I get this error:

RuntimeError: Given groups=1, weight of size [16, 3, 3, 3], expected input[1, 4, 640, 640] to have 3 channels, but got 4 channels instead

The training is clearly getting the 4 values, which is is supposed to. But why is it only expecting 3 IF I've gone through and already made all the changes suggested above. Any thoughts on why this might be?

It also SEEMS to be saving the models with 4 channels as input if it can inference 4 channels when running the predictor, right? With my model, the predict will take both a 3 or 4 channel image, but the 4 channel image is more accurate.

I guess the question is whether the model is actually using that 4th channel properly if I have to constantly change this back and forth. It doesn't make sense I should, but maybe...

glenn-jocher commented 9 months ago

Hello @lesreaper,

The model's architecture changes according to the number of input channels specified in the .yml configuration file, due to which your model seems to be saving with 4 channel inputs. However, the warmup function is not directly related to your dataset or the changes you made in the .yml file.

This warmup phase is not involved in the actual learning process but rather adjusts factors like the batch size and GPU adaptation. By specifying3 in the model warmup dimensions, you're not feeding the model with 3-channel images for training, but rather instructing it to prepare for such data.

That's why, when you switch to inference, the model is expecting 4-channel input, which is consistent with the rest of the modifications you've made to account for the fourth channel.

The reason you're seeing improved performance with 4-channel input implies that the model is indeed learning from the fourth channel, which is good news for your project!

As for why you're getting an error when you change your warmup dimensions to 4 channels, it might be a configuration mismatch. Try ensuring that you've updated all configuration values that might influence this behavior - not just those related to training, but also factors like the model definition and warm-up phase.

This doesn't feel like the expected behavior, but it's possible that there is a mismatch somewhere that's causing this issue. I would suggest a thorough check of all your modifications and see if anything has been overlooked.

Good luck, and don't hesitate to ask if you have any further questions!

john-lihn commented 8 months ago

Hello, I found that I also encountered the same problem when training multi-channel. The source came from verifying the initialization setup_model in predictor.py and setting the weight directly loaded in AutoBackendto the input dimension 3 channel instead of multi-channel, which led to serious errors. , do I need to load the weights during verification to avoid similar incidents? How can I solve this problem? Or how should I continue to use the trained model instead of reloading it? However, during subsequent verification, the data was corrected to 3-channel data, which confused us.

image

glenn-jocher commented 8 months ago

@john-lihn hello,

The issue you've experienced seems to stem from having different channel dimensions during training and verification. This is due to the setup_model function in predictor.py is setting the loaded weights to a hard-coded 3-channel input, while you're working with multi-channel images.

To use your trained weights for subsequent runs without reloading them, you would need to ensure that these weights are saved and loaded as per the multi-channel configuration. The key would be to align the expected channel dimensions throughout all the stages in your pipeline, not just for training, but also for model warmup, validation and inference.

Regarding the discrepancy observed during validation with data corrected to 3-channel, remember that the model has been trained to understand and work with multi-channel images. Hence, if you're giving the model a 3-channel image during validation, it might not behave as intended. It's crucial to maintain consistency in the number of channels throughout your entire pipeline, from training to validation and testing.

It seems like there may be a need for you to modify the setup_model function too, to take into account the number of channels your data actually has, as opposed to defaulting to 3. You want to ensure that the model weights are initialized correctly respecting the number of channels you're working with.

I hope this information helps you to move forward with your project. Feel free to ask further questions if anything is unclear.

lesreaper commented 8 months ago

Has anyone been able to get a decent model out of this multi-channel process?

Whenever I try to use old weights for pose models and transfer them, I get errors on the expected 3 channels but got 4. Building from scratch gives horrible results.

I have a dataset with 500,000 images in it that are 4-channel RGBa, so I think that would be enough to get decent results. According to the data from the results printout, the loss for the pose is under 2 and flatlines after 15 epochs or so, and the mAP is 0.98, which is pretty high.

This is the only training script I can get to even start training with all the changes listed above:

from ultralytics import YOLO

# Load a model
model = YOLO('yolov8s-pose.yaml')  # build a new model from YAML
# model = YOLO('yolov8s-pose.pt')  # load a pretrained model (recommended for training)
# model = YOLO('yolov8s-pose.yaml').load('yolov8n-pose.pt')  # build from YAML and transfer weights

# Train the model
results = model.train(data='coco8-pose.yaml', epochs=100, imgsz=640, device=0)

I then run a test image using the best.pt from the trainx folder, and the results are horrible. This is the test script:

from ultralytics import YOLO
import cv2
import numpy as np
# Load a pretrained YOLOv8n model
model = YOLO('runs/pose/train24/weights/best.pt')

# Define path to the image file
source = cv2.imread('image_set/combined/output.png', cv2.IMREAD_UNCHANGED)

# height, width = 513, 1102
height = 480
width = 640
scale_factor = min(640.0 / width, 640.0 / height)

# Compute new size
new_width = int(width * scale_factor)
new_height = int(height * scale_factor)

# Pad the image to make it square (640x640)
top = bottom = (640 - new_height) // 2
left = right = (640 - new_width) // 2
font = cv2.FONT_HERSHEY_SIMPLEX

pairs = [(0, 1), (0, 2), (1, 3), (2, 4), (5, 7), (7, 9), (6, 8), (8, 10), (5, 6), (5, 11), (6, 12), (11, 12), (11, 13), (13, 15), (12, 14), (14, 16)]

# Run inference on the source
 # Run YOLO object detection
resized = cv2.resize(source, (new_width, new_height))
padded_img = cv2.copyMakeBorder(resized, top, bottom, left, right, cv2.BORDER_CONSTANT, value=[0, 0, 0])

results = model(source)  # list of Results objects

keypoints = results[0].keypoints  # Masks object

keypoints.xy  # x, y keypoints (pixels), (num_dets, num_kpts, 2/3), the last dimension can be 2 or 3, depends the model.
keypoints.xyn  # x, y keypoints (normalized), (num_dets, num_kpts, 2/3)
keypoints.conf  # confidence score(num_dets, num_kpts) of each keypoint if the last dimension is 3.
keypoints.data  # raw keypoints tensor, (num_dets, num_kpts, 2/3)

export = []

for r in results:
    if r.keypoints.xy is None:
        continue

    if r.keypoints.conf is None:
        continue

    keypoints = r.keypoints.xy.cpu().numpy()  # Convert keypoints data to numpy array
    confidences = r.keypoints.conf.cpu().numpy()  # Convert confidence scores to numpy array

    # Iterate over each detected object's keypoints
    for keypoints, confidences in zip(keypoints, confidences):
        # Iterate over each keypoint
        for (x, y), conf in zip(keypoints, confidences):
            print(100, x, y, conf)
            # Only draw keypoints with confidence above 0.5
            if conf < 0.5:
                continue  # skip this low confidence detection
            else:
                # Draw keypoints
                cv2.circle(source, (int(x), int(y)), 5, (0, 255, 0), thickness=-1)
        # Draw skeleton
        for i, j in pairs:
            if confidences[i] > 0.5 and confidences[j] > 0.5:  # only if both keypoints have good confidence
                x1, y1 = keypoints[i]
                x2, y2 = keypoints[j]
                cv2.line(source, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)

    flattened_keypoints = np.around(keypoints.flatten()).astype(int).tolist()
    export.append({
        "class": "person",
        "keypoints": flattened_keypoints
    })

# Display image with detections            
cv2.imshow('Keypoints and Skeleton', source)
cv2.waitKey(0)
cv2.destroyAllWindows()
glenn-jocher commented 8 months ago

@lesreaper hello,

It seems like you’re having some difficulty training a model for pose estimation with multi-channel data. You mentioned that you get an error related to mismatching channels when you try to transfer old weights, but that training from scratch yields poor-quality results.

The error about expected channels is likely due to the weights you're trying to transfer from old models. These weights would have been trained on three-channel (RGB) images, and are therefore incompatible with the four-channel (RGBA) images you're now using.

That said, the poor results from training from scratch indicate that the model may not be learning effectively from the RGBA data. Given the size of your dataset, it should be feasible to train a robust model assuming all other factors, such as data distribution, labeling accuracy, model complexity, and training specifications, are favorable.

Your training script appears fine, but for troubleshooting, you might want to check if the issue persists when using a simpler model or when reducing the number of epochs.

In addition, while you mentioned that the mAP is high (0.98) and loss is low (under 2), the results are still poor. This suggests that there might be an issue with how well the model generalizes to new, unseen data. A high training performance (low loss, high mAP) but poor test performance often signals overfitting. This can potentially be mitigated by techniques such as regularization, dropout layers, and data augmentation.

It could also be useful to separately verify if the keypoints and skeleton drawing code is working as intended, perhaps by providing it with manually created dummy data to confirm it works as expected.

Remember, achieving a good balance between model complexity, data quality, and training specification is key to training a successful model. Hope these insights help you to investigate further and solve the problem. Please feel free to ask if you have more questions.

john-lihn commented 8 months ago

Hello, I have successfully run it recently. I will give a brief explanation of the modification content. Because training is divided into training and verification, and the original training can be verified because the band fails, then I will force the mode of dataset_build to train, so There is no preprocessing problem anymore, and the problems of original band inconsistency and different shapes that cannot be concat are solved together. However, when verifying the map and other information, the training reduction ratio was not displayed, so I entered None in this field and entered the calculation directly, and the problem was solved. Thank you for your explanation.

In addition, for the output model, we only modify the input layer. Can we input only the data behind the input layer? If it is feasible, those who are not RGB channel users can use pre-train.

glenn-jocher commented 8 months ago

@john-lihn hello!

I'm glad to hear that you were able to successfully run the training and verification stages by ensuring the mode of dataset_build is set to 'train'. That's a good way to prevent preprocessing problems and resolve the issue of band inconsistency and differing shapes.

As for the missing training reduction ratio during verification, inputting 'None' seems to have worked for you. This kind of adjustment can sometimes solve issues when certain parameters aren't needed or applicable for a specific stage in the process.

As to your question about the output model: in theory, it may be possible to input data directly to layers beyond the input layer. However, keep in mind this will likely require modifications to the model architecture and possibly the training pipeline.

It's a bit like taking a shortcut directly to a later stage in the process. Just bear in mind that each layer in a model is designed to progressively transform the input in a way that helps the model learn how to achieve its task. By skipping the initial stages, you may risk losing valuable transformations that the model relies on to learn effectively, especially for complex tasks.

Finally, about using pre-trained weights with non-RGB channels, this might be challenging. Pre-trained networks have learned representations from RGB images, and directly applying these weights to images with a different number of channels might not give the expected results. Training from scratch or fine-tuning with your specific multi-channel data might yield better results.

I hope this helps to clarify your queries. Please feel free to reach out if you have further questions.

lesreaper commented 8 months ago

Hi @john-lihn or @glenn-jocher , do you have code examples to what you are referring?

I made all the changes as mentioned above, but it seems you're referring to a number of things not covered in the answers above.

For instance, there is no variable dataset_build in the codebase, so I don't know how to set it's mode to train.

Also, what is the training reduction ratio?

glenn-jocher commented 8 months ago

@lesreaper hello,

Thank you for reaching out and for your patience in working through these issues. While I can't provide specific code examples, I'm happy to explain conceptually what was mentioned in the previous posts.

First, regarding the dataset_build reference: this term doesn't refer to a specific variable in the codebase, it is representing the process of the dataset preparation before training. The suggestion is to ensure that the entire dataset, both training and validation sections, undergoes the same preprocessing steps. This is crucial because discrepancies in preprocessing can lead to various inconsistencies when training the model.

Secondly, the training reduction ratio is a hyperparameter that is typically used in model training to adjust the learning rate. By reducing the learning rate, we can prevent the model from learning too quickly and potentially overshooting the optimal weights during training. Although, this might not be applicable in every use case, as it greatly depends on your dataset and the specific architecture of your model.

Remember, in machine learning, proper data preparation and choosing appropriate hyperparameters are both crucial to creating an effective model. If you have further questions about these or any other terms or processes, please don't hesitate to ask.

Best of luck with your implementation of YOLOv8!

john-lihn commented 8 months ago

image image image Hi, @lesreaper Sorry, the build_dataset was in the wrong order. I set it to train and then added its internal function. All data preprocessing will verify that the training preprocessing is unified. Below is the full name of ratio. You can refer to it.

During training, if there are problems such as verification and data format inconsistency, you can put the data batch['img']. XXX your type, just fine. This is what I discovered during the final debugging. I hope you can solve your problem smoothly.

KEarle commented 7 months ago

I'll try to document my changes in another discussion once I completed my process, but I'll just add this as a note to people trying to use this discussion make modifications to yolov8 permitting higher channel count imagery (13 spectral channels in my instance), since this thread helped me. There's a few more changes that need to be completed on my end, though it does work very well. The change for cv2.imread involving cv2.IMREAD_UNCHANGED will work for 4 channel imagery, but beyond that you will need to add in an alternative method of loading files into the program. In my case, I utilized tifffile to load 13-channel tiffs, but this could also be done by loading numpy files with np.load, though that requires adding an additional supported image format of ".npy".

glenn-jocher commented 7 months ago

@KEarle hello,

Thanks for sharing your experience using YOLOv8 with imagery of more than three channels. It's great to see the model's adaptability to such use cases.

Your note about CV2.imread is very useful. You're correct in stating that cv2.IMREAD_UNCHANGED only works effectively with 4-channel imagery; for any data with more than 4 channels, a different method would be necessary. You've mentioned using tifffile to load 13-channel TIFF files. This serves as a good example of how flexibility in data loading can support a wide variety of inputs and formats.

Another option, as you mentioned, could be to use numpy's np.load function to load numpy files directly. This would likely involve extending the list of supported image formats to include '.npy'.

Your upcoming documentation could be hugely beneficial for other users with similar use cases. Feel free to continue the discussion and share your final process. Your efforts to increase the range of supported data types for YOLOv8 are greatly appreciated.

lesreaper commented 7 months ago

Yes @KEarle, I would really appreciate that! We're running into challenges still on the training, and it's difficult to know if it's architecture or data.

glenn-jocher commented 7 months ago

Sure, @lesreaper. It sounds like you’re dealing with a complex situation. Debugging issues like these can be a bit tough when it's not clear whether the data or the model architecture is the source of the problem.

For your situation, you may want to consider validating both your architecture and data separately.

You could do a quick check on the architecture by using a simple, well-understood dataset to verify whether the model can train and learn properly. If the architecture seems solid, then you could focus on the data.

When reviewing your data, ensure that all instances are properly formatted and consistent. Also, normalization or standardization might be necessary if the data ranges widely or comes from different sources.

Remember, patience is key here. The trial-and-error nature of this work can be time-consuming, but thorough testing will help you pinpoint the issue and resolve it. Good luck!

KEarle commented 7 months ago

@lesreaper

Are you getting code errors or poor results problems still?

More on the poor results side, one thing that earlier changes notes that I don't quite think is nessecary is the disabling of augmentations. They have to be changed to work with higher-channel count imagery, but most of the augmentations will work with any number.

RandomHSV is the only augmentation that will not work with more than 3 channels outright. It relies on the CV2 conversion of BGR to HSV images. In my work I replaced the code of the RandomHSV class with a simple multiplication of all channel by a random uniform value. I mean to rework this into a narrower random per-channel multiplication, but that it's not quite there yet.

LetterBox is the other augmentation that can causes issues, but it will work with 4 channel images. In particular "cv2.copyMakeBorder()" is the issue, but the functionality can be replaced manually with three lines of numpy array operations, and probably fewer if you're more capable with computer programming that I am.

glenn-jocher commented 7 months ago

Hi @KEarle,

I understand you're experiencing more issues on the side of poor results as opposed to code errors.

One thing that strikes me is your understanding of the image augmentations. Disabling them might not be necessary, however, they need to be adapted to work with higher-channel count imagery.

For example, the RandomHSV augmentation operates on the assumption of three channels using the BGR to HSV images conversion from CV2. If you're working with more than three channels, you could modify the RandomHSV class. Instead of using the traditional HSV conversion, consider using a multiplication technique across all channels by a random uniform value. It's worth noting this is an area of ongoing work and potential enhancements could be made, such as per-channel multiplication adjustments.

Regarding the LetterBox augmentation, it may function with 4-channel images but can present problems. The function "cv2.copyMakeBorder()" is the specific concern here. However, it's functionality could be replicated with a few lines of numpy operations in order to adapt to the four channel context.

Please remember that training a model on non-standard data, such as images with more than 3 channels, is by nature more complex and adjustments will likely need to be done at several levels in your pipeline. I hope my explanation gives you some hints on how to tweak the model to fit your specific requirements.

john-lihn commented 7 months ago

I also used multi-band detection with 25 bands for this step. However, the learning of green did not improve, and on the output image, I found multiple object selections. I'm not sure what the specific problem is. Currently, I'm trying to modify the network. However, after reviewing previous literature, I found that the issue may arise from the results obtained after multi-scale head detection, and it may not have any effect on the fully connected layer. Or, could it be caused by other events?

However, it still performs class classification, so I think it's not very relevant.

image

glenn-jocher commented 7 months ago

@john-lihn hi,

It seems that you're dealing with potentially multiple overlapping issues, which might be contributing to the challenges you're facing with multi-band detection with 25 bands.

If you're experiencing issues with learning green and are seeing multiple object selections in the output, it's possible that these could be symptoms of one or more underlying issues. It could potentially be related to the multi-scale 'head' detection you've mentioned, but without more information or test results, it's difficult to say for certain.

As for your point on the fully connected layer possibly not being impacted, it might be worth investigating whether there's a specific reason for this. There may be other factors at play that could be influencing the model's performance.

Also, unless the 'class' misclassifications are symptomatic of a larger issue, they might not be the primary concern, although they're certainly something to keep in mind as you continue to debug and optimize your model.

Overall, given the complex nature of multi-band detection and the potential interplay of various factors, debugging can be quite challenging. You may find it helpful to isolate different portions of your model and run tests to evaluate their functionality independently.

This could include input preparation, model architecture & parameters, the training process, and the processing of the model's output. Identifying whether the problem arises from one, or several, of these components could help pinpoint potential improvements or modifications needed. Good luck with troubleshooting.

john-lihn commented 7 months ago

@glenn-jocher hi,

Later, when I was organizing YOLO, I found an article explaining that the NMS (Non-Maximum Suppression) step is performed before the final results are output to remove overlapping regions. So, one possible reason for overlapping regions is that we may encounter this issue if we skip this step. If this is the case, how should I add it in?

image

sreid19 commented 6 months ago

@lesreaper

Are you getting code errors or poor results problems still?

More on the poor results side, one thing that earlier changes notes that I don't quite think is nessecary is the disabling of augmentations. They have to be changed to work with higher-channel count imagery, but most of the augmentations will work with any number.

RandomHSV is the only augmentation that will not work with more than 3 channels outright. It relies on the CV2 conversion of BGR to HSV images. In my work I replaced the code of the RandomHSV class with a simple multiplication of all channel by a random uniform value. I mean to rework this into a narrower random per-channel multiplication, but that it's not quite there yet.

LetterBox is the other augmentation that can causes issues, but it will work with 4 channel images. In particular "cv2.copyMakeBorder()" is the issue, but the functionality can be replaced manually with three lines of numpy array operations, and probably fewer if you're more capable with computer programming that I am.

Hi there,

I am currently working with tif files as well, but for some reason when making the modifications listed above the model is not able to have box loss anymore. If you wouldn't mind, could you share which modifications you made to work with hyperspectral imagery? I am using tiffile to read and I disabled the augmentation, however you noted this makes the model worse.

glenn-jocher commented 6 months ago

@sreid19 hi there,

I understand you are working with hyperspectral imagery and facing challenges related to box loss after making certain modifications. While you've already tried disabling the augmentations, as you've noticed, this is not always beneficial to the model's performance.

For handling hyperspectral imagery in YOLOv8, focused modifications might be needed for each specific augmentation to account for the higher number of channels compared with standard RGB images. For instance, augmentations that involve color space transformations, like RandomHSV, typically need to be adapted for imagery beyond three channels since they're often designed with RGB images in mind.

With respect to the model not showing any box loss, it suggests that the model may no longer be detecting or localizing objects effectively in the modified input data. This could be due to several reasons, such as issues with preprocessing steps, adaptation of the network layers to handle the hyperspectral data, or incorrect modification of loss functions.

A good step would be to confirm that the input data is correctly formatted and fed into the model. Ensure that any preprocessing or augmentation preserves the integrity of the object annotations and that the model's architecture is capable of processing the higher-dimensional data effectively.

It may also be helpful to progressively implement and test changes rather than applying multiple modifications simultaneously. This approach can allow you to identify which specific change is causing the issue with box loss.

Since you're already using tiffile to read the data, consider whether each step of your preprocessing pipeline, including any data transformations, is compatible with hyperspectral imagery and the expected input format for YOLOv8.

I hope this helps guide you toward a solution that facilitates effective training with hyperspectral datasets. Keep iterating on your approach, and with careful adjustment, you should be able to maintain the integrity of your data's informative characteristics throughout the model training process.

lesreaper commented 6 months ago

@KEarle

@lesreaper

Are you getting code errors or poor results problems still?

I trained two models for a bit (they take forever on this 3060), and I'm not getting very good results. I'm specifically focused on pose detection, and it's just not picking things up.

I'm starting the model from scratch, and the training and loss validation goes down until about 90 epochs, then starts over-fitting (validation and training loss spikes). My only non-default change was the batch size which I changed to 32.

I started training another run based on the previous run using a Learning rate of 0.001, but the results aren't that great so far.

glenn-jocher commented 6 months ago

@lesreaper it sounds like you're facing a couple of challenges with your model training for pose detection, including long training times, overfitting, and unsatisfactory results.

Given your description, here are a few suggestions that might help improve your model's performance:

  1. Training Duration and Overfitting: Extended training on a single dataset, especially if it lacks sufficient variety, may lead to overfitting. As you noted, the validation and training loss begin to diverge around 90 epochs, indicating the model is starting to overfit to the training data. To combat this, you could employ techniques such as early stopping where you halt the training process once the validation loss starts to increase, or data augmentation to increase the diversity of your training set.

  2. Data Augmentation: As you have a focus on pose detection, ensure your data augmentation strategies support the learning of poses effectively. Sometimes, classical augmentations (e.g., random cropping, flipping) might remove critical pose information that is essential for accurate detection.

  3. Learning Rate: Adjusting the learning rate can have a significant impact on the performance of the model. If you've started training another run with a different learning rate and are still not obtaining desired results, consider using a learning rate scheduler that reduces the learning rate upon hitting a learning plateau instead of using a fixed learning rate throughout the training.

  4. Batch Size: You mentioned that you've changed the batch size to 32. The batch size can affect the stability and convergence of your training. If you have the memory capacity, experimenting with different batch sizes might lead to better results.

  5. Starting from Scratch vs. Transfer Learning: While starting from scratch is a valid approach, you might also consider transfer learning, if applicable. This is especially helpful when training a model on a niche task like pose detection, as pre-trained weights can speed up convergence and potentially lead to better generalization.

  6. Model Architecture: Since you're working on pose detection, ensure that the model architecture is suitable for the task. Pose detection may benefit from architectures that are good at identifying key points and spatial relationships in the input data.

  7. Review Dataset and Annotations: Ensure that the dataset and corresponding annotations are high quality and accurately reflect the pose information needed for detection. Faulty or imprecise annotations can significantly hamper model performance.

Keep experimenting with these aspects. Some of these may require adjustments to the model configuration or require experiments that blend a mix of these factors for optimal results. Remember, iterative improvements and consistent evaluations against a validation set will be crucial in finding the sweet spot for your particular application.

DnEcing commented 5 months ago

@lesreaper

您是否仍然遇到代码错误或结果不佳的问题?

在糟糕的结果方面,早期的更改指出的一件事是,我不太认为是nessecary,那就是禁用增强。必须更改它们才能使用更高通道数的影像,但大多数增强都适用于任何数字。

RandomHSV 是唯一一种不能完全使用超过 3 个通道的增强技术。它依赖于 BGR 到 HSV 图像的 CV2 转换。在我的工作中,我将 RandomHSV 类的代码替换为所有通道的简单乘以随机统一值。我的意思是将其重新设计为更窄的随机每通道乘法,但它还没有完全实现。

LetterBox 是另一种可能导致问题的增强功能,但它适用于 4 通道图像。特别是“cv2.copyMakeBorder()”是问题所在,但该功能可以手动替换为三行 numpy 数组操作,如果您像我一样更擅长计算机编程,可能会更少。

我跟您一样遇到了同样的问题,我通过额外加载红外图片的,并且通过cv2.merge((img,img_ir))拼接成6channels的输入,但是在randomHSV中遇到了问题,我似乎通过img[...,3:]和img[...,:3]解决了randomHSV的报错问题,但是在LetterBox这里始终无法解决 img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, cv2.error: OpenCV(4.8.0) /io/opencv/modules/core/src/copy.cpp:1074: error: (-215:Assertion failed) value[0] == value[1] && value[0] == value[2] && value[0] == value[3] i

KEarle commented 5 months ago

@KEarle

@lesreaper Are you getting code errors or poor results problems still?

I trained two models for a bit (they take forever on this 3060), and I'm not getting very good results. I'm specifically focused on pose detection, and it's just not picking things up.

I'm starting the model from scratch, and the training and loss validation goes down until about 90 epochs, then starts over-fitting (validation and training loss spikes). My only non-default change was the batch size which I changed to 32.

I started training another run based on the previous run using a Learning rate of 0.001, but the results aren't that great so far.

Apologies for the late reply, I've been caught up busy with my work and missed replying. As it was a month ago did you have any luck since then?

@DnEcing

Apologies, but I can't speak Chinese. I'll translate this bit with google translate. 抱歉,我不会说中文。我将用谷歌翻译来翻译这一点。

Looking at the translation, it seems like its getting caught up on the augment functions in ultralytics/data/augment.py. You'll have to change: 看看翻译,它似乎陷入了 ultralytics/data/augment.py 中的增强功能。你必须改变:

img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114,)*img.shape[2]) # add border

to something that looks like this: 看起来像这样的东西:

        border_img_array = np.full((img.shape[0]+top+bottom,img.shape[1]+left+right,img.shape[2]), 114, dtype=dtype).astype(dtype)
        border_img_array[top:top+img.shape[0], left:left+img.shape[1], :] = img
        img = border_img_array.astype(dtype)

The openCV/CV2 library in most places in the code will need to be removed, avoided or replicated using other methods. Many functions will only accept up to four bands. More specifically, openCV/ loads up imagery into different colour spaces and the largest I think supports only four (CMYK/ for instance). 代码中大多数地方的 openCV/CV2 库需要删除、避免或使用其他方法复制。许多功能最多只接受四个频段。更具体地说,openCV/ 将图像加载到不同的颜色空间中,我认为最大的颜色空间仅支持四种(例如 CMYK/)。

DnEcing commented 5 months ago

@KEarle

@lesreaper Are you getting code errors or poor results problems still?

I trained two models for a bit (they take forever on this 3060), and I'm not getting very good results. I'm specifically focused on pose detection, and it's just not picking things up. I'm starting the model from scratch, and the training and loss validation goes down until about 90 epochs, then starts over-fitting (validation and training loss spikes). My only non-default change was the batch size which I changed to 32. I started training another run based on the previous run using a Learning rate of 0.001, but the results aren't that great so far.

Apologies for the late reply, I've been caught up busy with my work and missed replying. As it was a month ago did you have any luck since then?

@DnEcing

Apologies, but I can't speak Chinese. I'll translate this bit with google translate. 抱歉,我不会说中文。我将用谷歌翻译来翻译这一点。

Looking at the translation, it seems like its getting caught up on the augment functions in ultralytics/data/augment.py. You'll have to change: 看看翻译,它似乎陷入了 ultralytics/data/augment.py 中的增强功能。你必须改变:

img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114,)*img.shape[2]) # add border

to something that looks like this: 看起来像这样的东西:

        border_img_array = np.full((img.shape[0]+top+bottom,img.shape[1]+left+right,img.shape[2]), 114, dtype=dtype).astype(dtype)
        border_img_array[top:top+img.shape[0], left:left+img.shape[1], :] = img
        img = border_img_array.astype(dtype)

The openCV/CV2 library in most places in the code will need to be removed, avoided or replicated using other methods. Many functions will only accept up to four bands. More specifically, openCV/ loads up imagery into different colour spaces and the largest I think supports only four (CMYK/ for instance). 代码中大多数地方的 openCV/CV2 库需要删除、避免或使用其他方法复制。许多功能最多只接受四个频段。更具体地说,openCV/ 将图像加载到不同的颜色空间中,我认为最大的颜色空间仅支持四种(例如 CMYK/)。

Yes, I saw this description in other forums: In the process of expanding the boundary of the color image, I found that cv2.copyMakeBorder() can only process the color image with the dimension style of (h, w, channel). If your color image is read out as (channel, h, w) or other formats, you need to convert the dimension, otherwise an error will be reported. In addition, I also found in my own experiment that when reading RGB original img, then reading the gray mode of the original through cv2.imread(img,0), and combining with cv2.merge((img_ir,img)), in the LetterBox: img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)) instead img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114, 114)) There will be no error