ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

How can yolov5 use grayscale images for faster detection? #9806

Closed chengzihencai closed 1 year ago

chengzihencai commented 2 years ago

Search before asking

Question

How can YOLOv5 use grayscale images for faster detection? I tried feeding grayscale images in directly, but detection actually became slower. What parts do I need to modify? When a grayscale image is passed in directly, YOLOv5 converts the data into an RGB image. Is that the reason? If so, where do I need to modify the code to fix this?

Additional

No response

glenn-jocher commented 2 years ago

YOLOv5 processes 3ch inputs by default. Single channel speedup is about zero since >99% of all operations are not in the first convolution.

ExtReMLapin commented 2 years ago

On the other hand, you could use grayscale to stack more images in RAM (3 times more data), or even run with 3-times-larger batches on the GPU, maybe?

glenn-jocher commented 2 years ago

@ExtReMLapin constraint is CUDA memory and FLOPs, >>99% of which have no correlation to the input channel count.

chengzihencai commented 2 years ago

> @ExtReMLapin constraint is CUDA memory and FLOPs, >>99% of which have no correlation to the input channel count.

There are many kinds of targets I want to detect, and only their colors differ. Can you advise whether I should detect on grayscale images or simply label all of the objects as a single class?

ExtReMLapin commented 2 years ago

> > @ExtReMLapin constraint is CUDA memory and FLOPs, >>99% of which have no correlation to the input channel count.
>
> There are many kinds of targets I want to detect, and only their colors differ. Can you advise whether I should detect on grayscale images or simply label all of the objects as a single class?

In other words, you're saying you want to detect red, green, and orange apples simply as "apple", right? Just try annotating them as apples without specifying the color.

BenDangHD commented 2 years ago

I would like to use the channels to train on multiple frames at once, to exploit the frame-to-frame differences in training videos, e.g. using 18 channels to process a volume of 18 grayscale frames. It does seem to work, but with unusable computation time. I replaced the load_image function used in __getitem__ in the dataloader with

def load_stacked_images(self, i):
    # Loads self.channels stacked grayscale frames from the dataset starting at index 'i'
    # Returns (im, original hw, resized hw). Requires `from scipy.ndimage import zoom`.
    f, fn = self.im_files[i], self.npy_files[i]
    if fn.exists():  # load pre-stacked npy
        image = np.load(fn)
    else:  # read and stack raw frames
        im = cv2.imread(f)  # BGR
        assert im is not None, f'Image Not Found {f}'
        image = np.expand_dims(cv2.cvtColor(im, cv2.COLOR_BGR2GRAY), axis=2)
        if self.channels > 1:
            for index in range(i + 1, i + self.channels):
                index = min(index, len(self.im_files) - 1)  # clamp at end of dataset
                next_im = cv2.imread(self.im_files[index])  # BGR
                assert next_im is not None, f'Image Not Found {self.im_files[index]}'
                next_im_gray = np.expand_dims(cv2.cvtColor(next_im, cv2.COLOR_BGR2GRAY), axis=2)
                image = np.concatenate((image, next_im_gray), axis=2)
    h0, w0 = image.shape[:2]  # orig hw
    r = self.img_size / max(h0, w0)  # ratio
    if r != 1:  # if sizes are not equal
        image = zoom(image, (r, r, 1))  # scipy.ndimage.zoom over all channels
    return image, (h0, w0), image.shape[:2]  # im, hw_original, hw_resized

It runs extremely slowly, even for the nano model and very small image sizes.
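
A likely culprit for the slowness is the resize step: scipy.ndimage.zoom performs spline interpolation and is typically orders of magnitude slower than OpenCV, and each __getitem__ call also re-reads self.channels files from disk. As a hedged sketch (reusing the same `image`, `r`, `h0`, `w0` as in the function above), the zoom call could be swapped for cv2.resize, which handles multi-channel arrays directly:

# Hedged alternative to the scipy.ndimage.zoom call above: cv2.resize
# accepts arrays with more than 3 channels and is far faster for rescaling.
interp = cv2.INTER_LINEAR if r > 1 else cv2.INTER_AREA
image = cv2.resize(image, (int(w0 * r), int(h0 * r)), interpolation=interp)
if image.ndim == 2:  # cv2 drops the channel axis for single-channel input
    image = image[:, :, None]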

github-actions[bot] commented 1 year ago

πŸ‘‹ Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐!

kbegiedza commented 1 year ago

> Single channel speedup is about zero since >99% of all operations are not in the first convolution.

@glenn-jocher What about the preprocessing-step speedup for inference? It's much faster to process one third of the data when preparing a frame for inference.

glenn-jocher commented 12 months ago

@kbegiedza You could preprocess images into grayscale format before feeding them into the model. This can lead to a reduction in computation time for inference as the model will only need to process a third of the original data.
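
As a rough illustration (a hypothetical snippet, not part of the YOLOv5 API; the file path and the 640-square resize are assumptions): once the frame is grayscale, every later preprocessing step touches one third of the bytes, though the model itself must be built for single-channel input to consume the result directly.

import cv2
import torch

frame = cv2.imread('frame.jpg')                  # hypothetical path; BGR, H x W x 3
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # H x W, one third of the bytes
gray = cv2.resize(gray, (640, 640))              # resize now handles 1/3 of the data
x = torch.from_numpy(gray).float().div(255.0)[None, None]  # 1 x 1 x 640 x 640 tensor
# Suits a model built with ch=1; a stock 3-channel model would instead need
# the plane replicated across channels: x = x.repeat(1, 3, 1, 1)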

Duna2004 commented 10 months ago

I apologize in advance for addressing a closed matter, but I am a bit confused about one thing, and I trust your expertise can clarify it for me. Specifically, I am considering using YOLO for people and vehicle detection, and I suspect that grayscale input might work better than RGB. I would greatly appreciate it if you could explain where the following assumptions might be wrong:

  1. Since I am only using a third of the image, I can use a higher resolution image, which will increase the chance of detecting smaller objects.
  2. I'm primarily interested in detecting people and vehicles, and I believe that training and using the network on grayscale images could improve detection, especially in nighttime conditions when most cameras are grayscale anyway.
  3. Most of the decoders I use output some form of YUV. By using only the Y (luminance) plane, I can skip the conversion to RGB entirely and save resources (see the sketch after this list).
  4. I feel that color is not essential in detecting people and vehicles.
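
On point 3, a minimal sketch (a hypothetical helper, assuming an I420/YUV420p buffer, which is what most decoders emit): the Y plane occupies the first height x width bytes, so extracting it is a cheap slice with no color conversion at all.

import numpy as np

def y_plane(yuv420, height, width):
    # In I420/YUV420p layout the first height * width bytes are the
    # luminance plane; U and V follow at quarter resolution each.
    return yuv420[:height * width].reshape(height, width)
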
ExtReMLapin commented 10 months ago

> Since I am only using a third of the image, I can use a higher resolution image, which will increase the chance of detecting smaller objects.

I wouldn't take this as a real point as inference is quite cheap.

kbegiedza commented 10 months ago

> 1. Since I am only using a third of the image, I can use a higher resolution image, which will increase the chance of detecting smaller objects.

probably, yes

> 2. I'm primarily interested in detecting people and vehicles, and I believe that training and using the network on grayscale images could improve detection, especially in nighttime conditions when most cameras are grayscale anyway.

typically, nighttime cameras use NIR, which is a different beast than simple grayscale (converting RGB -> GRAY will not give you the same data as an NIR image)

> 3. Most of the decoders I use output some form of YUV. By using only the Y (luminance) plane, I can skip the conversion to RGB entirely and save resources.

true

> 4. I feel that color is not essential in detecting people and vehicles.

it depends; color features can be helpful for some applications

Duna2004 commented 10 months ago

> > Since I am only using a third of the image, I can use a higher resolution image, which will increase the chance of detecting smaller objects.
>
> I wouldn't take this as a real point as inference is quite cheap.

Thank you for the answer; could you elaborate a bit? In my experience and tests, when detecting on 20 cameras at 15 fps there is a real difference between feeding the network 640x480 and 1280x720 (resolutions chosen for identical frame size: RGB 640×480×3 = 921,600 bytes vs. GRAY8 1280×720 = 921,600 bytes).

This difference shows up both in inference compute and in much better detection of smaller objects in the scene, and also (although only slightly) for larger objects; this was all tested using RGB.

Duna2004 commented 10 months ago

> typically, nighttime cameras use NIR, which is a different beast than simple grayscale (converting RGB -> GRAY will not give you the same data as an NIR image)

I did not think that would make much of a difference. Is there some kind of image filter that could help adapt a dataset for that?

Duna2004 commented 10 months ago

The main question for me is whether YOLOv5 (or maybe YOLOv8) can be modified so that switching to grayscale has any real effect. What I mean is: if I change YOLO to one channel on the input layer, will the network still compute as if there were 3 channels, and thus see a negligible effect on performance?

Why does YOLO use RGB instead of YUV on the input layer? Based on this article, YUV, YCrCb, and HSV are indeed better than RGB.

glenn-jocher commented 10 months ago

@Duna2004 your inquiries touch on several aspects of image processing and neural network performance. Let's address them one by one:

  1. Higher Resolution with Grayscale: Using grayscale images might allow you to process higher resolution images within the same computational budget, potentially improving the detection of smaller objects. However, the actual performance gain depends on the balance between resolution and the informative value of color features for the task.

  2. Nighttime Detection: If your application involves nighttime conditions where cameras switch to grayscale or NIR, training the network directly on grayscale images could indeed be beneficial. However, NIR and grayscale are not the same; NIR captures information beyond the visible spectrum. A simple RGB to grayscale conversion won't replicate NIR characteristics. There's no straightforward filter to convert RGB to NIR-like images, as NIR captures different wavelengths.

  3. YUV and Luminance Channel: Using the Y (luminance) channel from YUV could save resources if your decoder outputs YUV natively. This is because you can avoid the conversion to RGB. However, the performance gain from this step alone might be marginal compared to the overall inference time.

  4. Color in Detection: The importance of color in detecting people and vehicles is context-dependent. In some scenarios, color cues are crucial, while in others, such as low-light conditions, they might be less important.

  5. Modifying YOLO for Grayscale: YOLO models are typically trained on RGB images because they provide more information (color) which can be useful for detection. Modifying YOLO to accept single-channel grayscale images would involve changing the input layer and retraining the model on grayscale data. This could potentially reduce computational load slightly, but the main computational cost comes from deeper layers, so the speedup might be minimal (see the sketch below).

  6. RGB vs. YUV: The choice of RGB over YUV or other color spaces is often due to the availability of pre-trained models and datasets in RGB format. While other color spaces might offer advantages in certain applications, the conversion from RGB to these spaces can introduce additional computational overhead. The article you mentioned discusses the potential benefits of alternative color spaces, but the practical implications depend on the specific task and the dataset.

In summary, if you believe grayscale images are more representative of your application's conditions, you could experiment with retraining a YOLO model on grayscale images. Keep in mind that this requires a comprehensive grayscale dataset for training. The performance and speed gains would need to be empirically validated, as they depend on various factors, including the specific architecture and the hardware used for inference.
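
On point 5, a minimal sketch of what the single-channel change looks like (hedged: this assumes a recent YOLOv5 repo clone, where models/yolo.py exposes DetectionModel; older versions name the same class Model, and the dataloader changes needed to actually feed 1-channel batches are not shown):

import torch
from models.yolo import DetectionModel  # run from inside a YOLOv5 repo clone

# Build the nano architecture with a 1-channel stem instead of 3.
model = DetectionModel(cfg='models/yolov5n.yaml', ch=1, nc=80)
x = torch.zeros(1, 1, 640, 640)  # dummy single-channel batch
preds = model(x)  # forward pass works; training still needs a grayscale dataloader

Note that the parameter saving is confined to the very first convolution, which is why total FLOPs barely move; this is consistent with the earlier comments in this thread.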

akiannejad123 commented 8 months ago

If there are others who need grayscale, this Medium article shows how to modify the training script and dataloader: https://medium.com/@cristina.todoran_46685/yolo-v5-on-grayscale-images-9e8995444427. The only issue I am having with it is getting the model to validate; I either get an NMS error or a channel problem.

glenn-jocher commented 5 months ago

@akiannejad123 hi there! It sounds like you're making good progress with adapting YOLOv5 for grayscale images. If you're encountering NMS errors or channel issues during validation, it might be due to discrepancies in how the channels are handled during training versus validation. Ensure that your validation dataset is processed in the same way as your training dataset, specifically that images are consistently converted to grayscale across both.

Here’s a quick check you can do in your data loader to ensure images are correctly handled:

# Ensure images have a single channel dimension
if image.ndim == 2:
    image = image[:, :, None]

This snippet ensures that even if your image is read as a 2D array (grayscale), it's reshaped to have a third dimension, which YOLO expects. Double-check similar handling in both training and validation phases. Good luck! 😊

daxel123 commented 5 months ago

Great information! I'm currently working with NIR and color videos (night and day conditions). I would like to know whether I could retrain the model on a mixture of NIR and RGB images to obtain one model capable of detecting in both formats, or whether it is preferable to keep two models (day and night) for inference. Thanks, and great job with YOLO!

glenn-jocher commented 5 months ago

Hi @daxel123,

Thank you for your kind words and for sharing your use case! Working with both NIR and RGB images sounds like an exciting project. Let's dive into your question:

Mixed Training vs. Separate Models

Mixed Training: a single model trained on both NIR and RGB frames can generalize across day and night conditions and is simpler to deploy, at the risk of somewhat lower peak accuracy in either domain.

Separate Models: two specialized models may each perform better in their own condition, but you then need to maintain both and switch between them at inference time.

Recommendations

  1. Experimentation: I recommend starting with a mixed dataset to see if a single model can achieve satisfactory performance. If the results are not as expected, you can then train separate models for day and night conditions.
  2. Data Augmentation: Ensure that your dataset is well-balanced and that you use appropriate data augmentation techniques to help the model generalize better.
  3. Validation: Use a validation set that includes both NIR and RGB images to monitor the model's performance across different conditions.

Implementation

To train a model with mixed data, you can simply include both NIR and RGB images in your training dataset. Ensure that your data loader handles both types of images correctly:

# Example of handling mixed data types in the data loader
import cv2

def load_image(path):
    image = cv2.imread(path, cv2.IMREAD_UNCHANGED)  # preserves original channel count
    if image.ndim == 2:  # single-channel (NIR/grayscale) file
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)  # lift to 3 channels
    return image
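
A note on the snippet above: cv2.IMREAD_UNCHANGED returns a 2-D array for single-channel files, so the GRAY2RGB branch lifts NIR frames to three channels; that way a stock 3-channel YOLOv5 model can train on NIR and RGB frames interchangeably, with no architectural changes.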

Next Steps

If you encounter any issues or need further assistance, please provide a minimum reproducible code example so we can better understand and address your specific situation. You can find more details on how to create one here.

Also, make sure you are using the latest versions of torch and YOLOv5 from our repository. Keeping your packages up-to-date ensures you benefit from the latest features and bug fixes.

Feel free to reach out with any more questions or updates on your progress. We're here to help! 😊