
Ultralytics YOLO11 🚀
https://docs.ultralytics.com
GNU Affero General Public License v3.0

What's the output of fast sam? #4490

Closed LLsmile closed 1 year ago

LLsmile commented 1 year ago

Search before asking

Question

I want to deploy FastSAM in ONNX format, but I cannot understand the structure of the ONNX model. Can anybody help me? Below is the brief description from Netron. What do output0 and output1 stand for? [256,256] should be the size of the mask, but what about the 32 in output1, and the 37 and 21504 in output0?

[Screenshot from 2023-08-22 16-25-24: Netron view of the exported ONNX model's inputs and outputs]

Additional

No response

github-actions[bot] commented 1 year ago

👋 Hello @LLsmile, thank you for your interest in YOLOv8 🚀! We recommend a visit to the YOLOv8 Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

glenn-jocher commented 1 year ago

@LLsmile hello,

The SAM (Segment Anything Model) architecture in YOLOv8 produces two outputs when exported to the ONNX format.

The first output (output0) corresponds to the pixel-wise segmentation mask of the size [256,256] as you mentioned. This output assists in understanding how each pixel in the image is classified among different object classes defined in the model.

The second output (output1), which has three dimensions (dim0: 32, dim1: 37, and dim2: 21504), corresponds to the multi-scale bounding box predictions made by the model. These predictions are crucial for the task of object detection.

Here, 32 signifies the number of grids that the input image is divided into. For each grid cell, the model predicts 37 attributes. These 37 attributes include information like bounding box coordinates, objectness score, and class probabilities. The final dimension, 21504, defines the overall number of bounding box predictions made by the model over all grid cells at multiple scales.
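If it helps, you can double-check the exact tensor names, shapes, and dtypes of your own export with ONNX Runtime. A minimal sketch (the file name FastSAM-s.onnx is only an assumption, substitute your exported path):

import onnxruntime as ort

# Open the exported model; replace the path with your own ONNX file
session = ort.InferenceSession("FastSAM-s.onnx", providers=["CPUExecutionProvider"])

# Print every input and output together with its shape and element type
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)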

Hope this clears up your confusion. Please feel free to ask if you have any more questions about the model's output.

LLsmile commented 1 year ago

Thanks for your explanation.

LLsmile commented 1 year ago

Well, the outputs seem a little complicated. I have a few questions about the ONNX model. First, how should the image be preprocessed before feeding it into the ONNX model? The ONNX example code in this repo shows only a few operations, as follows. Is this enough?

# Convert the image color space from BGR to RGB
img = cv2.cvtColor(self.img, cv2.COLOR_BGR2RGB)

# Resize the image to match the input shape
img = cv2.resize(img, (self.input_width, self.input_height))

# Normalize the image data by dividing it by 255.0
image_data = np.array(img) / 255.0

# Transpose the image to have the channel dimension as the first dimension
image_data = np.transpose(image_data, (2, 0, 1))  # Channel first

Second, how are the 37 numbers arranged? Are they x, y, w, h, conf, cls1_score, cls2_score, ...? I need the absolute position of the detected objects, so should the bbox values be multiplied by 1024?

glenn-jocher commented 1 year ago

@LLsmile,

For the first question, the preprocessing steps you posted are indeed what you need to do. Here is a brief overview (a minimal end-to-end sketch follows the list):

  1. Convert the image color from BGR to RGB: YOLOv8 models are trained with images in the RGB format.
  2. Resize the image: As YOLOv8 models expect a certain input size, you need to resize your image to match the size it was trained on.
  3. Normalize pixel values: Dividing by 255.0 is done to bring the pixel values between 0 and 1, as models are trained on normalized images.
  4. Transpose the image: The image is transposed to match the 'channel-first' format - this is how PyTorch, the library YOLOv8 was implemented in, expects the images to be formatted.
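Putting those four steps together, here is a minimal end-to-end sketch (the 1024x1024 input size and the variable names are assumptions; the batch dimension and float32 cast are included because most ONNX runtimes expect them):

import cv2
import numpy as np

def preprocess(image_bgr, input_width=1024, input_height=1024):
    # 1. BGR -> RGB
    img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    # 2. Resize to the model's expected input size
    img = cv2.resize(img, (input_width, input_height))
    # 3. Scale pixel values to [0, 1]
    img = img.astype(np.float32) / 255.0
    # 4. HWC -> CHW, then add a leading batch dimension
    return np.transpose(img, (2, 0, 1))[None, ...]

# Usage sketch (session is an onnxruntime.InferenceSession; names are assumed):
# outputs = session.run(None, {session.get_inputs()[0].name: preprocess(cv2.imread("image.jpg"))})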

For the second part of your question, the arrangement of the 37 numbers in the output is as follows:

  1. The first four numbers correspond to the x, y, w, h coordinates of the bounding box. x and y denote the center of the box and w and h correspond to the width and height of the bounding box, respectively.
  2. The fifth value represents the objectness/confidence score.
  3. The remaining 32 values are the class scores for each of the classes the model has been trained on.

To get the absolute position of the objects detected, you are correct in thinking that the bounding box values should be multiplied by the dimension of the input image. This is because the bounding box coordinates are normalized to [0, 1] during training. However, do make sure to multiply x and w by the width of the image and y and h by the height of the image, to get accurate results.

This allows the model to give you the accurate position and size of the detected objects in the pre-processed image. This position will then need to be adjusted if you wish to locate the object in the original, non-resized image.
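As a small sketch of that mapping (the normalized flag and function name are assumptions; some exports emit boxes already in input-pixel coordinates, in which case pass normalized=False):

def xywh_to_original_xyxy(x, y, w, h, input_size, orig_w, orig_h, normalized=True):
    # If the export produces coordinates normalized to [0, 1], map them to input pixels first
    if normalized:
        x, w = x * input_size, w * input_size
        y, h = y * input_size, h * input_size
    # Undo the plain resize from (input_size, input_size) back to the original image
    x, w = x * orig_w / input_size, w * orig_w / input_size
    y, h = y * orig_h / input_size, h * orig_h / input_size
    # Center format -> corner format
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2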

LLsmile commented 1 year ago

Emmm, I'm confused by the ONNX output. According to the source code in this repo, it seems there is no objectness score in the YOLOv8 detection model, and another YOLOv8-seg repo also has no objectness score. But if there is no objectness score, what do the 37 values stand for? 4 bbox + 32 scores and what else?

glenn-jocher commented 1 year ago

@LLsmile yes, you're correct. The 37 values stand for the 4 bounding box coordinates (x, y, w, h), followed by the class scores for each of the 32 classes that the model has been trained on.

In some versions of YOLO, there is an objectness score that represents the confidence that an object is present within the bounding box. However, in the case of YOLOv8, and specifically for the segmentation task, the model does not appear to produce a separate objectness score. This seems to be a deliberate design choice, possibly to simplify the output, or because it is not crucial for the segmentation task.

In the absence of objectness scores, the class scores can also give a rough estimate of how confident the model is about the presence of an object - a high score in any of the classes would suggest the model is confident an object is present.

I hope this clarifies your confusion about the 37 values. Please let me know if you have any more questions.

Haroldhy commented 1 year ago

You can look at a FastSAM TensorRT library that goes into great detail on post-processing: https://github.com/ChuRuaNh0/FastSam_Awsome_TensorRT. I've run and read the code for that library; have a look at the post-processing part of its inference code.

glenn-jocher commented 1 year ago

@Haroldhy hello,

Thank you for pointing out the FastSam TensorRT library. It's indeed useful for understanding the post-processing stage in detail.

In the context of YOLOv8 used in our repository, when we receive the outputs from the model, they are raw predictions that need to be post-processed in order to be interpreted.

These post-processing steps often include operations such as applying thresholds to filter out low confidence predictions, and non-max suppression to handle overlapping bounding boxes.

For the SAM (Segment Anything Model) architecture, it produces two outputs. The first is a pixel-wise segmentation mask, while the second output corresponds to the multi-scale bounding box predictions. For the latter, the output includes the bounding box coordinates and class probabilities for each grid cell at multiple scales.

In order to interpret these outputs, we need to perform post-processing. This includes transforming the normalized bounding box coordinates into actual pixel values, and identifying the detected class by taking the argmax of the class probabilities.

To prevent overlapping detections, we can apply non-max suppression using the class probabilities as the confidence scores.
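A minimal sketch of that confidence filtering plus NMS, assuming the raw predictions have already been split into per-candidate center-format boxes and scores (OpenCV's cv2.dnn.NMSBoxes is used here purely for convenience):

import cv2
import numpy as np

def filter_and_nms(boxes_xywh, scores, conf_thres=0.4, iou_thres=0.5):
    # Drop low-confidence candidates first
    keep = scores > conf_thres
    boxes_xywh, scores = boxes_xywh[keep], scores[keep]

    # cv2.dnn.NMSBoxes expects [x_top_left, y_top_left, width, height]
    tlwh = boxes_xywh.copy()
    tlwh[:, :2] -= tlwh[:, 2:] / 2
    idx = cv2.dnn.NMSBoxes(tlwh.tolist(), scores.tolist(), conf_thres, iou_thres)
    idx = np.array(idx).reshape(-1)
    return boxes_xywh[idx], scores[idx]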

I hope this provides a clear explanation of how to handle the model outputs for post-processing, even without referring to specific code examples. If you have any further queries, feel free to ask.

LLsmile commented 1 year ago

Thanks. I finished it by following the yolov8-seg example, but I just treat FastSAM as a normal instance segmentation model to detect objects. How can I get the full segmentation result like the online demo shows? Lower the box threshold and merge all the masks of the detected objects?

glenn-jocher commented 1 year ago

@LLsmile hello,

You're on the right track! To get all the segmentation results similar to the online demo, there are indeed a few steps you could follow (a small sketch of the mask merging follows the list):

  1. Lowering the box-threshold: This will result in a larger number of detections, including detections with lower confidence. However, be cautious with this, as lowering the threshold too much might lead to false positives in your results.

  2. Merging all the masks of detected objects: After detection, each object should have its own corresponding mask. By merging all these detected object masks, you will be able to produce a comprehensive segmentation result covering all detected objects in the frame.

  3. Post-processing: Don't forget to perform post-processing after detection and before merging. This includes operations like filtering (accounting for the new threshold), non-max suppression, and more.
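As a rough sketch of step 2, assuming you already have one binary mask per detection at the original image resolution (all names here are assumptions):

import numpy as np

def merge_masks(masks):
    # masks: iterable of per-object binary masks, each of shape (H, W)
    # The union of all instance masks gives one combined segmentation map
    merged = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        merged |= m.astype(bool)
    return merged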

Keep up the good work and let me know if you have any more questions!

aman-agar commented 9 months ago

Hey! Can you also explain why the .pt model returns an embedding of size [512]? Also, please explain the meaning of those embeddings.

glenn-jocher commented 9 months ago

@aman-agar hello,

The .pt model returns an embedding of size [512] as a feature vector, typically used to represent the content of an image in a compact form. These embeddings capture the high-dimensional data in a lower-dimensional space and can be useful for tasks like similarity comparison, clustering, or as input for further classification layers. The values themselves are abstract features that the model has learned to recognize during training and don't have an inherent meaning on their own outside the model's context. They are often used to compare images based on learned features rather than direct visual content.
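As an example of the similarity-comparison use case, here is a minimal cosine-similarity sketch in plain NumPy (the two [512] vectors are placeholders for embeddings you have already extracted):

import numpy as np

def cosine_similarity(a, b):
    # Values close to 1.0 indicate that the model sees the two images as similar
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# e.g. cosine_similarity(embedding_img1, embedding_img2)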

Haroldhy commented 9 months ago

> Thanks. I finished it by following the yolov8-seg example, but I just treat FastSAM as a normal instance segmentation model to detect objects. How can I get the full segmentation result like the online demo shows? Lower the box threshold and merge all the masks of the detected objects?

If you're trying to get masks with class IDs from FastSAM, I think it's hard, because its output is just a set of masks. The online demo only looks classified because it paints the masks in different colors; there is no category information behind those color blocks. They are plain masks, simply stacked onto one image and painted with different colors. I was also hoping to find out which color block corresponds to what, but I failed. Let me know if you have any good suggestions as well, thanks.

glenn-jocher commented 9 months ago

@Haroldhy to achieve segmentation results akin to the online demo, you can indeed lower the detection threshold to capture more objects and then merge the masks. The demo likely visualizes each detected instance with a unique color for clarity, but without class-specific information. If you require distinct class labels for each mask, additional steps beyond the current FastSAM output are needed. Your understanding is correct; the demo does not provide class-to-color mapping. If you have ideas or further questions, feel free to share!
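A small sketch of that demo-style visualization, assuming a list of binary masks at the image resolution (the colors are random and carry no class meaning):

import cv2
import numpy as np

def overlay_masks(image_bgr, masks, alpha=0.5, seed=0):
    # Paint every instance mask with its own random color, then blend with the image
    rng = np.random.default_rng(seed)
    overlay = image_bgr.copy()
    for m in masks:
        overlay[m.astype(bool)] = rng.integers(0, 256, size=3, dtype=np.uint8)
    return cv2.addWeighted(overlay, alpha, image_bgr, 1 - alpha, 0)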

LLsmile commented 8 months ago

The detection results are strange on my dataset. In some cases the model works well with RGB input, but in other cases it works well with BGR input. I didn't find the conversion function in the training code, so can you help me locate the code that converts BGR to RGB in this repo? @glenn-jocher

glenn-jocher commented 8 months ago

@LLsmile,

The YOLOv8 models are trained with RGB images. There is no explicit BGR to RGB conversion within the training code, as the datasets typically used (like COCO) are loaded in RGB format by default. If you're experiencing inconsistencies in detection results between RGB and BGR, this might be due to the way your dataset images are loaded or pre-processed. Ensure your dataset loading pipeline conforms to the RGB standard for best results. If you need further assistance, please provide more details about your dataset and loading process.
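For a quick sanity check on the loading side: OpenCV reads images in BGR order, so a single conversion is needed anywhere RGB is expected. A minimal sketch:

import cv2

img_bgr = cv2.imread("sample.jpg")                   # OpenCV loads images as BGR
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)   # convert once, keep RGB downstream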

thoron commented 8 months ago

Hello, I am having some issues understanding the pixel-wise segmentation mask output. Using this ONNX model I get great bboxes (from output0), but trying to mask the contour is not yielding the expected results (as in the FastSAM repo). [image attached]

I am trying to implement all the post-processing steps in C# to understand them better. Am I misunderstanding the post-processing of the masking?

foreach (var prediction in finalBoxes)
{
    // Crop the segmentation mask corresponding to the bounding box
    var maskSegment = CropMask(output1, prediction.OriginalBox);

    // Extract the contour from the cropped mask
    var contour = ExtractContour(maskSegment, 0.5f);

    // Original image space box
    var box = prediction.Box; 
    // Transform contour coordinates back to original image space, all included pixels
    var transformedContour = contour.Select(point => (
        x: (int) (point.X * _ratio + box.Xmin),
        y: (int) (point.Y * _ratio + box.Ymin)
    ));
}

// name: output1
// tensor: float32[1,32,128,128]
static float[,] CropMask(Tensor<float> fullMask, RectangleF bbox)
{
    var croppedMask = new float[(int) Math.Ceiling(bbox.Width), (int) Math.Ceiling(bbox.Height)];

    for (int x = (int) bbox.X, i = 0; x < (int) bbox.Right; x++, i++)
    {
        for (int y = (int) bbox.Y, j = 0; y < (int) bbox.Bottom; y++, j++)
        {
            // Divide by 4 as imgsz / output1.imgsz
            // c = 0 as {0: 'object'}
            croppedMask[i, j] = fullMask[0, 0, x / 4, y / 4];
        }
    }

    return croppedMask;
}

private static IEnumerable<IModelProcessing.Point> ExtractContour(float[,] maskSegment, float threshold)
{
    var contour = new List<IModelProcessing.Point>();

    for (int y = 0; y < maskSegment.GetLength(1); y++)
    {
        for (int x = 0; x < maskSegment.GetLength(0); x++)
        {
            if (maskSegment[x, y] > threshold)
            {
                contour.Add(new(x, y));
            }
        }
    }

    return contour;
}

Thanks for taking the time to answer all these questions!

glenn-jocher commented 8 months ago

@thoron hello,

The issue might be related to how the segmentation mask (output1) is being processed. Ensure that you're correctly interpreting the mask's dimensions and scaling it relative to the bounding box size. The mask output is typically a lower resolution than the original image, so you'll need to resize the mask to match the bounding box dimensions before extracting the contour.

Also, verify that the threshold you're using to extract the contour is appropriate. If the threshold is too high, you might miss parts of the mask; if it's too low, you might include too much background.

Lastly, ensure that the contour extraction logic correctly identifies the object's edges and that the transformation back to the original image space is accurate.

If you continue to face issues, consider revisiting the post-processing steps in the FastSAM repo for reference, and ensure that your C# implementation mirrors those steps closely.

qscacheri commented 7 months ago

@thoron The above information is incorrect in regards to the meaning of the outputs. The 37 in output0 represents x, y, width, height, objectness and 32 mask coefficiants, not class scores as fast sam only has class. In order to get the actual masks, you need to combine output0 and output1. If you remove the first 5 rows and transpose such that you are left with a 21504x32 matrix and a 21504x5 matrix, you can multiply the 21504x32 @ 32x65536 (256x256 flattened) matrix to join them. this will leave you with a 21504x65536 where you now have all detected objects and a mask for each object. After applying NMS you just need to go through the remaining bounding boxes and crop that detected object's mask to only include the points inside the bounding box.