pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[feature request] [discussion] mask utils in core #4415

Open vadimkantorov opened 3 years ago

vadimkantorov commented 3 years ago

🚀 The feature

  1. Extracting bounding boxes from label map: https://github.com/pytorch/pytorch/issues/22378#issuecomment-881954924, https://github.com/pytorch/vision/issues/3960 - scatter_reduce now supports amin/amax, so can be done in batched regime
  2. Extracting label maps from RGB label maps (https://github.com/pytorch/pytorch/issues/5436)
  3. Conversion of RGB uint8 tensors to RGBA (or ARGB) uint32 tensor (https://github.com/pytorch/pytorch/issues/5436#issuecomment-920034956) for extracting "unique" labels faster
  4. Compression of masks (relevant for images with many objects / segments), e.g. RLE encoding / decoding as found in pycocotools.mask
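
Items 2 and 3 above could be sketched roughly as follows. This is only an illustration, not an existing torchvision function: the name rgb_to_label_map is hypothetical, and the sketch packs colors into int32 as a stand-in for the uint32 tensor mentioned above.

```python
import torch


def rgb_to_label_map(rgb):
    """Hypothetical helper: (3, H, W) uint8 RGB label map -> integer label map.

    Returns (labels, palette) where labels is (H, W) int64 and palette is
    (K, 3) uint8, so that palette[labels] reproduces the input colors.
    """
    r, g, b = rgb.to(torch.int32).unbind(0)
    # Pack the three 8-bit channels into one integer per pixel so that
    # "unique colors" becomes a plain 1-D unique over scalars.
    packed = (r << 16) | (g << 8) | b
    colors, labels = torch.unique(packed, return_inverse=True)
    # Unpack the sorted unique colors back into an RGB palette.
    palette = torch.stack(
        [(colors >> 16) & 255, (colors >> 8) & 255, colors & 255], dim=1
    ).to(torch.uint8)
    return labels, palette


rgb = torch.zeros(3, 2, 2, dtype=torch.uint8)
rgb[:, 0, 0] = torch.tensor([255, 0, 0], dtype=torch.uint8)  # one red pixel
labels, palette = rgb_to_label_map(rgb)
```

Label IDs here follow the sorted order of the packed colors, so black (packed value 0) gets ID 0 whenever it is present.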

Motivation, pitch

In detection/segmentation these utils are needed very frequently

Alternatives

No response

Additional context

No response

oke-aditya commented 3 years ago

Can you elaborate a bit? I'm not experienced enough to fully understand the above. :smiley:

vadimkantorov commented 3 years ago

In a semantic/instance segmentation context, segmentations are usually represented by some sort of mask:

  1. Integer label maps. Every pixel holds the ID of the instance/category it belongs to.
  2. RGB label maps. Every pixel is assigned an RGB value, and each unique RGB tuple stands for an instance/category ID. This is how Pascal and other datasets represent the ground truth. This representation is also useful for visually inspecting segmentation results for multiple objects in an image.
  3. Binary mask maps, where the batch dimension enumerates the objects/categories present in an image.
  4. Compressed RLE representation, used in COCO and the official pycocotools. This allows for efficient storage of masks when there are many instances in the image.

Some representations are used for the ground truth in datasets and are optimized for efficient storage; others are more convenient as learning targets or for manipulating masks in code.

Given these representations, one often needs to: 1) convert between them, e.g. extract unique RGB colors; get an integer label map from an RGB label map; get binary masks from integer label maps; convert to RGB label maps for visualization; compress and decompress masks; 2) extract the bounding boxes corresponding to each segment.
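
As a rough sketch of one of these conversions (integer label map to binary masks and back), assuming an (H, W) integer tensor as input. The function names are hypothetical, and the reverse direction assumes the masks do not overlap.

```python
import torch


def label_map_to_binary_masks(labels):
    """Hypothetical helper: (H, W) integer label map -> (K, H, W) bool masks."""
    ids = torch.unique(labels)  # sorted IDs present in the map
    # Broadcast-compare every ID against the whole map at once.
    masks = labels.unsqueeze(0) == ids.view(-1, 1, 1)
    return masks, ids


def binary_masks_to_label_map(masks, ids):
    """Inverse direction; only well defined if the masks are disjoint."""
    return (masks.to(ids.dtype) * ids.view(-1, 1, 1)).sum(dim=0)


labels = torch.tensor([[0, 1], [2, 1]])
masks, ids = label_map_to_binary_masks(labels)
restored = binary_masks_to_label_map(masks, ids)
```

Note that the binary form costs K times the memory of the label map, which is one reason to keep both directions available.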

oke-aditya commented 3 years ago

Hi @vadimkantorov

Just like bounding boxes, I think there are multiple formats for segmentation masks as you mentioned.

  1. To convert between the representations:

Unlike boxes, mask formats cannot all be converted into one another. E.g. from binary (boolean) masks it won't be possible to recover the same RGB label maps, although the reverse conversion is feasible. Hence, it might be useful to provide a utility for converting masks. But we need to make sure it doesn't pollute the namespace with too many functions. Maybe something like mask_convert?

  2. To visualize and convert masks to boxes, etc.

Just as we assume boxes to be in Pascal VOC format (xmin, ymin, xmax, ymax), all torchvision utilities related to segmentation masks are written for case 3: binary boolean masks, where the batch dimension enumerates the objects and the remaining two dimensions hold boolean values marking where each object is present.

Utilities are provided to visualize boolean masks; see draw_segmentation_masks. It can be used for both instance segmentation and semantic segmentation models.

Also see the recently added masks_to_boxes operator (#4290). It finds the bounding boxes for given boolean masks, which can then be used to train a detection model. A comprehensive example will be up in the gallery soon (#4484).

So if there is utility code that converts the different mask formats to boolean tensors, that would cover the need. Let me know your thoughts @vadimkantorov

cc @datumbox @NicolasHug as they would know use cases better :smiley:

vadimkantorov commented 3 years ago

Let me know your thoughts @vadimkantorov

Honestly, I think the most practical thing would be to have utility functions that allow the maximum flexibility to convert between all these formats: integer label maps, rgb label maps, binary masks (maybe even bit masks), RLE compression (and maybe some other simple compressed representations)

Unlike boxes, we cannot interchangeably convert to each other types. E.g. from binary (boolean) mask it won't be possible to get same RGB Label maps.

For visualization purposes, it may still make sense to support Boolean -> RGB by letting the user provide the palette (plus a utility function to generate palettes from the HSL color wheel), e.g. one can map the 0th binary mask to the 0th color of the palette
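
A minimal sketch of that idea, assuming non-overlapping (N, H, W) boolean masks. Both function names are hypothetical; the palette is generated with the standard-library colorsys module by stepping the hue around the color wheel.

```python
import colorsys

import torch


def hsl_palette(n, lightness=0.5, saturation=1.0):
    """Hypothetical helper: n evenly spaced hues -> (n, 3) uint8 RGB palette."""
    colors = [colorsys.hls_to_rgb(i / n, lightness, saturation) for i in range(n)]
    return (torch.tensor(colors) * 255).round().to(torch.uint8)


def masks_to_rgb(masks, palette):
    """Paint the i-th boolean mask with the i-th palette color.

    masks: (N, H, W) bool, palette: (N, 3) uint8. Returns (3, H, W) uint8.
    Pixels covered by no mask stay black; later masks overwrite earlier ones.
    """
    rgb = torch.zeros(3, *masks.shape[-2:], dtype=torch.uint8)
    for mask, color in zip(masks, palette):
        rgb[:, mask] = color[:, None]
    return rgb


palette = hsl_palette(3)
masks = torch.zeros(3, 2, 2, dtype=torch.bool)
masks[0, 0, 0] = True
masks[1, 1, 1] = True
rgb = masks_to_rgb(masks, palette)
```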

In torchvision all utilities related to segmentation masks are written considering the Case 3.

For high-resolution images with a lot of objects, this can become a bottleneck memory-wise. I guess that's the reason why COCO uses RLE compression.
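
To make the storage argument concrete, here is a simplified run-length codec in plain Python. It is only a sketch: the function names are hypothetical, it works on a flat 0/1 sequence, and it does not reproduce the compressed string format that pycocotools writes. The only convention borrowed from COCO is that the first count refers to a run of zeros.

```python
def rle_encode(bits):
    """Encode a flat 0/1 sequence as run lengths, starting with a run of zeros."""
    counts = []
    prev, run = 0, 0
    for bit in bits:
        bit = int(bool(bit))
        if bit != prev:
            # Close the current run (it may be empty if the data starts with 1).
            counts.append(run)
            prev, run = bit, 0
        run += 1
    counts.append(run)
    return counts


def rle_decode(counts):
    """Inverse of rle_encode: alternate runs of zeros and ones."""
    bits = []
    value = 0
    for n in counts:
        bits.extend([value] * n)
        value = 1 - value
    return bits


flat_mask = [0, 0, 1, 1, 1, 0, 1]
counts = rle_encode(flat_mask)
```

A mask that is mostly background collapses to a handful of counts regardless of resolution, whereas the dense boolean form grows with H x W per object.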

Even if this is the case in torchvision, it is not the case for a lot of legacy and interop formats. I think it is very useful to support functions that convert between all of the formats as much as possible. Even for bounding boxes, there may be different ways of interpreting the boxes: https://ppwwyyxx.com/blog/2021/Where-are-Pixels/, so I think it's useful to have functions for converting between xyxy, xywh, cxcyhalfwhalfh etc, maybe even accepting an argument that specifies the coordinate frame (corners or pixel centers)

Maybe something like mask_convert ?

Maybe. But even if it pollutes some special masks or segmentation namespace, I don't think it's very disturbing. Here are tensorflow utils: https://github.com/tensorflow/tpu/tree/master/models/official/detection/utils, detectron utils: https://detectron2.readthedocs.io/en/latest/_modules/detectron2/layers/mask_ops.html, https://github.com/facebookresearch/detr/blob/main/util/box_ops.py

oke-aditya commented 3 years ago

Hi @vadimkantorov

  1. I agree that the utility functions should be flexible and that we should try to support the commonly used standard conversions.
  2. The visualization utility does allow you to choose a color palette; by default it generates colors. But supporting RGB-type masks directly would overload and complicate it. The visualization utilities aim to be minimalist. They also do the job without libraries such as seaborn / matplotlib, etc. There may be bottlenecks, but they are intended as helpers and are not meant to be in the critical path of the code.
  3. Yes, supporting as many conversions as possible sounds good. For boxes, popular formats such as xyxy, xywh and cxcywh are supported. If people need additional formats, expanding the set can be considered. Note that, to keep the conversions simple, we do all of them through xyxy. E.g. to convert xywh to cxcywh we first do xywh -> xyxy and then xyxy -> cxcywh. This happens internally; there is no direct code for xywh -> cxcywh.
  4. All Detr utils have been migrated to torchvision. Detectron2 utils can be found in detection/transforms.py, so they too are present in torchvision. I'm not sure about the tensorflow utils.
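
The "everything goes through xyxy" design from point 3 can be illustrated with a small sketch. torchvision exposes the real conversion as torchvision.ops.box_convert; the code below is not its source, and the names convert_boxes, _to_xyxy and _from_xyxy are hypothetical.

```python
import torch


def _to_xyxy(boxes, fmt):
    """Convert (..., 4) boxes from fmt to the hub format xyxy."""
    if fmt == "xyxy":
        return boxes
    a, b, w, h = boxes.unbind(-1)
    if fmt == "xywh":  # (xmin, ymin, width, height)
        return torch.stack([a, b, a + w, b + h], dim=-1)
    if fmt == "cxcywh":  # (center x, center y, width, height)
        return torch.stack([a - w / 2, b - h / 2, a + w / 2, b + h / 2], dim=-1)
    raise ValueError(f"unsupported format: {fmt}")


def _from_xyxy(boxes, fmt):
    """Convert (..., 4) boxes from the hub format xyxy to fmt."""
    if fmt == "xyxy":
        return boxes
    x1, y1, x2, y2 = boxes.unbind(-1)
    w, h = x2 - x1, y2 - y1
    if fmt == "xywh":
        return torch.stack([x1, y1, w, h], dim=-1)
    if fmt == "cxcywh":
        return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2, w, h], dim=-1)
    raise ValueError(f"unsupported format: {fmt}")


def convert_boxes(boxes, in_fmt, out_fmt):
    # Only 2 * F functions are needed for F formats, instead of F * (F - 1)
    # direct pairwise conversions.
    return _from_xyxy(_to_xyxy(boxes, in_fmt), out_fmt)


xywh = torch.tensor([[10.0, 20.0, 30.0, 40.0]])
cxcywh = convert_boxes(xywh, "xywh", "cxcywh")
```

The trade-off is one extra intermediate tensor per conversion in exchange for far fewer code paths to test when a new format is added.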

Can you list out what other mask utilities would be beneficial in torchvision?

I see mask_convert as one candidate.

Maybe we can refer to Detectron2 masks? https://github.com/facebookresearch/detectron2/blob/main/detectron2/structures/masks.py

vadimkantorov commented 3 years ago

2. But supporting RGB type masks directly would overload it and complicate.

This already exists in legacy datasets such as Pascal, and I imagine it is the same in many other datasets from that era. So this is a very valid format for conversion. The direction integer label maps -> RGB label maps is also well defined outside of a purely visualization context: this conversion is needed to prepare the original "submission" files and to use the original evaluation routines. So it may be good to rename this function, or to have a generic conversion function that redirects to it.

3. We do all conversions through xyxy.

Why not, but it should be super-clear in the docs what coordinate frame is used in the context of the problem explained in https://ppwwyyxx.com/blog/2021/Where-are-Pixels/

4. I'm not sure about tensorflow utils.

I brought these up only as examples of existing code that does a lot of these conversions and that may serve as inspiration for real-world needs. Even if they were ported to transforms.py, it would be good to refactor some of them and bring them over to a more unified and generic mask_convert (and the same for boxes), as you mentioned.

Maybe we can refer to Detectron2 masks?

I think overall it is good, but maybe an alternative could be to also have public "free functions" if the user does not want to use the classes (given that historically in pytorch support for Tensor subclasses/subtypes isn't very developed)

oke-aditya commented 3 years ago

Great. I agree with you about conversion formats.

So, if I understand correctly, a mask_convert utility is what we need?

Or are there any other free functions that would be beneficial?

cc @datumbox as he would understand the ideas better.

vadimkantorov commented 3 years ago

One other util from detectron2 - paste_masks_in_images

oke-aditya commented 3 years ago

This is already present in torchvision in roi_heads.py

https://github.com/pytorch/vision/blob/main/torchvision/models/detection/roi_heads.py#L401

vadimkantorov commented 2 years ago

paste_masks_in_images in detectron2 has a slightly different API, e.g. it supports a float threshold arg: https://github.com/facebookresearch/detectron2/blob/c85f21fd9e64620a30eff57f4185374a1f9ace7b/detectron2/layers/mask_ops.py#L74

If they are equivalent, it would be best if detectron2 migrated to the torchvision version to avoid confusion between their functionality

vadimkantorov commented 2 years ago

It's probably also worth promoting it to a higher-level namespace for more visibility and supportability

vadimkantorov commented 1 year ago

Also, See the recently added masks_to_boxes #4290 operator.

Just checked it again. It seems that the batching is not vectorized, although for the binary mask format it could be vectorized with the amin/amax modes of scatter_reduce. A vectorized version would be useful, because extracting connected-component/superpixel stats about segments is a common need (both from binary masks and from integer masks that hold a segment index)
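
A rough sketch of the scatter_reduce approach for integer label maps, assuming a PyTorch version in which Tensor.scatter_reduce accepts reduce="amin" and reduce="amax". The function name label_maps_to_boxes is hypothetical and this is not how torchvision's masks_to_boxes is implemented.

```python
import torch


def label_maps_to_boxes(labels, num_labels):
    """Hypothetical helper: (B, H, W) int64 label maps -> (B, K, 4) xyxy boxes.

    Coordinates are inclusive pixel indices. Labels that do not occur in an
    image get the degenerate box (W, H, -1, -1).
    """
    B, H, W = labels.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    idx = labels.flatten(1)            # (B, H*W): target slot of every pixel
    xs = xs.flatten().repeat(B, 1)     # (B, H*W): x coordinate of every pixel
    ys = ys.flatten().repeat(B, 1)     # (B, H*W): y coordinate of every pixel

    def reduce(src, init, mode):
        out = torch.full((B, num_labels), init, dtype=torch.int64)
        # Each pixel scatters its coordinate into the slot of its label;
        # amin/amax keep the extreme coordinate per label, with no Python loop.
        return out.scatter_reduce(1, idx, src, reduce=mode)

    xmin, ymin = reduce(xs, W, "amin"), reduce(ys, H, "amin")
    xmax, ymax = reduce(xs, -1, "amax"), reduce(ys, -1, "amax")
    return torch.stack([xmin, ymin, xmax, ymax], dim=-1)


labels = torch.tensor([[[0, 1, 1],
                        [0, 0, 2]]])
boxes = label_maps_to_boxes(labels, num_labels=3)
```

Binary masks of shape (N, H, W) can be handled the same way by first turning them into a label map, or by scattering only the coordinates where the mask is set.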

Most of the box ops there needlessly lack support for multiple batch dimensions. This can mostly be fixed by replacing [:, with [..., since we often have two batch dimensions: batch of images x fixed number of boxes per image
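
For instance, an area computation written with ellipsis indexing works for any number of leading batch dimensions. This is a sketch with a hypothetical name (box_area_nd), assuming xyxy boxes in the last dimension.

```python
import torch


def box_area_nd(boxes):
    """Area of xyxy boxes of shape (..., 4); returns a tensor of shape (...)."""
    # boxes[..., i] selects along the last dimension only, so the same code
    # serves (N, 4), (B, N, 4), or any deeper batching. boxes[:, i] would
    # instead index the second dimension and break for (B, N, 4) inputs.
    return (boxes[..., 2] - boxes[..., 0]) * (boxes[..., 3] - boxes[..., 1])


boxes = torch.zeros(2, 3, 4)   # batch of 2 images x 3 boxes per image
boxes[..., 2] = 2.0            # xmax
boxes[..., 3] = 3.0            # ymax
areas = box_area_nd(boxes)
```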

Also, at the ops level, it's super important for the docs to explain which format the masks are expected in, since different formats are most useful in different contexts. IMO verbosity here can only help