Open vadimkantorov opened 3 years ago
Can you elaborate a bit? I'm not experienced enough to fully understand the above. :smiley:
In the semantic/instance segmentation context, segmentations are usually represented using some sort of mask:
1) Integer label maps: every pixel holds the ID of the instance/category that it belongs to.
2) RGB label maps: every pixel is assigned an RGB value, and each unique RGB tuple stands for an instance/category ID. This is how Pascal and other datasets represent the ground truth. This representation is also useful for visually inspecting segmentation results for multiple objects in an image.
3) Binary mask maps, where the batch dimension enumerates the objects/categories present in an image.
4) Compressed RLE representation, as used in COCO and the official pycocotools. This allows efficient storage of masks when there are many instances in the image.
Some representations are used for the ground truth in datasets and optimized for efficient storage, others are more convenient for learning targets, or for manipulating the masks in the code
Given these representations, one often needs to:
1) convert between them, e.g. extract unique RGB colors; get an integer label map from an RGB label map; get binary masks from integer label maps; convert to RGB label maps for visualization; compress and decompress masks
2) extract the bounding boxes corresponding to each segment
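To make the conversions listed above concrete, here is a minimal sketch in plain Python, with nested lists standing in for tensors. All function names are hypothetical and are not existing torchvision APIs; a real implementation would be vectorized.

```python
# Illustrative sketch only: nested lists stand in for H x W tensors.
# Every function name here is hypothetical, not a torchvision API.

def rgb_to_label_map(rgb_map):
    """Map each unique RGB tuple to an integer ID, in order of first appearance."""
    color_to_id = {}
    label_map = []
    for row in rgb_map:
        label_row = []
        for color in row:
            # setdefault hands out the next free ID the first time a color is seen
            label_row.append(color_to_id.setdefault(tuple(color), len(color_to_id)))
        label_map.append(label_row)
    return label_map, color_to_id


def label_map_to_binary_masks(label_map):
    """Return (ids, masks): one boolean H x W mask per unique ID, sorted by ID."""
    ids = sorted({label for row in label_map for label in row})
    masks = [[[label == i for label in row] for row in label_map] for i in ids]
    return ids, masks


def binary_masks_to_label_map(ids, masks, background=0):
    """Inverse direction; if masks overlap, the later mask wins."""
    height, width = len(masks[0]), len(masks[0][0])
    label_map = [[background] * width for _ in range(height)]
    for i, mask in zip(ids, masks):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    label_map[y][x] = i
    return label_map


BLACK, RED, GREEN = (0, 0, 0), (255, 0, 0), (0, 255, 0)
rgb = [
    [BLACK, RED, RED],
    [BLACK, GREEN, RED],
]
labels, color_to_id = rgb_to_label_map(rgb)     # [[0, 1, 1], [0, 2, 1]]
ids, masks = label_map_to_binary_masks(labels)  # ids == [0, 1, 2]
```

Note that the integer IDs depend on the order in which colors are first seen, so a dataset with a fixed color-to-class table would pass that table in instead of building it on the fly.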
Hi @vadimkantorov
Just like bounding boxes, segmentation masks come in multiple formats, as you mentioned.
Unlike boxes, the mask formats cannot all be converted into one another. E.g. from a binary (boolean) mask it is not possible to recover the same RGB label map, although the reverse conversion is feasible.
Hence, it might be useful to provide a utility for converting masks. But we need to ensure that it won't pollute the namespace with too many functions. Maybe something like `mask_convert`?
For boxes, we assume the Pascal VOC format `(xmin, ymin, xmax, ymax)`.
In torchvision, all utilities related to segmentation masks are written for case 3: binary boolean masks, where the batch dimension indexes the objects and the remaining two dimensions are boolean, marking where the object is present.
Utilities are provided to visualize boolean masks; see `draw_segmentation_masks`. It can be used for both instance segmentation and semantic segmentation models.
Also see the recently added `masks_to_boxes` operator (#4290). It finds bounding boxes given boolean masks, which can then be used to train a detection model. A comprehensive example will be up soon in the gallery (#4484).
So if there were utility code for converting the different mask formats to boolean tensors, that would cover the need. Let me know your thoughts @vadimkantorov
cc @datumbox @NicolasHug as they would know use cases better :smiley:
> Let me know your thoughts @vadimkantorov

Honestly, I think the most practical thing would be utility functions that allow maximum flexibility in converting between all these formats: integer label maps, RGB label maps, binary masks (maybe even bit masks), RLE compression (and maybe some other simple compressed representations)
> Unlike boxes, we cannot interchangeably convert to each other types. E.g. from binary (boolean) mask it won't be possible to get same RGB Label maps.

For visualization purposes, it may still make sense to support Boolean -> RGB by letting the user provide a palette (plus a utility function to generate palettes from the HSL color wheel), e.g. one can map the 0th binary mask to the 0th color of the palette
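A rough sketch of that idea, using only the standard library: `make_palette` spaces hues evenly around the HSL color wheel, and `binary_masks_to_rgb` paints the i-th mask with the i-th palette color. Both names are hypothetical, and the overlap rule (later masks overwrite earlier ones) is an assumption made for illustration.

```python
import colorsys


def make_palette(num_colors, lightness=0.5, saturation=1.0):
    """Evenly spaced hues around the HSL color wheel, as 8-bit RGB tuples."""
    palette = []
    for k in range(num_colors):
        # colorsys uses the argument order (hue, lightness, saturation)
        r, g, b = colorsys.hls_to_rgb(k / num_colors, lightness, saturation)
        palette.append((round(r * 255), round(g * 255), round(b * 255)))
    return palette


def binary_masks_to_rgb(masks, palette, background=(0, 0, 0)):
    """Paint the i-th boolean mask with the i-th palette color.

    Assumption: where masks overlap, the later mask overwrites the earlier one.
    """
    height, width = len(masks[0]), len(masks[0][0])
    rgb = [[background] * width for _ in range(height)]
    for color, mask in zip(palette, masks):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    rgb[y][x] = color
    return rgb


masks = [
    [[True, False], [False, False]],  # object 0 covers the top-left pixel
    [[False, False], [False, True]],  # object 1 covers the bottom-right pixel
]
palette = make_palette(len(masks))
rgb = binary_masks_to_rgb(masks, palette)
```

The mapping is lossy in the other direction only if the palette is discarded; keeping the palette alongside the image makes RGB -> boolean recoverable.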
> In torchvision all utilities related to segmentation masks are written considering the Case 3.

For high-resolution images with a lot of objects, this can become a memory bottleneck. I guess that's the reason why COCO uses RLE compression.
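For illustration, here is a minimal run-length encoder/decoder in plain Python. It is modelled on COCO's uncompressed RLE layout (column-major scan, alternating runs that start with a run of zeros), but it is only a sketch: the function names are hypothetical and it does not produce the compressed string format that pycocotools uses.

```python
def rle_encode(mask):
    """Encode a boolean H x W mask as run lengths, scanning column by column.

    Runs alternate between False and True and always start with a False run
    (which has length 0 if the first pixel is True).
    """
    height, width = len(mask), len(mask[0])
    counts, current, run = [], False, 0
    for x in range(width):
        for y in range(height):
            if bool(mask[y][x]) == current:
                run += 1
            else:
                counts.append(run)
                current, run = bool(mask[y][x]), 1
    counts.append(run)
    return {"size": (height, width), "counts": counts}


def rle_decode(rle):
    """Rebuild the boolean H x W mask from the run lengths."""
    height, width = rle["size"]
    flat, value = [], False
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = not value
    # flat is in column-major order: pixel (y, x) lives at index x * height + y
    return [[flat[x * height + y] for x in range(width)] for y in range(height)]


mask = [
    [False, True, True],
    [False, True, False],
]
rle = rle_encode(mask)  # column-major scan gives runs of 2 False, 3 True, 1 False
```

A mask with a handful of large objects collapses to a few integers per object, which is where the memory saving over dense boolean masks comes from.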
Even if in torchvision this is the case, it is not the case for a lot of legacy and interop formats. I think it is very useful to support functions to convert between all of the formats as much as possible. Even for bounding boxes, there may be different ways of interpreting the boxes: https://ppwwyyxx.com/blog/2021/Where-are-Pixels/, so I think it's useful to have functions for conversion between xyxy to xywh and cxcyhalfwhalfh etc and maybe even accepting some argument specifying the coordinate frame (corners or pixel centers)
> Maybe something like `mask_convert`?

Maybe. But even if it pollutes some special `masks` or `segmentation` namespace, I don't think it's very disturbing. Here are tensorflow utils: https://github.com/tensorflow/tpu/tree/master/models/official/detection/utils, detectron utils: https://detectron2.readthedocs.io/en/latest/_modules/detectron2/layers/mask_ops.html, and DETR box ops: https://github.com/facebookresearch/detr/blob/main/util/box_ops.py
Hi @vadimkantorov
`xyxy`, `xywh` and `cxcywh` are supported. If people need additional formats, expanding this can be considered. Note that, to keep things simple, we do all conversions through `xyxy`. E.g. to convert `xywh` to `cxcywh` we first do `xywh -> xyxy`, then `xyxy -> cxcywh`. This happens internally; there is no direct code for `xywh -> cxcywh`.
Can you list what other mask utilities would be beneficial in torchvision?
I see `mask_convert` as one candidate.
Maybe we can refer to Detectron2 masks? https://github.com/facebookresearch/detectron2/blob/main/detectron2/structures/masks.py
> 2. But supporting RGB type masks directly would overload it and complicate.

This already exists in legacy datasets such as Pascal, and I imagine it is the same in many other datasets from that era. So this is a very valid format for conversion. The direction integer label maps -> RGB label maps is also well defined even outside a purely visualization context: this conversion is needed to prepare the original "submission" files and use the original evaluation routines. So it may be good to rename this function or have a generic conversion function redirect to it.
> 3. We do all conversions through xyxy.

Why not, but the docs should make it super clear which coordinate frame is used, given the problem explained in https://ppwwyyxx.com/blog/2021/Where-are-Pixels/
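As a sketch of the "everything goes through xyxy" routing, and of why the coordinate frame needs to be documented: the code below is plain Python with hypothetical function names, not torchvision's implementation. It assumes continuous coordinates, so width is `xmax - xmin`; under an inclusive pixel-index convention the width would be `xmax - xmin + 1` instead, which is exactly the ambiguity the docs should resolve.

```python
# Hypothetical helper names; assumption: continuous coordinates (w = x2 - x1).

def _to_xyxy(box, fmt):
    a, b, c, d = box
    if fmt == "xyxy":
        return (a, b, c, d)
    if fmt == "xywh":    # top-left corner plus width/height
        return (a, b, a + c, b + d)
    if fmt == "cxcywh":  # box center plus width/height
        return (a - c / 2, b - d / 2, a + c / 2, b + d / 2)
    raise ValueError(f"unsupported box format: {fmt!r}")


def _from_xyxy(box, fmt):
    x1, y1, x2, y2 = box
    if fmt == "xyxy":
        return (x1, y1, x2, y2)
    if fmt == "xywh":
        return (x1, y1, x2 - x1, y2 - y1)
    if fmt == "cxcywh":
        return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
    raise ValueError(f"unsupported box format: {fmt!r}")


def convert_box(box, in_fmt, out_fmt):
    """Convert via the xyxy hub: N formats need 2N helpers instead of N*(N-1)."""
    return _from_xyxy(_to_xyxy(box, in_fmt), out_fmt)


center_box = convert_box((10, 20, 30, 40), "xywh", "cxcywh")  # (25.0, 40.0, 30, 40)
```

Adding a new format then only requires one branch in each helper, at the cost of one extra intermediate conversion.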
> 4. I'm not sure about tensorflow utils.

I brought these up only as relevant existing places that do a lot of these conversions and may serve as inspiration for real-world needs. Even if they were ported to transform.py, it would be good to refactor some of them and bring them over to a more unified and generic `mask_convert` (and the same for boxes), as you mentioned.
> Maybe we can refer to Detectron2 masks?

I think overall it is good, but an alternative could be to also have public "free functions" for users who do not want to use the classes (given that historically PyTorch support for Tensor subclasses/subtypes hasn't been very developed)
Great. I agree with you about conversion formats.
So, if I understand correctly, a `mask_convert` utility is what we need?
Or are there any other such free functions that would be beneficial?
cc @datumbox as he would understand the ideas better.
One other util from detectron2: `paste_masks_in_images`
This is already present in torchvision in roi_heads.py: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/roi_heads.py#L401
`paste_masks_in_images` in detectron2 has a slightly different API, e.g. it supports a float `threshold` arg: https://github.com/facebookresearch/detectron2/blob/c85f21fd9e64620a30eff57f4185374a1f9ace7b/detectron2/layers/mask_ops.py#L74
If they are equivalent, it would be best if detectron2 migrated to the torchvision version, to avoid confusion between the two
It's probably also worth promoting it to a higher-level namespace for more visibility and supportability
> Also, See the recently added `masks_to_boxes` #4290 operator.

Just checked it again. It seems that batching is not vectorized (though for the binary mask format it could be vectorized using scatter_reduce in amin/amax modes). Vectorizing it would be useful, as extracting connected-component/superpixel stats about segments is useful (both from binary masks and from integer masks that hold the segment index)
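To show what I mean for integer masks, here is a plain-Python sketch (hypothetical function name, not the torchvision operator). The running min/max per segment ID plays the role that a scatter-style reduction with amin/amax modes would play in a vectorized tensor version: every pixel "scatters" its coordinates into the slot of its segment. The sketch assumes label 0 is background and returns inclusive pixel indices.

```python
def label_map_to_boxes(label_map, background=0):
    """Return {segment_id: (xmin, ymin, xmax, ymax)} in one pass over the pixels.

    Assumptions: `background` pixels are skipped, and xmax/ymax are inclusive
    pixel indices rather than continuous coordinates.
    """
    boxes = {}
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            if label == background:
                continue
            if label not in boxes:
                boxes[label] = (x, y, x, y)
            else:
                x1, y1, x2, y2 = boxes[label]
                # min-reduce the top-left corner, max-reduce the bottom-right
                boxes[label] = (min(x1, x), min(y1, y), max(x2, x), max(y2, y))
    return boxes


label_map = [
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 0, 2],
]
boxes = label_map_to_boxes(label_map)  # {1: (1, 0, 2, 1), 2: (3, 1, 3, 2)}
```

The same single pass could accumulate other per-segment stats (area, centroid sums) alongside the box corners.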
Most of the box ops there needlessly lack support for multiple batch dimensions. This can mostly be fixed by replacing `[:, ` with `[..., `, as we often have two batch dimensions: batch of images x fixed number of boxes per image
Also, at the ops level, it's super important for the docs to explain which format the masks are expected in, as different formats are most useful in different contexts. IMO verbosity here can only be useful
🚀 The feature
Motivation, pitch
In detection/segmentation these utils are very frequent