Ability to add extra custom roi-heads to generalizedRCNN models

🚀 Feature

This feature would allow adding custom RoI heads to any existing GeneralizedRCNN model.

Motivation

While the existing GeneralizedRCNN models already cover a lot, one might want to make extra predictions per detection (e.g. the number of sides of an object) without having to alter the underlying torchvision code.

Pitch

The idea would be to be able to provide an extensions class (inheriting from RoIHeads), that preserves all the current behaviour but also exposes in the forward pass all the necessary elements (proposals, matched_idxs, labels) for an extra head to compute its own predictions.

Alternatives

EDIT

The alternative below was my initial idea. However, in the meantime I have found a far simpler solution, which can be found in the first comment of this thread. As such, please feel free to ignore the alternative described here.

END OF EDIT

So far I have the current proposal:

Allow for passing to the constructor of GeneralizedRCNN models (faster_rcnn, mask_rcnn, keypoint_rcnn) a custom transform (similar to GeneralizedRCNNTransform, probably inheriting from it) that handles any necessary transformations to be done to the extra heads' targets (this custom transform might not even be necessary, depending on the extra heads).
Allow for passing to the constructor of GeneralizedRCNN models (faster_rcnn, mask_rcnn, keypoint_rcnn) an instance of RoIHeadsExtensions, which would inherit from RoIHeads, preserving all its current behaviour but also exposing in the forward pass all the necessary elements (proposals, matched_idxs, labels) for an extra head to compute its own predictions.

Example (for faster_rcnn):

def __init__(self, backbone, num_classes=None,
                 # transform parameters
                 min_size=800, max_size=1333,
                 image_mean=None, image_std=None,

                 transform=None, # NEW

                 # RPN parameters
                 rpn_anchor_generator=None, rpn_head=None,
                 rpn_pre_nms_top_n_train=2000, rpn_pre_nms_top_n_test=1000,
                 rpn_post_nms_top_n_train=2000, rpn_post_nms_top_n_test=1000,
                 rpn_nms_thresh=0.7,
                 rpn_fg_iou_thresh=0.7, rpn_bg_iou_thresh=0.3,
                 rpn_batch_size_per_image=256, rpn_positive_fraction=0.5,
                 # Box parameters
                 box_roi_pool=None, box_head=None, box_predictor=None,
                 box_score_thresh=0.05, box_nms_thresh=0.5, box_detections_per_img=100,
                 box_fg_iou_thresh=0.5, box_bg_iou_thresh=0.5,
                 box_batch_size_per_image=512, box_positive_fraction=0.25,
                 bbox_reg_weights=None,

                 # RoI heads extensions # NEW
                 roi_heads_extensions=None):

Adapting existing code to allow for a custom transform would be as simple as changing, in faster_rcnn, from:

if image_mean is None:
    image_mean = [0.485, 0.456, 0.406]
if image_std is None:
    image_std = [0.229, 0.224, 0.225]
transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)

super(FasterRCNN, self).__init__(backbone, rpn, roi_heads, transform)

to:

if transform is None:
    if image_mean is None:
        image_mean = [0.485, 0.456, 0.406]
    if image_std is None:
        image_std = [0.229, 0.224, 0.225]
    transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)

super(FasterRCNN, self).__init__(backbone, rpn, roi_heads, transform)

As for creating the RoIHeadsExtensions class, it would be necessary to change the RoIHeads class in the following way:

add, at construction time, an internal parameter that identifies if extensions exist, and by default is false.
```
self.has_extensions = False
```

change the return of forward from:

return result, losses

to:

if self.has_extensions:
    return result, losses, (proposals, matched_idxs, labels)
return result, losses

thus allowing the extensions to access the proposals, matched_idxs and labels.

Now, the RoIHeadsExtensions class itself would simply hold the extra heads and mimic RoIHeads as much as possible. So far I had in mind something like:

class RoIHeadsExtensions(RoIHeads):
    # Note that depending on your extensions, you might have to create your own GeneralizedRCNNTransform.

    def __init__(self, extensions):
        # type: (List[CustomRoIHead])
        self.extensions = extensions
        super(RoIHeads, self).__init__()

    def add_base(self, roi_heads):
        # type: (RoIHeads)

        self.has_extensions = True

        self.box_similarity   = roi_heads.box_similarity # ISSUE EDIT -> 'roi_heads.box_ops.box_iou' was wrong!
        self.proposal_matcher = roi_heads.proposal_matcher
        self.fg_bg_sampler    = roi_heads.fg_bg_sampler
        self.box_coder        = roi_heads.box_coder

        self.box_roi_pool  = roi_heads.box_roi_pool
        self.box_head      = roi_heads.box_head
        self.box_predictor = roi_heads.box_predictor

        self.score_thresh       = roi_heads.score_thresh
        self.nms_thresh         = roi_heads.nms_thresh
        self.detections_per_img = roi_heads.detections_per_img

        has_mask = roi_heads.has_mask()
        self.mask_roi_pool  = roi_heads.mask_roi_pool if has_mask else None
        self.mask_head      = roi_heads.mask_head if has_mask else None
        self.mask_predictor = roi_heads.mask_predictor if has_mask else None

        has_keypoint = roi_heads.has_keypoint()
        self.keypoint_roi_pool  = roi_heads.keypoint_roi_pool if has_keypoint else None
        self.keypoint_head      = roi_heads.keypoint_head if has_keypoint else None
        self.keypoint_predictor = roi_heads.keypoint_predictor if has_keypoint else None

    def forward(self, features, proposals, image_shapes, targets=None):
        # type: (Dict[str, Tensor], List[Tensor], List[Tuple[int, int]], Optional[List[Dict[str, Tensor]]])
        """
        Arguments:
            features (List[Tensor])
            proposals (List[Tensor[N, 4]])
            image_shapes (List[Tuple[H, W]])
            targets (List[Dict])
        """
        result, losses, values_for_extension = super(RoIHeadsExtensions, self).forward(features, proposals, image_shapes, targets)

        for extension in self.extensions:
            extension.forward(result, losses, features, image_shapes, targets, values_for_extension) # ISSUE EDIT -> was missing image_shapes!

        return result, losses

This would get wired into faster_rcnn by simply adding

if roi_heads_extensions:
    roi_heads_extensions.add_base(roi_heads)
    roi_heads = roi_heads_extensions

to the end of

roi_heads = RoIHeads(
    # Box
    box_roi_pool, box_head, box_predictor,
    box_fg_iou_thresh, box_bg_iou_thresh,
    box_batch_size_per_image, box_positive_fraction,
    bbox_reg_weights,
    box_score_thresh, box_nms_thresh, box_detections_per_img)

yielding:

roi_heads = RoIHeads(
    # Box
    box_roi_pool, box_head, box_predictor,
    box_fg_iou_thresh, box_bg_iou_thresh,
    box_batch_size_per_image, box_positive_fraction,
    bbox_reg_weights,
    box_score_thresh, box_nms_thresh, box_detections_per_img)
if roi_heads_extensions:
    roi_heads_extensions.add_base(roi_heads)
    roi_heads = roi_heads_extensions

As far as I can see, this would preserve the existing behaviour of all three models. It would require minimal changes to mask_rcnn and keypoint_rcnn (just their own parameters and the call to super(), i.e. faster_rcnn), slightly larger changes to faster_rcnn, and more significant changes to roi_heads.py, all of which still preserve the current behaviour.

After messing around with the code a bit more, I have found a far simpler alternative, that requires only a minute amount of changes to roi_heads.py, while still preserving all previous behaviour.

The idea would be to add to the (GeneralizedRCNN) model, after it has been created, the extra heads, by setting the value of an internal variable of the RoIHeads class. This variable starts as None, by default, and remains that way when no extra RoI heads exist. So, in the RoIHeads class __init__ we would add:

self.extra_roi_heads = None

However, the user can manually set the value of the variable to a torch.nn.ModuleDict that holds the extra RoI heads (the only benefit of using a dict is that it allows naming individual extra heads), e.g.:

model.roi_heads.extra_roi_heads = user_defined_extra_roi_heads

Other than that, only minimal changes are necessary to get the extra RoI heads to work.

We check the targets of the extra roi heads, by adding

if self.extra_roi_heads is not None:
    for target_type in self.extra_roi_heads:
        assert all([target_type in t for t in targets])

to the check_targets function.
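The check above can be tried in isolation. What follows is a minimal, framework-free sketch rather than torchvision code: `check_extra_targets` is a hypothetical helper, the head name `num_sides` is made up for illustration, and plain dicts stand in for both the `torch.nn.ModuleDict` of extra heads and the per-image target dicts. It raises a `ValueError` instead of using `assert` so that the failing images can be reported.

```python
def check_extra_targets(targets, extra_roi_heads):
    """Raise if any per-image target lacks an entry for one of the extra heads.

    `extra_roi_heads` maps head names to heads (a plain dict here, standing in
    for a torch.nn.ModuleDict); `targets` is a list of per-image dicts.
    """
    if extra_roi_heads is None:
        return  # no extra heads: nothing to verify
    for target_type in extra_roi_heads:
        # Collect the indices of images whose target dict misses this key.
        missing = [i for i, t in enumerate(targets) if target_type not in t]
        if missing:
            raise ValueError(
                f"targets for images {missing} lack the key {target_type!r}"
            )


extra_heads = {"num_sides": object()}  # hypothetical head name
good_targets = [{"boxes": [], "labels": [], "num_sides": []}]
bad_targets = [{"boxes": [], "labels": []}]

check_extra_targets(good_targets, extra_heads)  # passes silently
try:
    check_extra_targets(bad_targets, extra_heads)
    failed = False
except ValueError:
    failed = True
```

The design relies on each extra head being registered under the same name as the target key it consumes, which is what iterating over the dict keys in the proposed check implies.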

And finally, we compute the forward pass of the extra RoI heads, by adding the following to the end of the RoIHeads class forward pass:

# Run the extra heads.
if self.extra_roi_heads is not None:
    values_for_extra_head = proposals, matched_idxs, labels
    for _, extra_roi_head in self.extra_roi_heads.items():
        extra_roi_head(result, losses, features, image_shapes, targets, values_for_extra_head)

The exact behaviour of each extra head is, obviously, left to the user.
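To make the calling convention concrete, here is a hedged sketch of what one such extra head could look like. `NumSidesHead` is purely hypothetical, it is a plain class rather than a `torch.nn.Module` so that the snippet runs without PyTorch, and the values it writes are placeholders. The point it illustrates is that the proposed loop ignores the head's return value, so a head has to communicate by mutating `result` and `losses` in place.

```python
class NumSidesHead:
    """Hypothetical extra head; a plain class standing in for an nn.Module."""

    def __init__(self, training=True):
        self.training = training

    def __call__(self, result, losses, features, image_shapes, targets,
                 values_for_extra_head):
        # Same tuple layout as in the proposal above.
        proposals, matched_idxs, labels = values_for_extra_head
        if self.training:
            # Training: contribute a loss term; 0.0 is a placeholder value.
            losses["loss_num_sides"] = 0.0
        else:
            # Inference: attach one placeholder prediction per detection.
            for detections in result:
                detections["num_sides"] = [None] * len(detections["boxes"])


# Inference-style call with two detections in a single image.
result = [{"boxes": [[0, 0, 10, 10], [5, 5, 20, 20]]}]
losses = {}
head = NumSidesHead(training=False)
head(result, losses, features=None, image_shapes=[(32, 32)], targets=None,
     values_for_extra_head=(None, None, None))
```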

This leaves only the matter of ensuring that when self.extra_roi_heads is set, it is set to a torch.nn.ModuleDict, which could maybe be done by turning self.extra_roi_heads into a property and enforcing the type in the property.setter function.
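A rough sketch of the property idea, under loud assumptions: `RoIHeadsSketch` is a toy class, not the torchvision `RoIHeads`, and `ModuleDictStandIn` replaces `torch.nn.ModuleDict` so the snippet runs without PyTorch. In a real `torch.nn.Module` subclass, attribute assignment goes through the module's own `__setattr__`, so whether a plain property is sufficient there would still need to be checked.

```python
class ModuleDictStandIn(dict):
    """Stand-in for torch.nn.ModuleDict so the sketch runs without PyTorch."""


class RoIHeadsSketch:
    """Toy class showing only the property; not the torchvision RoIHeads."""

    def __init__(self):
        self._extra_roi_heads = None  # default: no extra heads

    @property
    def extra_roi_heads(self):
        return self._extra_roi_heads

    @extra_roi_heads.setter
    def extra_roi_heads(self, value):
        # Accept None (no extra heads) or the expected container type only.
        if value is not None and not isinstance(value, ModuleDictStandIn):
            raise TypeError(
                "extra_roi_heads must be None or a ModuleDictStandIn, "
                f"got {type(value).__name__}"
            )
        self._extra_roi_heads = value


heads = RoIHeadsSketch()
heads.extra_roi_heads = ModuleDictStandIn(num_sides=object())
try:
    heads.extra_roi_heads = [object()]  # wrong container type
    rejected = False
except TypeError:
    rejected = True
```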

Hi,

Thanks a lot for the detailed proposal!

This was something that I was considering in the beginning, but one thing that stopped me from pursuing it was that, in principle, the values_for_extra_head is not a really well-defined concept.

For example, a head could also take intermediate activations from the other heads, or the output of the masks to compute a refinement, etc. Because of that, I decided that the code should be as straightforward as possible, because inheriting / modifying the forward of RoIHeads gives you full flexibility.

I agree that there might be ways to refactor a bit more the code to make writing your own forward simpler. One example would be moving these checks to a separate method https://github.com/pytorch/vision/blob/12b551e7a7232d829df0f01ae9f6c56305571dfc/torchvision/models/detection/roi_heads.py#L740-L747

Here is what I would envision for users to extend RoIHeads, via subclassing:

class MyRoIHead(RoIHeads):
    def forward(self, features, proposals, image_shapes, targets=None):
        self.check_targets(targets)
        if self.training:
            proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets)
        else:
            labels = None
            regression_targets = None
            matched_idxs = None

        box_features = self.box_roi_pool(features, proposals, image_shapes)
        box_features = self.box_head(box_features)
        class_logits, box_regression = self.box_predictor(box_features)

        # new feature here, using potentially box_features
        ...

There is still a bit of boilerplate code required as of now, but maybe it could be refactored into helper methods (like select_training_samples or postprocess_detections) to make subclassing easier. That would be my first thought on how to extend it.

Thoughts?

@fmassa I think your proposed changes would already help a lot. Also, it might be helpful, if every branch had its own method, so that a customization would look something like this:

class MyRoIHead(RoIHeads):
    def forward(self, features, proposals, image_shapes, targets=None):
        ... = self.forward_preparation(...)

        self.forward_box(...)

        if self.has_mask():
            self.forward_mask(...)

        if self.has_keypoint():
            self.forward_keypoints(...)

        # new branch
        self.forward_custom(...)

This would help to reduce the boilerplate code. What do you think?
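For illustration, the per-branch layout could be prototyped without any framework. Everything in this sketch is hypothetical: `RoIHeadsSketch` is a toy base class, and `forward_preparation`, `forward_box`, `forward_mask` and `forward_custom` are the method names suggested above, not existing torchvision methods. Each branch only records that it ran, so the control flow is easy to inspect.

```python
class RoIHeadsSketch:
    """Toy base class; each branch is a separate, reusable method."""

    def __init__(self, with_mask=False):
        self.with_mask = with_mask

    def has_mask(self):
        return self.with_mask

    def forward_preparation(self, proposals, targets):
        # Placeholder for target checks and training-sample selection.
        return proposals

    def forward_box(self, features, proposals, result, losses):
        result["branches"].append("box")

    def forward_mask(self, features, proposals, result, losses):
        result["branches"].append("mask")


class MyRoIHead(RoIHeadsSketch):
    def forward_custom(self, features, proposals, result, losses):
        # New branch; it could read anything the earlier branches stored.
        result["branches"].append("custom")

    def forward(self, features, proposals, image_shapes, targets=None):
        result, losses = {"branches": []}, {}
        proposals = self.forward_preparation(proposals, targets)
        self.forward_box(features, proposals, result, losses)
        if self.has_mask():
            self.forward_mask(features, proposals, result, losses)
        self.forward_custom(features, proposals, result, losses)
        return result, losses


result, losses = MyRoIHead(with_mask=True).forward(None, [], [(32, 32)])
```

Because the subclass writes its own `forward`, it keeps the full flexibility mentioned earlier in the thread (for instance, passing intermediate activations between branches), while the stock branches are reused rather than copied.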

Somewhat related:

It might be good to introduce static methods for the classes MaskRCNN, FasterRCNN and KeypointRCNN, so that their characteristic branches can more easily be added to custom RCNN architectures (e.g. if one wanted to combine Mask R-CNN and Keypoint R-CNN).

    class MaskRCNN:
        ...

        @staticmethod
        def get_mask_branch_parts(mask_head, mask_predictor, mask_roi_pool, num_classes, out_channels):
            assert isinstance(mask_roi_pool, (MultiScaleRoIAlign, type(None)))
            if num_classes is not None:
                if mask_predictor is not None:
                    raise ValueError("num_classes should be None when mask_predictor is specified")
            if mask_roi_pool is None:
                mask_roi_pool = MultiScaleRoIAlign(
                    featmap_names=["0", "1", "2", "3"], output_size=14, sampling_ratio=2
                )
            if mask_head is None:
                mask_layers = (256, 256, 256, 256)
                mask_dilation = 1
                mask_head = MaskRCNNHeads(out_channels, mask_layers, mask_dilation)
            if mask_predictor is None:
                mask_predictor_in_channels = 256  # == mask_layers[-1]
                mask_dim_reduced = 256
                mask_predictor = MaskRCNNPredictor(
                    mask_predictor_in_channels, mask_dim_reduced, num_classes
                )
            return mask_head, mask_predictor, mask_roi_pool

@fmassa Thoughts?

EDIT: Alternatively, they could of course also be placed in functions.
