open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Rethinking positional arguments in the API #3939

Open Erotemic opened 3 years ago

Erotemic commented 3 years ago

This is less of a feature request and more a set of observations about issues I've had when working with the mmdet API, along with suggestions for how some of them might be refactored to improve the overall package.

One of the biggest challenges of working with mmdet so far has been its widespread use of positional arguments. This comes in two flavors: function signatures and return values.

The current structure

As an example, consider the forward_train function in base_dense_head.py and its use of the get_bboxes function.

The signature of get_bboxes looks like:

def get_bboxes(self, cls_scores, bbox_preds, img_metas, cfg=None, rescale=False):
    ...

And the head forward function looks somewhat like this:

    def forward(self, x):
        # do stuff to translate backbone features into box+scores
        return cls_scores, bbox_preds

The forward_train function currently looks something like this:

    def forward_train(self,
                      x,
                      img_metas,
                      gt_bboxes,
                      gt_labels=None,
                      gt_bboxes_ignore=None,
                      proposal_cfg=None,
                      **kwargs):
        outs = self(x)
        if gt_labels is None:
            loss_inputs = outs + (gt_bboxes, img_metas)
        else:
            loss_inputs = outs + (gt_bboxes, gt_labels, img_metas)
        losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
        if proposal_cfg is None:
            return losses
        else:
            proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
            return losses, proposal_list

Imagine you want to extend self.forward to return anything other than the tuple Tuple[cls_scores, bbox_preds]. You have to create a custom get_bboxes function whose arguments are ordered to agree with the disparate forward function. Perhaps for some people thinking in terms of ordered items is easy, but for me this is incredibly hard to manage. I would like to suggest an alternative.
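A toy illustration of that coupling (not real mmdet code; the function bodies and return values here are made up) shows how adding one extra output to a head silently breaks every positional consumer:

```python
def forward_two(x):
    # a head that returns the classic pair
    return ('cls_scores', 'bbox_preds')

def forward_three(x):
    # a head that additionally returns keypoint predictions
    return ('cls_scores', 'bbox_preds', 'kpt_preds')

def get_bboxes(cls_scores, bbox_preds, img_metas, cfg=None):
    # stand-in for a positional get_bboxes
    return (cls_scores, bbox_preds)

outs = forward_two(None)
get_bboxes(*outs, img_metas=[])      # works

outs = forward_three(None)
try:
    # the extra positional item collides with img_metas
    get_bboxes(*outs, img_metas=[])
except TypeError:
    pass
```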

The alternative proposal

Imagine if instead of returning a tuple the forward function returned a dictionary where the keys were standardized instead of the positions of the values.

    def forward(self, x):
        # do stuff to translate backbone features into box+scores
        outs = {
            'cls_scores': cls_scores,
            'bbox_preds': bbox_preds,
        }
        return outs

Now the get_bboxes function doesn't need to care which particular head was used. It can simply accept the output dictionary and assert that it contains the particular keys that this variant of get_bboxes needs. (Note this also lets the head produce auxiliary information used in the loss but not in the construction of boxes.)

def get_bboxes(self, output, img_metas, cfg=None, rescale=False):
    # This get_bboxes function requires cls_scores and bbox_preds, but
    # it's OK if your network produces other things as well.
    cls_scores = output['cls_scores']
    bbox_preds = output['bbox_preds']
    # output may contain an item 'keypoint_preds', but we aren't forced
    # to care about it unless we specifically need it.
    ...

We can extend this pattern further, so in addition to the forward function producing a dictionary, the forward_train function will produce a dictionary as well.

    def forward_train(self,
                      x,
                      img_metas,
                      gt_bboxes,
                      gt_labels=None,
                      gt_bboxes_ignore=None,
                      proposal_cfg=None,
                      **kwargs):
        outs = self(x)
        losses = self.loss(outs, gt_bboxes=gt_bboxes, gt_labels=gt_labels,
                           img_metas=img_metas, gt_bboxes_ignore=gt_bboxes_ignore)
        train_outputs = {
            'losses': losses,
        }
        if proposal_cfg is not None:
            proposal_list = self.get_bboxes(outs, img_metas, cfg=proposal_cfg)
            train_outputs['proposal_list'] = proposal_list

        return train_outputs

This has fewer conditionals and a consistent return type.

This means that functions that use forward_train can seamlessly switch between setting proposal_cfg and getting the boxes, or just getting the loss, because the return value has a consistent type and access pattern in both modes. If you do need a conditional, it can be based on the return value instead of having to remember the inputs.
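A sketch of what a caller might look like under this proposal (the stub below only models the return shape; the losses and proposals are dummy values):

```python
def forward_train(proposal_cfg=None):
    # stand-in for the proposed dict-returning forward_train
    train_outputs = {'losses': {'loss_cls': 0.5}}
    if proposal_cfg is not None:
        train_outputs['proposal_list'] = ['proposals']
    return train_outputs

# Both call modes return the same type; the caller branches on the
# return value rather than remembering which mode it asked for.
out = forward_train()
losses = out['losses']
if 'proposal_list' in out:
    proposals = out['proposal_list']
```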

We could go even further and abstract the labels into an argument called truth that should contain the key gt_bboxes and, optionally, gt_labels and gt_bboxes_ignore. That might look like:

    def forward_train(self, x, img_metas, truth, proposal_cfg=None, **kwargs):
        outs = self.forward(x)
        losses = self.loss(outs, img_metas=img_metas, truth=truth)
        train_outputs = {
            'losses': losses,
        }
        if proposal_cfg is not None:
            proposal_list = self.get_bboxes(outs, img_metas, cfg=proposal_cfg)
            train_outputs['proposal_list'] = proposal_list
        return train_outputs

Discussion

IMO this pattern produces much more readable and extensible code. I think having a standard set of keywords is much more extensible than sticking to position-based arguments.

There is a small issue of speed: unpacking dictionaries is slower than unpacking tuples, but I don't think the difference will be noticeable given that every Python attribute lookup is a dictionary lookup anyway.
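The overhead can be eyeballed with a quick timeit check (a micro-benchmark sketch; absolute numbers will vary by machine):

```python
import timeit

tup = (1, 2)
dct = {'cls_scores': 1, 'bbox_preds': 2}

# Time 100k positional vs. keyed accesses of the same two values.
t_tuple = timeit.timeit(lambda: tup[0] + tup[1], number=100_000)
t_dict = timeit.timeit(lambda: dct['cls_scores'] + dct['bbox_preds'],
                       number=100_000)
# The dict lookups are somewhat slower, but the absolute difference is
# tiny compared to the cost of any real forward pass.
```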

This is a rather large API change, but I think the reliance on position-based arguments is stifling further development of new and exciting networks. I think there might be a way to introduce these changes gradually so that a decent level of backwards compatibility is maintained, but I'm not 100% sure of this.

I've played around with variants of this and it works relatively well. The main issue I had was the widespread use of multi_apply, which could likely be changed to assume the containers returned by forward functions are dictionaries instead of tuples.
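A dict-aware variant of multi_apply might look like the sketch below (mmdet's real multi_apply maps a function over per-level inputs and collates the tuple returns; this hypothetical version collates dict returns instead):

```python
from functools import partial

def multi_apply_dict(func, *args, **kwargs):
    """Apply func to each group of per-level inputs and collate the
    per-level result dicts into one dict of lists, keyed by name
    instead of by position."""
    pfunc = partial(func, **kwargs)
    results = [pfunc(*a) for a in zip(*args)]
    return {key: [r[key] for r in results] for key in results[0]}

# usage: a toy per-level function returning a dict
def per_level(feat):
    return {'cls_scores': feat * 2, 'bbox_preds': feat + 1}

out = multi_apply_dict(per_level, [1, 2, 3])
# out == {'cls_scores': [2, 4, 6], 'bbox_preds': [2, 3, 4]}
```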

Summary

In summary, I want to propose replacing position-based APIs with keyword-based APIs. The main purpose of this issue is to gauge the interest of the core devs. If there is interest, I may develop this idea further and look into implementations that are amenable to a smooth transition, so that backwards compatibility is not broken.

hellock commented 3 years ago

Many thanks for your feedback and suggestions! I totally agree with you on this point. Using positional arguments and returning tuples has been a historical issue as we extend mmdet to more methods and tasks. Actually, we had an internal discussion a few months ago and planned to do a large refactoring by the end of this year (not started yet due to the heavy workload of the current maintainers). Here are some things to do:

The last two items are what has blocked us from starting the refactoring, since we would like to make it general enough. Glad that you are willing to work on it; it is definitely a large improvement. We can have further discussions and make the new API real in the next one or two releases.

Erotemic commented 3 years ago

I'm glad the core team is on board with this idea. I'm excited to see what the new API will look like.

As a reference / inspiration for the general InstanceAnnotation or SampleAnnotation, I have a class I currently make heavy use of called kwimage.Detections, which lives here.

The idea is that it is a lightweight wrapper around a data dictionary (that stores boxes, scores, class indexes, etc.) for a set of detections, as well as a meta dictionary (that stores the class-index-to-name mapping). It supports axis-aligned bounding boxes, keypoints, and polygon- and mask-based segmentations, and it can be coerced to/from the COCO JSON format. This likely wouldn't be exactly what you use, but perhaps it can serve as inspiration and offer some lessons learned. One thing I would recommend is to ensure that whatever representation you come up with efficiently stores the set of detections for each image --- i.e. the underlying boxes / scores are stored as contiguous torch.Tensor objects.

On the last point, I think it would be nice if users had the option of passing in simple dictionaries (they are easier and more intuitive to construct than custom objects), which are then coerced to InstanceAnnotation objects. Imagine something like this:

class Annotations(object):
    """
    A class to store multiple annotations / detections in an image
    """
    def __init__(self, bboxes=None, scores=None, labels=None,
                 segmentations=None, keypoints=None, weights=None):
        self._data = {
            'bboxes': bboxes,
            'scores': scores,
            'labels': labels,
            'weights': weights,
            'segmentations': segmentations,
            'keypoints': keypoints,
        }

    @classmethod
    def coerce(cls, data):
        if isinstance(data, cls):
            return data
        elif isinstance(data, dict):
            return cls(**data)
        else:
            raise TypeError(type(data))

    # Custom methods and accessors
    ...

class CustomHead:
    def forward(self, x, img_metas, annotations):
        # If the user passes in a dict, transform it to an Annotations object
        # (this also validates that the correct keywords are used)
        annotations = Annotations.coerce(annotations)

Then, as a user, I could simply pass in:

annotations = {
    'bboxes': torch.rand(100, 4),
}

rather than being forced to build a custom annotation object.
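As a self-contained runnable toy (a condensed version of the sketch above; not mmdet or kwimage code), the coerce pattern fits together like this:

```python
class Annotations:
    """Toy container for per-image annotations."""
    def __init__(self, bboxes=None, scores=None, labels=None):
        self._data = {'bboxes': bboxes, 'scores': scores, 'labels': labels}

    @classmethod
    def coerce(cls, data):
        # Accept either an existing Annotations or a plain dict;
        # the dict path also validates the keyword names via __init__.
        if isinstance(data, cls):
            return data
        elif isinstance(data, dict):
            return cls(**data)
        raise TypeError(type(data))

anns = Annotations.coerce({'bboxes': [[0, 0, 10, 10]]})
same = Annotations.coerce(anns)   # already an Annotations: returned as-is
```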

hellock commented 3 years ago

Also involve @ZwwWayne in this thread.