Sparse masks for torchvision maskrcnn - useful for training on big images with small objects

🚀 Feature

Hello everyone,

To make GPU memory use more efficient, I propose allowing sparse masks in the torchvision Mask R-CNN implementation.

Motivation

For tasks that involve predicting many small objects in large images, it becomes increasingly painful to store a large, extremely sparse mask for each object. For example, I may have images of 8k x 8k pixels containing between 1 and 100 objects of interest, each around 40 x 40 pixels. In this case, the current dataloader tutorial creates a dense mask of size 8k x 8k x N_objects by default, which is extremely sparse (<<1% nonzero) but takes a lot of memory.
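To put rough numbers on this, here is a back-of-the-envelope calculation. It is only a sketch under assumptions that are not stated above: "8k" is taken as 8192, masks are stored as uint8, every object completely fills a 40 x 40 box, and a sparse COO layout stores one int64 index per dimension plus the value for each nonzero pixel.

```python
# Assumptions (illustrative only): 8k = 8192, uint8 masks, each object fills
# a full 40 x 40 box, COO storage = 3 int64 indices + 1 uint8 value per nonzero.
H = W = 8192
N_OBJECTS = 100
OBJ_SIDE = 40

# Dense stack of masks: one byte per pixel per object.
dense_bytes = H * W * N_OBJECTS * 1

# Sparse stack: only nonzero pixels are stored, each with its coordinates.
nnz = N_OBJECTS * OBJ_SIDE * OBJ_SIDE
sparse_bytes = nnz * (3 * 8 + 1)

# Fraction of entries in the dense stack that are nonzero.
fill_fraction = nnz / (H * W * N_OBJECTS)

print(f"dense:  {dense_bytes / 2**30:.2f} GiB")
print(f"sparse: {sparse_bytes / 2**20:.2f} MiB")
print(f"nonzero fraction: {fill_fraction:.6%}")
```

Under these assumptions the dense stack is on the order of gigabytes per image, while the sparse representation is on the order of megabytes.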

Having this feature would facilitate training mask rcnns on much larger images in this scenario.

Pitch

Allow masks to be passed as torch sparse tensors, in addition to the usual dense tensors.

I think the only change needed is in how the Mask R-CNN loss is defined, so that it accepts either dense or sparse masks, potentially with an if/else on the mask type. https://github.com/pytorch/vision/blob/7d52be76c8eaf02b12338afe0822396ab3547fe2/torchvision/models/detection/roi_heads.py#L101
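A rough sketch of what such a branch could look like. This is not torchvision code: `crop_mask_to_box` is a hypothetical helper, and it assumes integer box coordinates and 2D masks. It also assumes, as far as I can tell from the linked code, that the mask loss only needs the part of each ground-truth mask that lies inside its matched box, so a sparse mask would only have to be densified over that small window.

```python
import torch


def crop_mask_to_box(mask, box):
    """Return a dense crop of a 2D mask (dense or sparse COO) inside `box`.

    Hypothetical helper for illustration; `box` is (x0, y0, x1, y1) in
    integer pixel coordinates, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    if not mask.is_sparse:
        # Dense path: plain slicing.
        return mask[y0:y1, x0:x1]

    # Sparse path: keep only the nonzero entries that fall inside the box,
    # then scatter them into a small dense tensor of the box size.
    mask = mask.coalesce()
    rows, cols = mask.indices()
    vals = mask.values()
    keep = (rows >= y0) & (rows < y1) & (cols >= x0) & (cols < x1)
    crop = torch.zeros((y1 - y0, x1 - x0), dtype=vals.dtype, device=vals.device)
    crop[rows[keep] - y0, cols[keep] - x0] = vals[keep]
    return crop


# Small usage example: one 40 x 40 object in a 512 x 512 image.
dense = torch.zeros(512, 512)
dense[100:140, 200:240] = 1.0
sparse = dense.to_sparse()

box = (190, 90, 250, 150)
same = torch.equal(crop_mask_to_box(dense, box), crop_mask_to_box(sparse, box))
print(same)
```

The resulting crop could then be resized to the mask head's output resolution in the same way as a dense mask. Whether this is simpler than adding sparse support to the existing crop-and-resize step is an open question.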

Alternatives

The current alternative is to cut a big image into many smaller ones so that they fit in GPU memory, but this is suboptimal when the objects of interest are rare and we want to include as many hard negatives as possible in addition to the positives.

Additional context

N/A

cc @datumbox
