pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Torchvision Object detection TPU Support #2486

Open oke-aditya opened 4 years ago

oke-aditya commented 4 years ago

❓ Torchvision object detection models with TPU.

This falls somewhere between a feature request and a question, hence I am posting it here.

PyTorch supports TPUs through torch_xla, which makes it possible to train models on TPU. I guess most torchvision classification models can be trained or fine-tuned (transfer learning) on TPU.

Do the torchvision object detection models support TPU? Some operations such as NMS, the RPN, and roi_align do not seem to support TPU, and I get the error below.

I was trying the Faster R-CNN ResNet-50 FPN model for object detection.

  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/generalized_rcnn.py", line 70, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/rpn.py", line 493, in forward
    boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/rpn.py", line 416, in filter_proposals
    keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
RuntimeError: Cannot access data pointer of Tensor that doesn't have storage

My questions, concerns, and feature requests:

  1. Do torchvision object detection models support TPU training?
  2. Are there any plans for TPU support for these models in future releases?
  3. Are these ops implemented only for CPU and CUDA? Is there a workaround to train object detection / segmentation models on TPU?

oke-aditya commented 4 years ago

@pmeier I added this to the discussion here. It may need some time and more thorough thought.

oke-aditya commented 4 years ago

@pmeier Any thoughts or updates? I think this would be an important feature addition. It would allow training future object detection, keypoint detection, and semantic segmentation models on TPU.

fmassa commented 4 years ago

Hi,

I don't think that we currently support TPUs for the detection models. I believe part of the difficulty lies in the fact that these models have dynamic shapes, which are not very well-suited for TPUs. Additionally, we also have custom ops in torchvision for those models, which I don't think have a direct TPU mapping.

@ailzhang can you chime in with more details?
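To make the dynamic-shape point concrete, here is a minimal plain-Python sketch of greedy NMS. It is not torchvision's C++/CUDA kernel; `iou` and `greedy_nms` are illustrative names, and the boxes are made up. It only shows that two inputs of identical shape can yield a different number of kept boxes, which is the kind of data-dependent output size that a compiler working with static shapes handles poorly.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if its IoU with every already-kept box is within the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Same number of input boxes, different overlap patterns.
crowded = [(0, 0, 10, 10), (1, 0, 11, 10), (2, 0, 12, 10)]
spread = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7]

keep_crowded = greedy_nms(crowded, scores, iou_threshold=0.5)
keep_spread = greedy_nms(spread, scores, iou_threshold=0.5)
print(len(keep_crowded), len(keep_spread))
```

Both calls receive three boxes, yet the length of the result differs, so the shape of everything downstream of NMS depends on the image content.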

ailzhang commented 4 years ago

Hi, for the custom ops implemented in torchvision with only CPU and CUDA implementations (rather than PyTorch native ops), we currently support them as custom ops in pytorch/xla as well, upon request. For example, we added nms support (https://github.com/pytorch/xla/blob/d5b0b4e077496bb5cfaf823cc07f6f371b1a2af6/torch_xla/core/functions.py#L87) upon user request. Feel free to open feature requests in pytorch/xla for other ops, and we'll put them on our todo list to implement.
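The internals of the linked NMS are not shown in this thread. As a rough illustration only, one common way to give NMS a static output shape is to always return a fixed number of indices, padded with a sentinel, plus a count of valid entries. The sketch below is plain Python under that assumption; `padded_nms`, its `output_size` parameter, and the `-1` sentinel are hypothetical here and are not claimed to match the torch_xla function.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def padded_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float, output_size: int) -> Tuple[List[int], int]:
    """Greedy NMS that always returns exactly `output_size` indices.

    Unused slots are filled with -1; the second return value is the
    number of valid (non-padding) entries.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if len(keep) == output_size:
            break  # Truncate: the output can never exceed the fixed size.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    num_valid = len(keep)
    return keep + [-1] * (output_size - num_valid), num_valid


crowded = [(0, 0, 10, 10), (1, 0, 11, 10), (2, 0, 12, 10)]
spread = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7]

padded_crowded = padded_nms(crowded, scores, iou_threshold=0.5, output_size=4)
padded_spread = padded_nms(spread, scores, iou_threshold=0.5, output_size=4)
print(padded_crowded, padded_spread)
```

The trade-off in this design is that downstream code has to mask out the padding entries, and a small `output_size` silently drops lower-scoring boxes.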

oke-aditya commented 4 years ago

As @fmassa pointed out, these models might not be well suited to TPUs because of their dynamic shapes. I have a few doubts though.

Assume the user has fixed-size images (through transforms or preprocessing) and feeds them as input to the detection models. Would that still be unsuitable for TPUs?

Is it because torchvision uses GeneralizedRCNNTransform() for detection models?
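For context on the two questions above, the following plain-Python sketch approximates how the detection transform is understood to pick image sizes: each image is rescaled by a factor derived from a minimum and maximum side length, and the batch is then padded to a common size. The default values (800, 1333, and a divisibility of 32) and the rounding are assumptions based on the Faster R-CNN defaults and may differ between torchvision versions; `resized_shape` and `padded_batch_shape` are hypothetical helper names, not torchvision functions.

```python
import math
from typing import List, Tuple


def resized_shape(height: int, width: int,
                  min_size: int = 800, max_size: int = 1333) -> Tuple[int, int]:
    """Approximate (height, width) after aspect-preserving resize."""
    # Scale the short side up to min_size unless the long side would exceed max_size.
    scale = min(min_size / min(height, width), max_size / max(height, width))
    return math.floor(height * scale), math.floor(width * scale)


def padded_batch_shape(shapes: List[Tuple[int, int]],
                       size_divisible: int = 32) -> Tuple[int, int]:
    """Common (height, width) a batch is padded to, rounded up to a multiple."""
    max_h = max(h for h, _ in shapes)
    max_w = max(w for _, w in shapes)

    def round_up(x: int) -> int:
        return math.ceil(x / size_divisible) * size_divisible

    return round_up(max_h), round_up(max_w)


# Mixed aspect ratios give different per-image shapes after resizing.
mixed = [resized_shape(480, 640), resized_shape(600, 600)]
# Identical input sizes give identical shapes after resizing.
fixed = [resized_shape(512, 512), resized_shape(512, 512)]

print(mixed, padded_batch_shape(mixed))
print(fixed, padded_batch_shape(fixed))
```

Under these assumptions, fixed-size inputs do produce a fixed tensor shape after the transform. The number of proposals that survive NMS can still vary from image to image, so fixing the input size alone would not remove every dynamic shape in the model.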

Would it be advantageous and feasible to support the additional torchvision ops needed for detection in XLA? Can we add them to enable TPU support?

I guess box_ops.batched_nms is currently not supported, which is why I'm getting the error.

ofekp commented 4 years ago

I managed to swap box_ops.batched_nms for nms from torch_xla, but then ran into roi_align, which also has no support on XLA devices and is not implemented in torch_xla either 🤦‍♂️ This is the implementation for roi_align in torchvision. How hard would it be to implement it for XLA?
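The exact substitution used above is not shown in the thread. As a sketch of how a batched NMS can be layered on top of any plain NMS, torchvision's batched_nms is understood to use a coordinate-offset trick: boxes of each category (here, each FPN level) are shifted far enough apart that boxes from different categories never overlap, and a single NMS pass is run. The code below is plain Python with made-up boxes; `plain_nms` stands in for whichever NMS is available and is not the torch_xla call, and the `nms_fn` parameter is hypothetical.

```python
from typing import Callable, List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def plain_nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Stand-in for a category-agnostic NMS; returns kept indices by score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def batched_nms(boxes: List[Box], scores: List[float], idxs: List[int],
                iou_threshold: float,
                nms_fn: Callable[[List[Box], List[float], float], List[int]] = plain_nms
                ) -> List[int]:
    """Run NMS independently per category using a coordinate offset."""
    if not boxes:
        return []
    # Shift each category by more than the largest coordinate so that
    # boxes from different categories can never intersect.
    max_coord = max(max(b) for b in boxes)
    shifted = [tuple(c + idx * (max_coord + 1) for c in b)
               for b, idx in zip(boxes, idxs)]
    return nms_fn(shifted, scores, iou_threshold)


# Three identical boxes: two on level 0, one on level 1.
boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7]
levels = [0, 1, 0]

kept = batched_nms(boxes, scores, levels, iou_threshold=0.5)
print(kept)
```

The duplicate on the same level is suppressed while the identical box on the other level survives, which is the per-level behaviour the RPN relies on when it calls batched_nms in filter_proposals.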

oke-aditya commented 4 years ago

@ofekp

I guess the best option would be to raise an issue with the XLA team about this. Maybe they will add these ops to XLA.

I'm still unsure how much it would help in training, given what fmassa pointed out.

ofekp commented 4 years ago

@oke-aditya I opened an issue in pytorch/xla - https://github.com/pytorch/xla/issues/2487.

simonm3 commented 3 years ago

This would be useful not just for training but also for inference on the Google Coral TPU. There is an XLA version of nms, but how does it work with torchvision? I tried to adapt a forked version of torchvision but was unable to get it to work - it just hangs.