pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Exception in device=TPU:7: Cannot access data pointer of Tensor that doesn't have storage, while running FasterRCNN on Colab TPU #2064

Closed salilmishra23 closed 4 years ago

salilmishra23 commented 4 years ago

❓ Questions and Help

Thanks for the great package!

I get the following error when trying to train Faster RCNN on TPU. The code works for GPU.

I'm providing the link to the Colab. https://colab.research.google.com/drive/1ShGj4Uq8eFgXE1jqfzH9-v1UkaKKLuHq?usp=sharing

Exception in device=TPU:7: Cannot access data pointer of Tensor that doesn't have storage
Exception in device=TPU:4: Cannot access data pointer of Tensor that doesn't have storage
Exception in device=TPU:1: Cannot access data pointer of Tensor that doesn't have storage
Exception in device=TPU:3: Cannot access data pointer of Tensor that doesn't have storage
Exception in device=TPU:6: Cannot access data pointer of Tensor that doesn't have storage
Exception in device=TPU:5: Cannot access data pointer of Tensor that doesn't have storage
Exception in device=TPU:2: Cannot access data pointer of Tensor that doesn't have storage
Exception in device=TPU:0: Cannot access data pointer of Tensor that doesn't have storage
All eight TPU processes raise the same traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 523, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 913, in run_pretrain_routine
    self.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 419, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 596, in run_training_batch
    loss, batch_output = optimizer_closure()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 560, in optimizer_closure
    output_dict = self.training_forward(split_batch, batch_idx, opt_idx, self.hiddens)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 732, in training_forward
    output = self.model.training_step(*args)
  File "<ipython-input-8-af689e443f7b>", line 51, in training_step
    loss_dict = self.model(images, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/generalized_rcnn.py", line 70, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/rpn.py", line 493, in forward
    boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/rpn.py", line 416, in filter_proposals
    keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/ops/boxes.py", line 76, in batched_nms
    keep = nms(boxes_for_nms, scores, iou_threshold)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/ops/boxes.py", line 36, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
RuntimeError: Cannot access data pointer of Tensor that doesn't have storage
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-10-89d1f2d2d5a9> in <module>()
----> 1 trainer.fit(lit_model)

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders)
    775 
    776             # train
--> 777             xmp.spawn(self.tpu_train, args=(model,), nprocs=self.num_tpu_cores, start_method=start_method)
    778 
    779             # load weights if not interrupted

/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180         join=join,
    181         daemon=daemon,
--> 182         start_method=start_method)

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156 
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 5 terminated with exit code 17
dlibenzi commented 4 years ago

I think the culprit is this:

return torch.ops.torchvision.nms(boxes, scores, iou_threshold)

That op is implemented in C++, and it requires direct access to the tensor's storage, which lazy XLA tensors do not have.
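For reference, the greedy algorithm that C++ op implements is conceptually simple. A minimal pure-Python sketch (not from this thread; boxes as (x1, y1, x2, y2) tuples, returning kept indices in descending score order):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it above the threshold, then repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep
```

The real kernel does the same suppression loop over raw storage pointers, which is exactly what an XLA tensor cannot provide.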

salilmishra23 commented 4 years ago

Is there any workaround or fix for this?

jysohn23 commented 4 years ago

One workaround, I believe, would be to sync the tensors back to CPU (instead of keeping them on the PyTorch/XLA TPU device) and then run nms there.

jysohn23 commented 4 years ago

To add an example:

import torch_xla.core.xla_model as xm  # needed for mark_step below

device = boxes.device  # TPU device the tensors originally live on
xm.mark_step()  # materialize computation results up to NMS
boxes_cpu = boxes.cpu().clone()  # move to CPU from TPU
scores_cpu = scores.cpu().clone()  # ditto
keep = torch.ops.torchvision.nms(boxes_cpu, scores_cpu, iou_threshold)  # runs on CPU
keep = keep.to(device=device)  # send back to TPU so the rest of the model pass runs on TPU

Note that this will likely be slow due to TPU<>CPU communication.
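The steps above can be wrapped in a small helper so each call site stays device-agnostic. A sketch, assuming only that the CPU kernel takes (boxes, scores, iou_threshold); `nms_fn` is passed in (it would be torch.ops.torchvision.nms in the model above) so the sketch stays self-contained, and on TPU you would still call xm.mark_step() first:

```python
import torch

def run_nms_on_cpu(boxes, scores, iou_threshold, nms_fn):
    """Run a CPU-only NMS kernel and move the kept indices back
    to whatever device the inputs came from."""
    device = boxes.device                                     # remember the original device
    keep = nms_fn(boxes.cpu(), scores.cpu(), iou_threshold)   # kernel sees plain CPU tensors
    return keep.to(device)                                    # indices rejoin the model's device
```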

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.