Closed: romanoss closed this issue 4 years ago.
@romanoss so you tried the exact same code (same versions of everything) on a GPU with no problem? Not just relying on the results of Alex's or someone else's run, or on a different checkout of the code?
I've never tried this with XLA (for TPU). I do know that the anchor assign/matching code doesn't trace/jit properly, so it may have issues with XLA's lazy evaluation. It could be a bug, but it seems like such a bug should also show up on the GPU. I likely won't have time to look into this for a while; TPUs aren't part of my normal workflows...
I ran this code on GPU without problems and added some `if USE_TPU` guards as described for pytorch-xla (roughly as in the sketch below). I thought you might have an idea, but I'm too much of a beginner to debug this myself, and I understand the priority of your workflow. Thanks, and keep up the good work :)
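For context, the gating follows the usual pytorch-xla pattern. This is only a rough sketch rather than my exact notebook code; `USE_TPU`, the AdamW learning rate, and the helper names are placeholders from my setup, and `get_net` is defined further down in this issue.

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

USE_TPU = True

# pick the device once, then move the model onto it as usual
device = xm.xla_device() if USE_TPU else torch.device('cuda')
model = get_net().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

def wrap_loader(loader):
    # on TPU, ParallelLoader pre-loads batches onto the XLA device
    if USE_TPU:
        return pl.ParallelLoader(loader, [device]).per_device_loader(device)
    return loader

def optimizer_step():
    if USE_TPU:
        # xm.optimizer_step marks the step so the pending XLA graph executes
        xm.optimizer_step(optimizer)
    else:
        optimizer.step()
```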
While I haven't verified that TPU works, the specific sticking point here should no longer be an issue. By default the AnchorLabeler / target_assigner now runs on the DataLoader threads (via my custom collate fn) on the CPU, so this code will no longer be run on the TPU (compiled by XLA) unless bench_labeler is set to True in the config.
The above change (along with some others) allowed TorchScript to be used on the train bench, so TPU should also be much closer to working (or possibly already working).
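In rough terms, the difference looks like the sketch below. This is not a tested TPU recipe: the model name is arbitrary, the `bench_labeler` flag is the config option mentioned above, and the exact way the collate fn is wired into your DataLoader depends on your setup.

```python
import torch
from effdet import get_efficientdet_config, EfficientDet, DetBenchTrain

config = get_efficientdet_config('tf_efficientdet_d1')
net = EfficientDet(config, pretrained_backbone=False)

# default: anchor labeling / target assignment happens in the dataloader
# collate fn on the CPU, so the bench never runs the matcher under XLA
bench = DetBenchTrain(net, config)

# opt back in to on-device labeling (the old behaviour) if you really want it
config.bench_labeler = True
bench_on_device = DetBenchTrain(net, config)

# with labeling off the bench, scripting the train bench should now work
scripted_bench = torch.jit.script(bench)
```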
Hi, I'm trying to get EfficientDet running on Kaggle TPUs, following Alex Shonenkov's kernel.
I am rather a beginner with Python and PyTorch - sorry...
The model runs OK on GPU - is it possible that there is a problem with num_classes=1?
The model setup looks like this:
```python
def get_net(imgsize=IMG_SIZE, use_checkpoint=None):
    config = get_efficientdet_config('tf_efficientdet_d4')
    net = EfficientDet(config, pretrained_backbone=False)
    checkpoint = torch.load('../input/efficientdet/efficientdet_d4-5b370b7a.pth')
    net.load_state_dict(checkpoint)
    config.num_classes = 1
    config.image_size = IMG_SIZE
    net.class_net = HeadNet(config, num_outputs=config.num_classes,
                            norm_kwargs=dict(eps=.001, momentum=.01))
    return DetBenchTrain(net, config)
```
and I call it like this:
```python
def _mp_fn(rank, flags):
    global acc_list
    torch.set_default_tensor_type('torch.FloatTensor')
    a = run_training()

FLAGS = {}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=1, start_method='fork')  # 8
```
Error looks like:
```
Exception in device=TPU:0: Class values must be non-negative.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "", line 8, in _mp_fn
    a = run_training()
  File "", line 76, in run_training
    fitter.fit(train_loader, val_loader)
  File "", line 40, in fit
    summary_loss = self.train_one_epoch(train_loader)
  File "", line 106, in train_one_epoch
    loss, _, _ = self.model(images, boxes, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "../input/timm-efficientdet-pytorch/effdet/bench.py", line 93, in forward
    gt_class_out, gt_box_out, num_positive = self.anchor_labeler.label_anchors(gt_boxes[i], gt_labels[i])
  File "../input/timm-efficientdet-pytorch/effdet/anchors.py", line 343, in label_anchors
    cls_targets, _, box_targets, _, matches = self.target_assigner.assign(anchor_box_list, gt_box_list, gt_labels)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/target_assigner.py", line 140, in assign
    match = self._matcher.match(match_quality_matrix, **params)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/matcher.py", line 212, in match
    return Match(self._match(similarity_matrix, **params))
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 155, in _match
    return _match_when_rows_are_non_empty()
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 144, in _match_when_rows_are_non_empty
    force_match_column_indicators = one_hot(force_match_column_ids, similarity_matrix.shape[1])
RuntimeError: Class values must be non-negative.
```
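For what it's worth, the final RuntimeError is the stock message from torch.nn.functional.one_hot, which refuses any negative class index. Assuming the one_hot called in argmax_matcher.py resolves to F.one_hot, a minimal repro in plain PyTorch (no effdet involved) is:

```python
import torch
import torch.nn.functional as F

# force_match_column_ids comes from an argmax over the similarity matrix,
# so on CPU/GPU it should always be >= 0; any negative index passed to
# one_hot produces exactly the error seen above
ids = torch.tensor([2, 0, -1])
F.one_hot(ids, num_classes=4)
# RuntimeError: Class values must be non-negative.
```

If that is what is happening here, it would fit the XLA lazy-evaluation suspicion above, since an argmax result should never be negative on its own.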