rwightman / efficientdet-pytorch

A PyTorch impl of EfficientDet faithful to the original Google impl w/ ported weights
Apache License 2.0

tpu problem with class values #27

Closed romanoss closed 4 years ago

romanoss commented 4 years ago

Hi, I'm trying to get EfficientDet running on Kaggle TPUs, following Alex Shonenkov's kernel.

I am rather a beginner with Python and PyTorch - sorry...

The model runs fine on GPU - is it possible that there is a problem with num_classes=1?

The model setup code is:

```python
def get_net(imgsize=IMG_SIZE, use_checkpoint=None):
    config = get_efficientdet_config('tf_efficientdet_d4')
    net = EfficientDet(config, pretrained_backbone=False)
    # load the COCO-pretrained weights before swapping in the 1-class head
    checkpoint = torch.load('../input/efficientdet/efficientdet_d4-5b370b7a.pth')
    net.load_state_dict(checkpoint)
    config.num_classes = 1
    config.image_size = imgsize
    net.class_net = HeadNet(config, num_outputs=config.num_classes,
                            norm_kwargs=dict(eps=.001, momentum=.01))
    return DetBenchTrain(net, config)
```
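
For reference, the bench returned by `get_net` is then driven per step roughly like this; the three return values match the `loss, _, _ = self.model(images, boxes, labels)` line in the traceback below (`train_loader` is the kernel's loader, not part of this repo):

```python
net = get_net()
net.train()
for images, boxes, labels in train_loader:  # kernel's loader, assumed defined
    loss, class_loss, box_loss = net(images, boxes, labels)
    loss.backward()
```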

and I launch training with:

```python
def _mp_fn(rank, flags):
    global acc_list
    torch.set_default_tensor_type('torch.FloatTensor')
    a = run_training()

FLAGS = {}
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=1, start_method='fork')
```

The error looks like:

```
Exception in device=TPU:0: Class values must be non-negative.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "", line 8, in _mp_fn
    a = run_training()
  File "", line 76, in run_training
    fitter.fit(train_loader, val_loader)
  File "", line 40, in fit
    summary_loss = self.train_one_epoch(train_loader)
  File "", line 106, in train_one_epoch
    loss, _, _ = self.model(images, boxes, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "../input/timm-efficientdet-pytorch/effdet/bench.py", line 93, in forward
    gt_class_out, gt_box_out, num_positive = self.anchor_labeler.label_anchors(gt_boxes[i], gt_labels[i])
  File "../input/timm-efficientdet-pytorch/effdet/anchors.py", line 343, in label_anchors
    cls_targets, _, box_targets, _, matches = self.target_assigner.assign(anchor_box_list, gt_box_list, gt_labels)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/target_assigner.py", line 140, in assign
    match = self._matcher.match(match_quality_matrix, **params)
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/matcher.py", line 212, in match
    return Match(self._match(similarity_matrix, **params))
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 155, in _match
    return _match_when_rows_are_non_empty()
  File "../input/timm-efficientdet-pytorch/effdet/object_detection/argmax_matcher.py", line 144, in _match_when_rows_are_non_empty
    force_match_column_indicators = one_hot(force_match_column_ids, similarity_matrix.shape[1])
RuntimeError: Class values must be non-negative.
```
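
For what it's worth, that final message is exactly what PyTorch's `torch.nn.functional.one_hot` raises for a negative index, which suggests the forced-match ids coming out of the argmax are going negative under XLA. A minimal repro of just the error:

```python
import torch
import torch.nn.functional as F

ids = torch.tensor([0, 2, -1])   # -1 is a typical "no match" sentinel value
F.one_hot(ids, num_classes=3)    # RuntimeError: Class values must be non-negative.
```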

rwightman commented 4 years ago

@romanoss So you tried the exact same code (same versions of everything) on a GPU with no problem? Not just relying on the results of Alex's or someone else's run, or a different checkout of the code?

I've never tried this with XLA (for TPU). I do know that the anchor assign/matching code doesn't trace/jit properly, so it may have issues with XLA's lazy evaluation (see the toy example below). There's a possibility of a bug, but it seems like such a bug should also show up on the GPU. I likely won't have time to look into this for a while; TPUs aren't part of my normal workflows...
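
(For anyone unfamiliar with the tracing limitation mentioned above, here is a toy illustration, not effdet code: `torch.jit.trace` freezes data-dependent branches like the matcher's row checks, so a traced graph can silently take the wrong path for other input shapes.)

```python
import torch

def toy_match(sim):
    # data-dependent branch, similar in spirit to argmax_matcher's row checks
    if sim.shape[0] > 2:
        return sim.argmax(dim=0)
    return torch.full((sim.shape[1],), -1, dtype=torch.long)  # "no match"

traced = torch.jit.trace(toy_match, torch.rand(3, 5))  # argmax branch is frozen in
print(toy_match(torch.rand(2, 5)))  # eager: tensor([-1, -1, -1, -1, -1])
print(traced(torch.rand(2, 5)))     # traced: argmax indices, the frozen branch
```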

romanoss commented 4 years ago

I ran this code on GPU without problems and added some `if USE_TPU` changes as described in the pytorch-xla docs. I thought you might have an idea, but I'm too much of a beginner to debug this myself, and I understand the priorities of your workflow. Thanks, and keep up the good work :)
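
For anyone following the same path, the usual `if USE_TPU` scaffolding from the pytorch-xla docs looks roughly like this (a sketch only; `USE_TPU`, `optimizer`, and `train_loader` are assumed to be defined elsewhere in the kernel):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

if USE_TPU:
    device = xm.xla_device()                 # the XLA/TPU device for this process
    model = get_net().to(device)
    para_loader = pl.ParallelLoader(train_loader, [device])
    for images, boxes, labels in para_loader.per_device_loader(device):
        loss, _, _ = model(images, boxes, labels)
        loss.backward()
        xm.optimizer_step(optimizer)         # steps the optimizer and syncs TPU cores
        optimizer.zero_grad()
```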

rwightman commented 4 years ago

While I haven't verified that TPU works, the specific sticking point here should no longer be an issue. By default the AnchorLabeler / target_assigner now runs on the DataLoader threads (via my custom collate fn) on the CPU, so that code will no longer be run on the TPU (compiled by XLA) unless bench_labeler is set to True in the config.
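
A rough sketch of that pattern, in case it helps anyone reading along (hypothetical wiring; the real implementation is this repo's collate fn, and `AnchorLabeler` construction and the per-level target shapes are elided/simplified):

```python
import torch
from effdet.anchors import AnchorLabeler  # same class the traceback above goes through

def make_collate_fn(anchor_labeler: AnchorLabeler):
    def collate_fn(batch):
        # runs in the DataLoader worker on CPU, so anchor matching is
        # never part of the graph that XLA compiles for the TPU
        images, boxes, labels = zip(*batch)
        cls_targets, box_targets, num_positives = [], [], []
        for b, l in zip(boxes, labels):
            # per-image labeling, as in bench.py's old per-sample loop
            cls_t, box_t, num_pos = anchor_labeler.label_anchors(b, l)
            cls_targets.append(cls_t)
            box_targets.append(box_t)
            num_positives.append(num_pos)
        return torch.stack(images), cls_targets, box_targets, num_positives
    return collate_fn
```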

That change (along with some others) allowed torchscript to be used on the train bench, so TPU should also be much closer to working (or possibly working already).
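
If anyone wants to check, scripting the bench is just standard TorchScript; whether it succeeds will depend on having a version of the repo that includes the change above:

```python
import torch

bench = get_net()                   # DetBenchTrain, as in the snippet above
scripted = torch.jit.script(bench)  # raised errors on older versions where the
                                    # anchor matcher still lived inside the bench
```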