Problem when training code for center net only

zxk19981227 commented 2 years ago

When I trained this model, the following error was raised:

    -- Process 1 terminated with the following error:
    Traceback (most recent call last):
      File "/home/zhouxukun/miniconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/data1/zhouxukun/fcsgg/detectron2/detectron2/engine/launch.py", line 94, in _distributed_worker
        main_func(*args)
      File "/data1/zhouxukun/fcsgg/tools/train_net.py", line 148, in main
        return trainer.train()
      File "/data1/zhouxukun/fcsgg/detectron2/detectron2/engine/defaults.py", line 410, in train
        super().train(self.start_iter, self.max_iter)
      File "/data1/zhouxukun/fcsgg/detectron2/detectron2/engine/train_loop.py", line 142, in train
        self.run_step()
      File "/data1/zhouxukun/fcsgg/detectron2/detectron2/engine/train_loop.py", line 235, in run_step
        loss_dict = self.model(data)
      File "/home/zhouxukun/miniconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/zhouxukun/miniconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "/home/zhouxukun/miniconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/data1/zhouxukun/fcsgg/fcsgg/modeling/meta_arch/onestage_detector.py", line 297, in forward
        self.preprocess_gt(gt_scene_graphs, images.tensor.shape[-2:], image_ids)
      File "/data1/zhouxukun/fcsgg/fcsgg/modeling/meta_arch/onestage_detector.py", line 261, in preprocess_gt
        gt_scene_graphs[i] = self.gt_gen(x, image_size, image_id, training=self.training)
      File "/data1/zhouxukun/fcsgg/fcsgg/data/detection_utils.py", line 639, in __call__
        training=training)
      File "/data1/zhouxukun/fcsgg/fcsgg/data/detection_utils.py", line 540, in generate_gt_scale
        range_side = torch.tensor(size_range)
    RuntimeError: Could not infer dtype of NoneType

How can I solve it?

zxk19981227 commented 1 year ago

Also, when I tried a different scale setting, for example with no output scale, the model tends to produce NaN values at around iteration 12. Are there any suggestions for fixing this?

zxk19981227 commented 1 year ago

Also, what is the function check_image_size for? I tried to train the model, but it failed several times in this function.

liuhengyue commented 1 year ago

Please attach the config you were using. The loss usually becomes NaN because the learning rate is set too high. For check_image_size, you may need to look at detectron2; as far as I remember, it checks whether the image size (tensor shape) matches what is defined in the dataset dict.
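
To illustrate the kind of check described here, below is a minimal sketch in plain Python. It is not detectron2's actual code: `check_size_matches_record` and `SizeMismatchError` are hypothetical names used only for this example, and the record is assumed to be a dict that may carry "width" and "height" keys.

```python
class SizeMismatchError(ValueError):
    """Raised when a loaded image disagrees with its dataset record."""


def check_size_matches_record(record, image_shape):
    """Compare a loaded image's size with the size stored in its record.

    record:      dict that may contain "width" and/or "height" (hypothetical layout)
    image_shape: shape of the loaded image, (H, W) or (H, W, C)
    """
    actual = {"height": image_shape[0], "width": image_shape[1]}

    # First pass: every size key present in the record must match the image.
    for key, value in actual.items():
        if key in record and record[key] != value:
            raise SizeMismatchError(
                f"{key} mismatch: record says {record[key]}, image has {value}"
            )

    # Second pass: fill in any missing size keys from the image itself.
    for key, value in actual.items():
        record.setdefault(key, value)
    return record
```

If a check like this fails during training, the likely place to look is the annotation metadata (for example, swapped width and height) rather than the model code.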

zxk19981227 commented 1 year ago

The config file is Base-CenterNet.yaml

liuhengyue commented 1 year ago

You could try setting cfg.INPUT.GT_SCALE_AWARE to False.
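
A rough sketch of why this flag could matter, under the assumption (not confirmed in this thread) that the scale-aware ground-truth path is the one that reads the size range which ends up as None in the traceback. The function `build_scale_ranges` and its parameters are hypothetical stand-ins, not fcsgg's API; the sketch only shows that disabling the flag skips the size range entirely, while an explicit guard gives a clearer error than passing None on to torch.tensor.

```python
from typing import List, Optional, Sequence, Tuple


def build_scale_ranges(
    gt_scale_aware: bool,
    size_range: Optional[Sequence[Tuple[float, float]]],
) -> List[Tuple[float, float]]:
    """Return per-scale (low, high) object-size ranges, or [] when disabled.

    Hypothetical helper: mirrors the idea of a scale-aware ground-truth
    generator that needs a configured size range.
    """
    if not gt_scale_aware:
        # Scale-aware ground truth is off, so the size range is never read.
        return []
    if size_range is None:
        # Fail early with a readable message instead of a dtype error later.
        raise ValueError(
            "scale-aware ground truth is enabled but no size range is configured"
        )
    return [(float(low), float(high)) for low, high in size_range]
```

With a yacs-style config object, a key like this can typically be overridden in code with `cfg.merge_from_list(["INPUT.GT_SCALE_AWARE", False])` before the trainer is built; whether the training script here exposes that hook is an assumption.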

liuhengyue / fcsgg

Problem when training code for center net only #10