Hi @Yinyf0804 , are you using coco dataset or your own dataset?
Hi @lkevinzc, i am using coco dataset.
Hi @Yinyf0804 , I have met a similar problem on other datasets, but not COCO. The reason for this issue is that the prepared ground truth is not in the correct format (e.g. the mask contour vertices may not be in order). You could just modify the code with a try ... except ... to skip such examples during training. This should be the easiest way to fix the issue.
Hi @lkevinzc , could you give me an example of where and how to place the try ... except ...?
Hi @Yinyf0804 , maybe you can try to put a try ... except ... around this line: https://github.com/lkevinzc/dance/blob/master/core/modeling/dsnake_baseline/af_two_stage.py#L84
If the same error occurs, just skip merging the head_losses into the proposal_losses.
Please have a try and see if this can make the training run smoothly :)
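Roughly, the change could look like this (a sketch only; the exact call at that line and the names refine_head / head_losses / proposal_losses may differ in the actual code, so please adapt it):

try:
    # the snake head can fail on malformed ground truth (e.g. unordered contour vertices)
    _, head_losses = self.refine_head(features, proposals, targets)
except RuntimeError:
    # skip this example's head loss instead of crashing the whole run
    head_losses = {}
proposal_losses.update(head_losses)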
Besides, may I know what batch size you are using?
@lkevinzc Thanks for your advice. I use a batch size of 4 with 2 GPUs.
@lkevinzc Besides, how can I run two training processes on one machine? Since the dist_url is fixed, this error occurs:
RuntimeError: Address already in use
@Yinyf0804 You are welcome :)
For running two training jobs, you may take a look at this: https://github.com/facebookresearch/detectron2/issues/91
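For example, assuming train_net.py here uses detectron2's default argument parser (which exposes a --dist-url flag), passing "auto" lets each job pick a free port, and you can give each run its own OUTPUT_DIR via the trailing config overrides:

CUDA_VISIBLE_DEVICES=0,1 python train_net.py --num-gpus 2 --config-file configs/Dsnake_R_50_1x.yaml --dist-url auto OUTPUT_DIR output/run1
CUDA_VISIBLE_DEVICES=2,3 python train_net.py --num-gpus 2 --config-file configs/Dsnake_R_50_1x.yaml --dist-url auto OUTPUT_DIR output/run2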
Hi @lkevinzc, another error occurred when I ran the deepsnake baseline:
FloatingPointError: Loss became infinite or NaN at iteration=43914!
loss_dict = {'loss_fcos_cls': tensor(1.0541, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_loc': tensor(0.4737, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_ctr': tensor(0.6168, device='cuda:0', grad_fn=<DivBackward0>), 'loss_evolve': tensor(0.5679, device='cuda:0', grad_fn=<MulBackward0>), 'loss_init': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>)}
How can I deal with it?
Hi @Yinyf0804 , actually I haven't met such an error when running the snake baseline code. Could you check the tensorboard logs and find out when the loss_init starts to become unstable and surge?
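If you just want training to keep going while you investigate, one option (in the same spirit as the try ... except ... above, and only a sketch I have not tested on this repo) is to drop non-finite loss terms for the affected iteration before summing them:

import torch

def drop_non_finite_losses(loss_dict):
    # keep only the finite loss terms (e.g. skip 'loss_init' when it turns NaN)
    clean = {}
    for name, value in loss_dict.items():
        if torch.isfinite(value).all():
            clean[name] = value
        else:
            print(f"warning: {name} is non-finite at this iteration, skipping it")
    return clean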
Hi @lkevinzc, how did you set the batch size and the number of GPUs when you ran the snake baseline code?
Hi @Yinyf0804 , you could modify the configuration (configs/Dsnake_R_50_1x.yaml) to overwrite the _BASE_ settings:
SOLVER:
  IMS_PER_BATCH: 8  # your batch size
  BASE_LR: 0.005  # change linearly with the batch size, larger bs needs larger lr
  STEPS: (120000, 160000)  # can also linearly change the steps and max_iter as well
  MAX_ITER: 180000
  CHECKPOINT_PERIOD: 20000
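Just to illustrate the linear scaling mentioned in the comments (an example, not a setting I have verified): if you doubled the batch size to 16, you would roughly use

SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.01
  STEPS: (60000, 80000)
  MAX_ITER: 90000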
Hi @lkevinzc , I ran the baseline code with this setting on 2 GPUs, but only got 26.9 mAP on COCO val.
Could you give me some advice? Thanks!
Hi @lkevinzc, my config is:
Base-dsnake.yaml:
MODEL:
  META_ARCHITECTURE: "FcosSnake"
  MASK_ON: True
  BACKBONE:
    NAME: "build_fcos_resnet_fpn_backbone"
  RESNETS:
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
  FPN:
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
  PROPOSAL_GENERATOR:
    NAME: "FCOS"
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
SOLVER:
  IMS_PER_BATCH: 8  # 2 GPUs
  BASE_LR: 0.005  # Note that RetinaNet uses a different default learning rate
  STEPS: (120000, 160000)
  MAX_ITER: 180000
  CHECKPOINT_PERIOD: 10000
INPUT:
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
VERSION: 2
And I use this yaml:
BASE_: "Base-dsnake.yaml" MODEL: WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl" RESNETS: DEPTH: 50 EDGE_HEAD: CONVS_DIM: 256
Hi @Yinyf0804 , sorry for the late reply. Since it was quite long ago, I can't remember exactly where I put my files. I am searching my training history and will update you once I find it. :)
Hi @lkevinzc , thanks a lot! By the way, could you give me some advice on how to set the configurations for M1-M3 in Table 1 of the paper?
Hi @Yinyf0804 , is it okay to email me your contact (e.g. Telegram, WeChat, or others)? We can discuss it more efficiently, and I will try to provide you with some of my draft files.
lkevinzc@gmail.com
Thanks for your work and code! An error occurs when I am running the baseline snake model using the command:
CUDA_VISIBLE_DEVICES=0,1 python train_net.py --num-gpus 2 --config-file configs/Dsnake_R_50_1x.yaml
File "/data/yinyf/dance/core/modeling/dsnake_baseline/dsnakehead.py", line 190, in forward , losses = self.refine_head(features["p2"], None, targets) File "/home/fengh/miniconda3/envs/dance/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call result = self.forward(*input, **kwargs) File "/data/yinyf/dance/core/modeling/dsnake_baseline/dsnake_head.py", line 771, in forward training_targets = self.compute_targets_for_polys(targets) File "/data/yinyf/dance/core/modeling/dsnake_baseline/dsnake_head.py", line 638, in compute_targets_for_polys init_sample_locations = torch.stack(init_sample_locations, dim=0) RuntimeError: stack expects a non-empty TensorList
How can I deal with it?