Hi @Yinyf0804 , are you using coco dataset or your own dataset?
Hi @lkevinzc, i am using coco dataset.
Hi @Yinyf0804 , I have met a similar problem on other datasets, but not COCO. The reason for this issue is that the prepared ground truth is not in the correct format (e.g. the mask contour vertices may not be in order). You could just modify the code with a try ... except ... to skip such examples during training. This should be the easiest way to fix the issue.
Hi @lkevinzc , could you give me an example of where and how to place the try ... except ...?
Hi @Yinyf0804 , maybe you can try to put a try ... except ... around this line: https://github.com/lkevinzc/dance/blob/master/core/modeling/dsnake_baseline/af_two_stage.py#L84
If the same error occurs, just skip merging the head_losses into the proposal_losses.
Please have a try and see if this can make the training run smoothly :)
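Roughly, the change could look like this (a sketch only; the exact call at that line and the names refine_head / head_losses / proposal_losses may differ in the actual code, so please adapt it):

try:
    # the snake head can fail on malformed ground truth (e.g. unordered contour vertices)
    _, head_losses = self.refine_head(features, proposals, targets)
except RuntimeError:
    # skip this example's head loss instead of crashing the whole run
    head_losses = {}
proposal_losses.update(head_losses)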
Besides, may I know what batch size you are using?
@lkevinzc Thanks for your advice. I use a batch size of 4 with 2 GPUs.
@lkevinzc Besides, how can I run two training processes on one machine? Since the dist_url is fixed, this error occurs:
RuntimeError: Address already in use
@Yinyf0804 You are welcome :)
For running two training jobs, you may take a look at this: https://github.com/facebookresearch/detectron2/issues/91
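For example, assuming train_net.py here uses detectron2's default argument parser (which exposes a --dist-url flag), passing "auto" lets each job pick a free port, and you can give each run its own OUTPUT_DIR via the trailing config overrides:

CUDA_VISIBLE_DEVICES=0,1 python train_net.py --num-gpus 2 --config-file configs/Dsnake_R_50_1x.yaml --dist-url auto OUTPUT_DIR output/run1
CUDA_VISIBLE_DEVICES=2,3 python train_net.py --num-gpus 2 --config-file configs/Dsnake_R_50_1x.yaml --dist-url auto OUTPUT_DIR output/run2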
Hi @lkevinzc, another error occurred when I ran the deepsnake baseline:
FloatingPointError: Loss became infinite or NaN at iteration=43914!
loss_dict = {'loss_fcos_cls': tensor(1.0541, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_loc': tensor(0.4737, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_ctr': tensor(0.6168, device='cuda:0', grad_fn=<DivBackward0>), 'loss_evolve': tensor(0.5679, device='cuda:0', grad_fn=<MulBackward0>), 'loss_init': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>)}
How can I deal with it?
Hi @Yinyf0804 , actually I haven't met such an error when running the snake baseline code. Could you check the tensorboard logs and find out when the loss_init starts to become unstable and surge?
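If you just want training to keep going while you investigate, one option (in the same spirit as the try ... except ... above, and only a sketch I have not tested on this repo) is to drop non-finite loss terms for the affected iteration before summing them:

import torch

def drop_non_finite_losses(loss_dict):
    # keep only the finite loss terms (e.g. skip 'loss_init' when it turns NaN)
    clean = {}
    for name, value in loss_dict.items():
        if torch.isfinite(value).all():
            clean[name] = value
        else:
            print(f"warning: {name} is non-finite at this iteration, skipping it")
    return clean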
Hi @lkevinzc, how did you set the batch size and the number of GPUs when you ran the snake baseline code?
Hi @Yinyf0804 , you could modify the configuration (configs/Dsnake_R_50_1x.yaml) to overwrite the _BASE_ settings:
SOLVER:
  IMS_PER_BATCH: 8  # your batch size
  BASE_LR: 0.005  # change linearly with the batch size, larger bs needs larger lr
  STEPS: (120000, 160000)  # can also linearly change the steps and max_iter as well
  MAX_ITER: 180000
  CHECKPOINT_PERIOD: 20000
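Just to illustrate the linear scaling mentioned in the comments (an example, not a setting I have verified): if you doubled the batch size to 16, you would roughly use

SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.01
  STEPS: (60000, 80000)
  MAX_ITER: 90000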
Hi @lkevinzc , I ran the baseline code with this setting on 2 GPUs, but only got 26.9 mAP on COCO val.
Could you give me some advice? Thanks!
Hi @lkevinzc, my config is:
Base-dsnake.yaml:
MODEL:
  META_ARCHITECTURE: "FcosSnake"
  MASK_ON: True
  BACKBONE:
    NAME: "build_fcos_resnet_fpn_backbone"
  RESNETS:
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
  FPN:
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
  PROPOSAL_GENERATOR:
    NAME: "FCOS"
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
SOLVER:
  IMS_PER_BATCH: 8  # 2 GPUs
  BASE_LR: 0.005  # Note that RetinaNet uses a different default learning rate
  STEPS: (120000, 160000)
  MAX_ITER: 180000
  CHECKPOINT_PERIOD: 10000
INPUT:
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
VERSION: 2
And I use this yaml:
BASE_: "Base-dsnake.yaml" MODEL: WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl" RESNETS: DEPTH: 50 EDGE_HEAD: CONVS_DIM: 256
Hi @Yinyf0804 , sorry for the late reply. Since it was quite long ago, I can't remember exactly where I put my files. I am searching my training history and will update you once I find it. :)
Hi @lkevinzc , thanks a lot! By the way, could you give me some advice on how to set the configurations for M1-M3 in Table 1 of the paper?
Hi @Yinyf0804 , is it okay to email me your contact (e.g. Telegram, WeChat, or others)? We can discuss it more efficiently, and I will try to provide you with some of my draft files.
lkevinzc@gmail.com
Thanks for your work and code! An error occurs when I am running the baseline snake model using the command:
CUDA_VISIBLE_DEVICES=0,1 python train_net.py --num-gpus 2 --config-file configs/Dsnake_R_50_1x.yaml
File "/data/yinyf/dance/core/modeling/dsnake_baseline/dsnakehead.py", line 190, in forward , losses = self.refine_head(features["p2"], None, targets) File "/home/fengh/miniconda3/envs/dance/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call result = self.forward(*input, **kwargs) File "/data/yinyf/dance/core/modeling/dsnake_baseline/dsnake_head.py", line 771, in forward training_targets = self.compute_targets_for_polys(targets) File "/data/yinyf/dance/core/modeling/dsnake_baseline/dsnake_head.py", line 638, in compute_targets_for_polys init_sample_locations = torch.stack(init_sample_locations, dim=0) RuntimeError: stack expects a non-empty TensorList
How can I deal with it?