irfanICMLL / structure_knowledge_distillation

The official code for the paper 'Structured Knowledge Distillation for Semantic Segmentation' (CVPR 2019 Oral) and its extension to other tasks.
BSD 2-Clause "Simplified" License

RuntimeError: value cannot be converted to type float without overflow #10

Open txfs1926 opened 4 years ago

txfs1926 commented 4 years ago

Hi, thanks for your great work! I tried running the training script to reproduce the result and hit an error at step 40000 (i.e. the last step).

INFO     [val 512,512] mean_IU:0.685917  IU_array:[0.96660748 0.75127033 0.89316559 0.47657077 0.52247761 0.55418666
 0.61232051 0.69269029 0.89863761 0.62701722 0.91556037 0.746619
 0.50167771 0.92389765 0.5565986  0.66125612 0.67333135 0.35126003
 0.70728074]
INFO     step:40000 G_lr:0.000000 G_loss:218.70508(mc:0.19767 pixelwise:217.88451 pairwise:0.00256) D_lr:0.000000 D_loss:0.06917
Traceback (most recent call last):
  File "train_and_eval.py", line 31, in <module>
    model.optimize_parameters()
  File "/segmentation/structure_knowledge_distillation/networks/kd_model.py", line 171, in optimize_parameters
    self.G_solver.step()
  File "/anaconda3/lib/python3.6/site-packages/torch/optim/sgd.py", line 106, in step
    p.data.add_(-group['lr'], d_p)
RuntimeError: value cannot be converted to type float without overflow: (6.86045e-07,-2.22909e-07)

I made two changes to the training scripts. The first is to train the student model without loading the ImageNet pre-trained weights. The second is to import InPlaceABN directly from the inplace_abn package instead of the libs directory, to make this project compatible with PyTorch v1.0 and above. Here is the edited shell script:

is_pi_use=True
is_pa_use=True
is_ho_use=True
lambda_pi=10.0
lambda_d=0.1

# start kd from 0 step with loading the pretrain imgnet model on student 
CUDA_VISIBLE_DEVICES='3' python -m torch.distributed.launch --nproc_per_node 1 train_and_eval.py \
    --gpu 0 \
    --parallel False \
    --random-mirror \
    --random-scale \
    --weight-decay 5e-4 \
    --data-dir '/Datasets/cityscapes' \
    --batch-size 8 \
    --num-steps 40000 \
    --is-student-load-imgnet False \
    --S_resume False \
    --T_ckpt_path 'Teacher_city.pth' \
    --student-pretrain-model-imgnet ./dataset/resnet18-imagenet.pth \
    --pi ${is_pi_use} \
    --pa ${is_pa_use} \
    --ho ${is_ho_use} \
    --lambda-pa 0.5 \
    --pool-scale 0.5 \
    --lambda-pi ${lambda_pi} \
    --lambda-d ${lambda_d} \

Could you help me to solve this problem? Thanks!

aachenhang commented 4 years ago

I got a runtime error at the last step, too. But I think your snapshot had already been saved by then, so the runtime error is harmless.
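A possible explanation for the crash, offered as an assumption since nothing in this thread confirms it: the value in the error message, (6.86045e-07,-2.22909e-07), is printed as a (real, imaginary) pair, which suggests the learning rate handed to SGD was a complex number. Under a polynomial decay schedule, a step counter that runs one past the maximum makes the base negative, and in Python 3 a negative float raised to a fractional power evaluates to a complex number. The sketch below uses hypothetical helper names (poly_lr, poly_lr_clamped) and illustrative values (base_lr = 0.01, power = 0.9); it is not the repository's code.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay with no clamping (hypothetical helper)."""
    return base_lr * (1 - step / max_steps) ** power


def poly_lr_clamped(base_lr, step, max_steps, power=0.9):
    """Same schedule, with the base clamped at zero so the result stays real."""
    return base_lr * max(0.0, 1 - step / max_steps) ** power


max_steps = 40000

# One step past the end: the base (1 - step / max_steps) is slightly negative,
# so the fractional power produces a complex number in Python 3.
overshoot = poly_lr(0.01, max_steps + 1, max_steps)
print(type(overshoot).__name__)  # complex

# Clamping keeps the learning rate a real float at and beyond the last step.
print(poly_lr_clamped(0.01, max_steps + 1, max_steps))  # 0.0
```

With these illustrative values, the negated result (-overshoot) comes out close to the pair shown in the traceback, which fits the line p.data.add_(-group['lr'], d_p). If this is indeed the cause, clamping the base as in poly_lr_clamped, or stopping the loop before the counter exceeds the maximum, would be two ways to avoid it.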

Your comment was posted on 8 Nov 2019, and the code at that time had a bug (Issue 11) in the CriterionPixelWise() function: N,C,W,H = preds_T[0].shape; softmax_pred_T = F.softmax(preds_T[0].view(-1,C), dim=1). This bug was fixed on Dec 9, 2019, so why does your result (IoU 0.685917) seem not bad (though it also didn't reach the claimed result)?
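The thread does not show what the fix was. One reading of the quoted lines, stated here as an assumption, is that flattening a tensor laid out as (N, C, H, W) with view(-1, C) does not put each pixel's channel values in one row, so a softmax over dim=1 would normalize across neighbouring pixels of a single channel rather than across classes. The pure-Python sketch below mimics row-major flattening on a tiny example; the variable names are hypothetical, and it makes no claim about the repository's actual patch.

```python
# Tiny "tensor" in N,C,H,W order: 1 image, C=2 channels, 2x2 pixels.
# Channel 0 holds the values 0..3, channel 1 holds 10..13.
N, C, H, W = 1, 2, 2, 2
nchw = [[[[0, 1], [2, 3]], [[10, 11], [12, 13]]]]

# Row-major flattening, the order a contiguous view/reshape follows.
flat = [nchw[n][c][h][w]
        for n in range(N) for c in range(C)
        for h in range(H) for w in range(W)]

# Equivalent of .view(-1, C) applied directly to N,C,H,W data:
# each row holds neighbouring pixels of ONE channel.
rows_direct = [flat[i:i + C] for i in range(0, len(flat), C)]

# Equivalent of moving channels last (N,H,W,C) before flattening:
# each row is one pixel's vector across channels.
rows_channels_last = [[nchw[n][c][h][w] for c in range(C)]
                      for n in range(N) for h in range(H) for w in range(W)]

print(rows_direct)         # [[0, 1], [2, 3], [10, 11], [12, 13]]
print(rows_channels_last)  # [[0, 10], [1, 11], [2, 12], [3, 13]]
```

In PyTorch, the channels-last layout is typically obtained with permute(0, 2, 3, 1) followed by contiguous() before calling view(-1, C).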