leeyeehoo / CSRNet-pytorch

CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes


Model does not converge when training on Part_B #12

Open libaikai opened 5 years ago

libaikai commented 5 years ago

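Environment: Win10 + CUDA 9.0 + PyTorch 0.4.0 (GTX 1070). I loaded the VGG16 pre-trained weights. If I directly add an 8x upsample, the model does not converge; without the upsample, MAE stays at around 68. I have tried changing lr to 1e-6 and 1e-7, and I don't know what is going wrong. Are there any training logs, or any training tricks?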

vlad3996 commented 5 years ago

I've obtained 10.42 MAE and 16.89 MSE on Part B without augmentation and with the other default params from the repo.
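For readers comparing numbers like the ones above, here is a minimal sketch of how the two metrics are usually computed over per-image counts. It assumes the crowd-counting convention in which "MSE" is the square root of the mean squared count error; the function name and the plain-list inputs are illustrative and are not taken from the repo.

```python
import math


def mae_and_mse(predicted_counts, true_counts):
    """Return (MAE, MSE) over per-image crowd counts.

    Assumption: "MSE" follows the crowd-counting convention, i.e. the
    square root of the mean squared count error.
    """
    if len(predicted_counts) != len(true_counts) or not true_counts:
        raise ValueError("need two non-empty lists of equal length")
    # Signed count error for every image.
    errors = [p - t for p, t in zip(predicted_counts, true_counts)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, mse
```

Because the squared term weights large errors more heavily, a few badly mispredicted images can raise MSE well above MAE.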

libaikai commented 5 years ago

Can you share more details, such as the original lr and how many epochs it took to reach your best MAE on Part_B? Many thanks!

vlad3996 commented 5 years ago

I've tried training with a higher lr, other optimizers (Adam, Adadelta, COCOB), and changing the loss function during training, but couldn't achieve any better results. In the end I used the author's SGD with momentum and lr (1e-7), and after about 160 epochs I got roughly the paper's results.
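As a rough illustration of the update rule behind "SGD with momentum", here is a plain-Python sketch of one step, written in the form commonly used by deep-learning frameworks (no dampening, no weight decay). Only the learning rate 1e-7 comes from the thread; the momentum value 0.9 is a placeholder assumption, and the function name is hypothetical.

```python
def sgd_momentum_step(params, grads, velocity, lr=1e-7, momentum=0.9):
    """Return updated (params, velocity) after one SGD-with-momentum step.

    Update rule (no dampening, no weight decay):
        v <- momentum * v + g
        p <- p - lr * v
    lr=1e-7 is the value quoted in the thread; momentum=0.9 is only a
    placeholder, not a value confirmed by the thread.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

With a learning rate this small, a parameter only moves noticeably when the gradient magnitude is large, so the suitable lr depends strongly on how the loss is scaled.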

libaikai commented 5 years ago

thank you!

liuleiBUAA commented 5 years ago

When I train the model with train.py, the loss becomes nan after the first iteration. Do you have the same problem? @libaikai @vlad3996

Epoch: [0][1170/1200] Time 0.709 (0.417) Data 0.020 (0.017) Loss nan (nan)
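To find where the loss first goes bad, a saved training log in the format above can be scanned. This is a minimal sketch: the regular expression is inferred from the single log line quoted here, and the function names are hypothetical.

```python
import math
import re

# Pattern inferred from the one log line quoted in the thread.
LOG_PATTERN = re.compile(
    r"Epoch: \[(\d+)\]\[(\d+)/(\d+)\].*?Loss (\S+) \(([^)\s]+)\)"
)


def parse_loss(line):
    """Extract epoch, step, current loss and running average from a log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match.group(1)),
        "step": int(match.group(2)),
        "total_steps": int(match.group(3)),
        "loss": float(match.group(4)),      # float("nan") parses fine
        "avg_loss": float(match.group(5)),
    }


def has_nonfinite_loss(line):
    """True if the line parses and its loss or running average is nan/inf."""
    parsed = parse_loss(line)
    if parsed is None:
        return False
    return not (math.isfinite(parsed["loss"]) and math.isfinite(parsed["avg_loss"]))
```

Running `has_nonfinite_loss` over every line of a log and stopping at the first hit shows the step at which the loss diverged. Note that once the running average is nan it stays nan, so the first flagged line is the informative one.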

liuleiBUAA commented 5 years ago

I changed model.py from class CSRNet(nn.Module): def __init__(self, load_weights=False): to class CSRNet(nn.Module): def __init__(self, load_weights=True): and now the model converges. However, I cannot reach the MAE of 68 on Part A and 10.6 on Part B. Did you change the code like this? @vlad3996

vlad3996 commented 5 years ago

@liuleiBUAA your change in __init__ is somewhat unnecessary: you can load weights by providing a checkpoint via the --pre argument:

checkpoint = torch.load(args.pre)

model.load_state_dict(checkpoint['state_dict'])

I changed almost nothing (except hyperparams and loading from checkpoints during training) to obtain roughly the paper's results (the top model after training had about 9.1 on val and 10.2 MAE on test).

Then I rewrote some code for Python 3 in model.py, changed some hyperparams in train.py, and changed the image pre-processing in image.py and dataset.py.

P.S. I've obtained 8.02 MAE on Part B just by pre-training on another dataset with the default CSRNet architecture. P.P.S. Using dilations in the last conv layers leads to artifacts in the output heatmap (see https://arxiv.org/pdf/1705.09914.pdf )

liuleiBUAA commented 5 years ago

@vlad3996 Thank you. I am trying to train the model from the beginning. Have you run into this problem? 'CSRNet' object has no attribute 'seen'. I have to comment out

seen=model.seen

and then train.py works.
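A less invasive alternative to commenting the line out is to fall back to a default when the attribute is missing. The sketch below uses two hypothetical stand-in classes rather than the real CSRNet, and it assumes that starting the counter at 0 is acceptable when the model does not define seen.

```python
class ModelWithoutSeen:
    """Hypothetical stand-in for a model class that lacks `seen`."""


class ModelWithSeen:
    """Hypothetical stand-in for a model class that tracks `seen`."""

    def __init__(self):
        self.seen = 128


def get_seen(model, default=0):
    """Return model.seen, or `default` if the attribute does not exist."""
    # Assumption: 0 is a sensible starting value for the counter.
    return getattr(model, "seen", default)
```

In train.py this would amount to replacing the failing line with something like seen = getattr(model, "seen", 0). The missing attribute may also indicate that the model.py in use differs from the one in the repo, which is worth checking first.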

vlad3996 commented 5 years ago

@libaikai I've just cloned the original repo and run training with Python 2.7 and PyTorch 0.4.1. No errors.

Do you use the VGG16 pre-trained weights? It's a little tricky to download the weights on Python older than 2.7.9 (I ran into the error described here, then just downloaded the weights from here and placed them manually):

mv vgg16-397923af.pth /home/vladislav.leketush/.torch/models/vgg16-397923af.pth
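The same manual placement can be scripted. This is a minimal sketch with a hypothetical helper name; the cache directory ~/.torch/models is taken from the mv command above and is an assumption for any other setup, since newer PyTorch releases use a different cache location.

```python
import shutil
from pathlib import Path


def place_weights(downloaded_file, cache_dir):
    """Copy a manually downloaded weights file into the model cache directory."""
    src = Path(downloaded_file)
    if not src.is_file():
        raise FileNotFoundError(src)
    dest_dir = Path(cache_dir).expanduser()
    # Create the cache directory if it does not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    return dest


# Example call (assumed paths, adjust to your machine):
# place_weights("vgg16-397923af.pth", "~/.torch/models")
```

Copying rather than moving keeps the original download, which helps if the cache directory turns out to be the wrong one for the installed PyTorch version.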

liuleiBUAA commented 5 years ago

@vlad3996, which dataset did you use to get the better result on Part B? And do you mean that dilated conv should not be used in the last layer?

wait1988 commented 5 years ago

@vlad3996 Could you be more specific about what you modified in image.py and dataset.py? And what gain did you get from these modifications? Thx.

sxxtaotao commented 5 years ago

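Hello, my environment is about the same as yours, but the download of the VGG16 weights keeps getting interrupted for no apparent reason. Have you run into this problem? @libaikai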