jfzhang95 / pytorch-deeplab-xception

DeepLab v3+ model in PyTorch. Support different backbones.
MIT License
2.92k stars · 783 forks

Unable to reproduce result on VOC with small batch size #49

Open xinario opened 5 years ago

xinario commented 5 years ago

Hi, thanks for releasing this great repo. I have a problem in reproducing the result on VOC dataset.

I noticed that your released pretrained model (ResNet backbone, train/eval output stride 16/16) gives 78.43% mIoU. But when I train with a batch size of 2 (since I only have one GPU), the mIoU is really bad. I'm wondering whether a large batch size is necessary for training DeepLab v3+.
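For what it's worth, one commonly suggested workaround when the batch is tiny (a hedged sketch, not code from this repo): freeze the BatchNorm layers so batches of 2 don't corrupt the running statistics.

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    """Keep BatchNorm layers in eval mode and stop training their
    affine parameters, so tiny batches don't corrupt running stats."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.eval()                              # use stored running mean/var
            if m.affine:
                m.weight.requires_grad_(False)    # freeze gamma
                m.bias.requires_grad_(False)      # freeze beta

# toy check on a small conv stack
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
net.train()
freeze_batchnorm(net)
print(net[1].training)              # False: BN stays in eval mode
print(net[1].weight.requires_grad)  # False: affine params frozen
```

Note that calling `model.train()` puts BN back into training mode, so the freeze has to be re-applied after every `train()` call in the loop.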

jfzhang95 commented 5 years ago

Hi,

Did you use both voc and sbd datasets to train your model?

xinario commented 5 years ago

Yeah, I did.

Chenfeng1271 commented 5 years ago

I set the batch size to 7 and trained resnet-deeplab on one GPU; as he said, the result is not very good. Is a large batch necessary?

herleeyandi commented 5 years ago

Same problem. I am using this setting since I only have 1 GPU, with batch size 2. The best mIoU is 0.4241, and I am also using the SBD dataset.

CUDA_VISIBLE_DEVICES=0 python train.py --backbone resnet --lr 0.007 --workers 4 --use-sbd --epochs 50 --batch-size 2 --gpu-ids 0 --checkname deeplab-resnet --eval-interval 1 --dataset pascal

(screenshot of training log omitted)
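If more GPU memory isn't an option, gradient accumulation can at least emulate the optimizer-side effect of a larger batch (a hedged sketch with a toy stand-in model, not this repo's train.py; note BatchNorm still only sees micro-batches of 2, so this does not fully reproduce batch-16 training):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in model; in practice this would be the DeepLab network.
model = nn.Sequential(nn.Flatten(), nn.Linear(12, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.007)
loss_fn = nn.CrossEntropyLoss()

micro_batch, accum_steps = 2, 8   # 2 x 8 = effective batch of 16
data = [(torch.randn(micro_batch, 3, 2, 2),
         torch.randint(0, 4, (micro_batch,))) for _ in range(16)]

updates = 0
opt.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accum_steps  # scale so grads average
    loss.backward()                            # grads accumulate in .grad
    if (step + 1) % accum_steps == 0:
        opt.step()                             # one update per 16 images
        opt.zero_grad()
        updates += 1

print(updates)  # 2 optimizer steps for 32 images
```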

Pyten commented 5 years ago

@jfzhang95 hi, thanks for releasing this great repo. I also ran into the problem @herleeyandi mentioned. The first time, I trained resnet-deeplab on 4 GPUs using only voc2012 and got the following final result, which is lower than yours:

Acc: 0.9367393799223944, Acc_class: 0.8456251915935047, mIoU: 0.7503445318159087, fwIoU: 0.8860949569691601

Then I trained on voc2012 plus SBD, downloaded from http://home.bharathh.info/pubs/codes/SBD/download.html, which is said to contain annotations for 11355 images taken from the PASCAL VOC 2011 dataset. But I got a much worse result:

=> Epoch 49, learning rate = 0.0002, previous best = 0.6884
Train loss: 0.019 [Epoch: 49, numImages: 10582] Loss: 25.682
Test loss: 0.170 Validation: [Epoch: 49, numImages: 1449]
Acc: 0.7204461224682133, Acc_class: 0.15227327091506154, mIoU: 0.13477149028396554, fwIoU: 0.5229250562245572, Loss: 30.954

The only parameter I changed was the batch size, from 16 to 8; everything else remains the same as yours. So I am wondering how this happens. Please give me some help.
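One thing worth double-checking when adding SBD (a speculative sketch, not necessarily the cause of the drop above): SBD overlaps the VOC 2012 val split, so the augmented training list has to merge the SBD ids with the VOC train ids while excluding every VOC val id. The toy ids below are hypothetical, for illustration only.

```python
def build_train_aug(voc_train_ids, sbd_ids, voc_val_ids):
    """Merge VOC train with SBD, excluding VOC val so validation stays clean."""
    return sorted((set(voc_train_ids) | set(sbd_ids)) - set(voc_val_ids))

# toy ids for illustration
voc_train = ["2007_000001", "2007_000002"]
voc_val   = ["2007_000003"]
sbd       = ["2007_000002", "2007_000003", "2008_000005"]

print(build_train_aug(voc_train, sbd, voc_val))
# -> ['2007_000001', '2007_000002', '2008_000005']
```

If the val ids are not excluded (or the two label formats are mixed up), mIoU numbers like the ones above can become meaningless even though training loss looks fine.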

MCDM2018 commented 5 years ago

@PTL2011 Before giving my opinion: I don't want to talk down this excellent repository, but I think there is a mistake here. Many people have already run into trouble following that suggestion. So, as you did at first, we should train the model with only VOC 2012.

opee007 commented 5 years ago

(quoting @Pyten's report above)

Hello, I have encountered the same problem. Were you able to solve it?