Wrong Evaluation script

swamiviv commented 7 years ago

Nice work porting the model. I found that your evaluation code is wrong. You are evaluating on an image by image basis and summing up IoUs across the val. set. That is not how mean IoU is computed. It is accumulated over pixels. Refer to the FCN code here to see what I mean: https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/score.py

If I change your evaluation to the correct eval. script, your trained model gets a mIoU of only 72.1%. Also, the deeplab resnet-101 model you had ported into torch only gives 75.4% as opposed to the original 76.3% from CAFFE. This might be due to the different preprocessing you do compared to the deeplab guys or might be due to small errors in porting the model.

It will be great if you can confirm this and put out a log in the README saying you are fixing your eval. script.

isht7 commented 7 years ago

Yes, I found out about the mistake in the eval script just recently and I will be fixing it soon. Thank you for pointing this out. The mIoU of 72.1% could be due to issue #4.(will also be fixed soon). There is also one more difference between the caffe version and the pytorch version(during training)- the caffe version does scaling(0.5,0.75,1,1.25,1.5) on some fixed scales, while the pytorch version randomly picks a number between (0.5, 1.3). Randomly picking a number between (0.5, 1.5) does not fit while training on Titan X.

fanq15 commented 7 years ago

Is this bug fixed?

isht7 commented 7 years ago

The fixed evaluation script is in the dev branch. Find the path of this new script in the Results sections of Readme.

isht7 commented 7 years ago

The evaluation script has been fixed now. evalpyt2.py is the correct script. The old script, evalpyt.py, is still there to maintain continuity, and the difference between them has been clearly mentioned in the results section of readme . We get 71.13% mean IOU from the pytorch trained model. train_iter_20000.caffemodel gives 74.39%. The converted .pth model also gives 74.39%. Readme has also been updated to provide scripts to verify each of these performance claim. Please note that in the ground truth images, the label 255 is merged with background during the evaluation because this was done during training also.

isht7 commented 7 years ago

@swamiviv you said(and also in table 4 of the paper) that the .caffemodel gives 76.3% on the val set, but I am getting 74.39% only. Why could this be, is this because I am merging the boundary(255) label with the background? Are you able to get 76.3% yourself? I am using this script. In the end, I print 2 values, the second one is the one which should be considered.

swamiviv commented 7 years ago

Thanks for correcting the eval. code! It is correct that you are getting lesser value since the evaluation script merges the 255 value with bg. You can just let the 255 value as is in the labels when evaluating it. If you look at the 'fast_hist' function, you will find that those pixels with class 255 are ignore automatically. I tried using the .caffemodel in caffe and got the same result. I haven't tried the modified pytorch model. I will give it a try soon. But if you redo this evaluation as I stated here, I am fairly confident you will able to reproduce those numbers.

On Tue, Jul 18, 2017 at 1:11 PM, Isht Dwivedi notifications@github.com wrote:

@swamiviv https://github.com/swamiviv you said(and also in table 4 of the paper) that the .caffemodel gives 76.3% on the val set, but I am getting 74.39% only. Why could this be, is this because I am merging the boundary(255) label with the background? Are you able to get 76.3% yourself? I am using this https://github.com/isht7/pytorch-deeplab-resnet/blob/development/caffe_evalpyt.py script. In the end, I print 2 values, the second one is the one which should be considered.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/isht7/pytorch-deeplab-resnet/issues/5#issuecomment-316132206, or mute the thread https://github.com/notifications/unsubscribe-auth/AGWIeNv9FXyFodVw1zPvmLf4YGNymy-uks5sPOdIgaJpZM4N6Bll .

-- --Swami

isht7 commented 7 years ago

Update: after leaving the 255 label as it is(as suggested by you), I am now getting 75.54% from the train_iter_20000.caffemodel which is still lower than that reported in the paper(76.3%). If you could look into my code to find the possible cause of this, it would be great.

swamiviv commented 7 years ago

I reran my script few hours back with the converted pytorch model (.pth) of the train_iter_20000.caffemodel file and I could reproduce the exact numbers from the paper (76.42%) over the validation set of 1449 images. I will look into your script soon when I get a chance.

On Wed, Jul 19, 2017 at 5:47 PM, Isht Dwivedi notifications@github.com wrote:

Update: after leaving the 255 label as it is(as suggested by you), I am now getting 75.54% from the train_iter_20000.caffemodel which is still lower than that reported in the paper(76.3%). If you could look into my code https://github.com/isht7/pytorch-deeplab-resnet/blob/development/caffe_evalpyt.py to find the possible cause of this, it would be great.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/isht7/pytorch-deeplab-resnet/issues/5#issuecomment-316528036, or mute the thread https://github.com/notifications/unsubscribe-auth/AGWIeIJ4zNGywuyCX5PhWb9PhxJXkWtYks5sPnmNgaJpZM4N6Bll .

-- --Swami

isht7 commented 7 years ago

If you are short on time, may be you could share the script with me and I could look for differences?

swamiviv commented 7 years ago

Sure. Everything else being the same, this is the way each image is processed and evaluated. The only difference I see is read/write of the image and GT that have been done using PIL to exactly preserve the range and the RGB channel order. Let me know if you find anything strange here.

        img = np.zeros((513,513,3));        
        img_temp = np.asarray(Image.open(os.path.join(im_path,i[:-1]+'.jpg')),dtype=np.float32)[:,:,::-1]
        img_original = img_temp
        img_temp -= vgg_mean
        img[:img_temp.shape[0],:img_temp.shape[1],:] = img_temp

        gt = np.asarray(Image.open(os.path.join(gt_path,i[:-1]+'.png')),dtype=np.uint8)

        output_list = model(Variable(torch.from_numpy(img[np.newaxis, :].transpose(0,3,1,2)).float(),volatile = True).cuda())
        interp = nn.UpsamplingBilinear2d(size=(513, 513))
        output = interp(output_list[3]).cpu().data[0].numpy()
        output = output[:,:img_temp.shape[0],:img_temp.shape[1]]
        output = output.transpose(1,2,0)
        output = np.argmax(output,axis = 2)
        hist += fast_hist(gt.flatten(),output.flatten(),num_classes)

isht7 commented 7 years ago

even after using the above code, I get the exact same result as I got before (75.54% mean IOU)! Why could this be happening?

chenyzh28 commented 6 years ago

@isht7, hi, have you solved the problem? Did you get 76.35% as reported in the paper? By the way, how can I convert the gt images and the output images to color images? Or How did you process the gt images at the beginning?

isht7 / pytorch-deeplab-resnet

Wrong Evaluation script #5