juhongm999 / hsnet

Official PyTorch Implementation of Hypercorrelation Squeeze for Few-Shot Segmentation, ICCV 2021

Unfair Comparisons especially on COCO #6

Closed: deepAICrazy closed this issue 3 years ago

deepAICrazy commented 3 years ago

https://github.com/juhongm999/hsnet/issues/5

This is not actually the case, because you did not fairly compare your results with RePRI and PFENet in Table 1, where their numbers are copied directly from their papers. In https://github.com/mboudiaf/RePRI-for-Few-Shot-Segmentation/blob/master/src/dataset/transform.py#L80 they keep the aspect ratio of the resized images the same as the original image, but in your implementation https://github.com/juhongm999/hsnet/blob/e288916debe5290b3e9554fb61e13a474e00f885/data/dataset.py#L25 the images are simply resized to a 1:1 aspect ratio without preserving the original aspect ratio of the label.
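To make the difference concrete, here is a minimal sketch of the two resize strategies being contrasted (not the exact code from either repository; sizes and interpolation modes are illustrative assumptions):

```python
# Minimal sketch of the two resize strategies discussed above.
from PIL import Image

def resize_keep_ratio(img: Image.Image, long_side: int = 473) -> Image.Image:
    # RePRI/PFENet-style: scale so the longer side matches `long_side`,
    # preserving the original aspect ratio (padding/ignore handling omitted).
    w, h = img.size
    scale = long_side / max(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

def resize_square(img: Image.Image, size: int = 400) -> Image.Image:
    # HSNet-style: force a fixed size x size input, so the aspect ratio
    # becomes 1:1 regardless of the original image shape.
    return img.resize((size, size), Image.BILINEAR)
```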

Regarding my second question: given that all previous methods use 417, 473, or the original sizes for evaluating COCO and PASCAL, I do not understand why you used size 400 on COCO and PASCAL, creating a brand **new** setting that makes it hard for others to follow and compare fairly, especially since, by your own account, size 400 does not even give the best performance. Normally we should report the setting with the best results. It is true that performance on PASCAL grows slightly as the training size grows, but the numbers remain comparable. The RePRI and PFENet models also cannot be directly tested with a 1:1 aspect ratio, because they are not trained on images resized to 1:1 in the non-255 (non-ignore) regions. However, on COCO I have tested PFENet: the results are much lower when it is evaluated against the original labels without resizing. This matches the results reported in the PFENet paper, and it is also mentioned in https://github.com/juhongm999/hsnet/issues/1#issuecomment-816485819. So I think the comparison is unfair unless you show COCO results with the original aspect ratios and the original sizes (or 417, 473) against the related methods (RePRI, PFENet, ASGNet, and so on), because resizing the labels to a smaller size does bring much better performance on COCO.

deepAICrazy commented 3 years ago

It is not every reader's duty to verify your code while your paper is under review, especially when a completely contradictory finding has been reported in other papers and comments. It would be better if you could show the COCO and PASCAL results evaluated with the original aspect ratios and the more recently used sizes (such as 473 and the original sizes) before editing and closing my issue.

juhongm999 commented 3 years ago

I interpret your claim to mean that our experimental setup is NOT EXACTLY THE SAME as those of PFENet and RePRI, but that alone does not make our experiments unfair comparisons. In fact, the differences in experimental setup put our method at a disadvantage while favoring the others, for three main reasons:

  1. First, ABSOLUTELY NO DATA AUGMENTATION is used in our method. Note that almost every method we compare against in Tables 1-3 of our paper adopts many different data augmentation techniques such as RandomScale, RandomCrop, RandomRotate, RandomVerticalFlip, RandomHorizontalFlip, and RandomGaussianBlur, whereas our method uses NONE of them (see the sketch after this list). Even with this disadvantage in effective training data, our method significantly outperforms the others by large margins: 1~6%p mIoU improvements (1-shot) on all three datasets, clearly demonstrating the superiority of the method.

  2. Although we were aware that larger image sizes (> 400) typically yield better mIoU, we used an image size of 400 because it is nicely, recursively divisible by a factor of 2. For hyperparameters, we chose simple numbers in line with the design principle we pursue: 'simplicity'.

  3. We barely engineered our model & code, and tried to keep the number of hyperparameters as small as possible. Note that, to squeeze out extra performance, many existing methods add further engineering on top of the main methodology they propose: auxiliary training objectives, learning-rate/weight-decay schedules, superpixels, pseudo labels, etc. All of these extra modules and tricks introduce additional hyperparameters to tune (which is burdensome) for little mIoU improvement. We could have added such extra engineering to gain a bit more performance, but we decided not to, because another design principle we pursue is 'minimal dependency'. We have also tried our best to provide easily readable and readily runnable code by meticulously refactoring it with a very small number of hyperparameters & arguments. (You can easily notice this when you actually compare our code with others'.)
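To illustrate the contrast in point 1, here is a rough, image-only sketch of a typical augmented training pipeline versus a plain resize-and-normalize pipeline; parameters are illustrative assumptions, not the exact code of any of the compared methods (segmentation codebases apply these transforms jointly to image and mask):

```python
# Rough, image-only sketch of "heavily augmented" vs. "no augmentation".
import torchvision.transforms as T

imagenet_norm = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# Typical augmented training pipeline used by the compared methods (illustrative).
augmented_train = T.Compose([
    T.RandomResizedCrop(473, scale=(0.5, 1.0)),  # RandomScale + RandomCrop
    T.RandomRotation(10),                        # RandomRotate
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.GaussianBlur(kernel_size=5),               # RandomGaussianBlur
    T.ToTensor(),
    imagenet_norm,
])

# A resize-and-normalize-only pipeline: no augmentation at all.
plain_train = T.Compose([
    T.Resize((400, 400)),
    T.ToTensor(),
    imagenet_norm,
])
```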

See the replies below for more details on these points and for the results you requested.

juhongm999 commented 3 years ago

Below we provide our evaluation results on PASCAL-5i and COCO-20i using the original image size (with a ResNet101 backbone). The superscript 'org' denotes our model evaluated with the original image size.

PASCAL-5i: [results table screenshot]

COCO-20i: [results table screenshot]

We have updated our repository so that you can reproduce the results above. To reproduce the results with the original image size, append the additional argument '--use_original_imgsize' as below:

python test.py '...other arguments...' --use_original_imgsize 

The evaluation results are comparable to those from our original setup (evaluation at image size 400x400), with a very slight mIoU drop (0.2%p on PASCAL-5i and 0.1%p on COCO-20i), and they still set a new state of the art by a large margin compared to the other methods [4, 35, 43, 67, 71]. We suspect this slight degradation arises because the training and testing conditions do not match: the model is trained on 400x400 images but tested with the original image size and aspect ratio. We observed the same kind of mismatch-induced drop when evaluating with image sizes of 417 and 473, which implies the issue can be alleviated by training with the same image size used at test time.
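For clarity, below is a minimal sketch of what evaluation at the original image size involves; this is an assumption about the mechanism behind '--use_original_imgsize', not the repository's exact code: the prediction produced on the 400x400 input is upsampled to the ground-truth label's original resolution before computing IoU.

```python
# Sketch: score a 400x400 prediction against the original-resolution label.
import torch
import torch.nn.functional as F

def iou_at_original_size(logits: torch.Tensor, gt_mask: torch.Tensor) -> float:
    # logits: 1 x 2 x 400 x 400 network output; gt_mask: H x W original label
    # (ignore-region handling omitted for brevity).
    h, w = gt_mask.shape
    logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=True)
    pred = logits.argmax(dim=1).squeeze(0)                # H x W binary prediction
    inter = ((pred == 1) & (gt_mask == 1)).sum().float()
    union = ((pred == 1) | (gt_mask == 1)).sum().float()
    return (inter / union.clamp(min=1)).item()
```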

juhongm999 commented 3 years ago

Answer: Indeed, we could have used the setting with the best results. However, given such large performance improvements on all three datasets (~6%p, ~3%p, and ~1%p mIoU improvements on PASCAL-5i, COCO-20i, and FSS-1000 respectively in the 1-shot setting, and ~5%p, ~7%p, and ~0.4%p respectively in the 5-shot setting), we were satisfied with the current experimental setup, which does not even use any data augmentation; note that most previous methods such as PFENet (https://github.com/Jia-Research-Lab/PFENet/blob/master/util/transform.py) and RePRI (https://github.com/mboudiaf/RePRI-for-Few-Shot-Segmentation/blob/master/src/dataset/transform.py) adopt many different kinds of data augmentation such as RandomScale, RandomCrop, RandomRotate, RandomVerticalFlip, RandomHorizontalFlip, and RandomGaussianBlur. Our model, however, uses none of these, as seen in our code: https://github.com/juhongm999/hpnet-dev/blob/master/data/dataset.py.

Moreover, instead of using an image size of 417 or 473, our model takes an image size of 400 for two main reasons:

  1. The sizes 417 and 473 seemed like arbitrary numbers to us. We tried to find out why most previous methods use these particular numbers but could not find a good justification; to us they appear heuristic. If you have any reliable sources or papers that explain the meaning of the sizes 417 and 473, please share them with us. We would be grateful.
  2. An image size of 400 is nicely (recursively) divisible by a factor of 2, resulting in three hypercorrelations with spatial sizes of 50, 25, and 13, as described in our paper (Appendix); see the quick check after this list. In our model design, simplicity and neatness were also major concerns, because this approach (analyzing dense feature correlations with high-dimensional convolutions) had never been explored in previous few-shot segmentation work. Even though our model performs comparably (with a slight performance boost) with input image sizes of 417 or 473, the improvements were negligible compared to the large performance gains the current model already achieves.
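As a quick check of the divisibility argument in point 2 (purely illustrative; 417 and 473 are included only for comparison), the spatial sizes of feature maps at strides 8, 16, and 32 with ceiling division are:

```python
# Feature-map spatial sizes at strides 8, 16, 32 for several input sizes.
import math

for size in (400, 417, 473):
    print(size, [math.ceil(size / s) for s in (8, 16, 32)])
# 400 -> [50, 25, 13]; 417 -> [53, 27, 14]; 473 -> [60, 30, 15]
```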

We hope our answers helped.

juhongm999 commented 3 years ago

We've added & updated our comments. Do they address your concerns?