w1oves / Rein

[CVPR 2024] Official implementation of "Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation"
https://zxwei.site/rein
GNU General Public License v3.0

Resizing During Training and Eval #8

Closed vivekvjk closed 8 months ago

vivekvjk commented 8 months ago

Hi! I noticed that the training pipeline for DG on GTA-->Cityscapes trains on 512x512 crops of a downsampled GTA image (1280, 720). However, during evaluation on Cityscapes, you evaluate on 512x512 crops of a downsampled Cityscapes image (1024, 512).

Was this intended, given that evaluation should occur at the original Cityscapes image size (2048, 1024)?
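For reference, the setup being described corresponds roughly to the following MMSegmentation-style sketch. The resolution values come from this thread; the transform names, config keys, and sliding-window stride are illustrative assumptions rather than the repo's actual configuration.

```python
# Sketch of the train/eval resizing under discussion (MMSegmentation-style).
# Resolutions are taken from this thread; transform names and the test-time
# stride are illustrative assumptions, not copied from the Rein configs.

# GTAV training: downsample full frames to 1280x720, then take 512x512 random crops.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', scale=(1280, 720)),          # (w, h): downsampled GTAV frame
    dict(type='RandomCrop', crop_size=(512, 512)),   # 512x512 training crops
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackSegInputs'),
]

# Cityscapes evaluation: downsample the 2048x1024 frames to 1024x512 and run
# 512x512 sliding-window inference, rather than testing at native resolution.
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', scale=(1024, 512)),          # (w, h): downsampled Cityscapes frame
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs'),
]
test_cfg = dict(mode='slide', crop_size=(512, 512), stride=(341, 341))  # stride is a placeholder
```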

w1oves commented 8 months ago

Thank you for your interest in our work! This question was considered in our experimental setup. Since the model is trained on lower-resolution images, it is never exposed to high-resolution ones, so we did not find it necessary to test at high resolution. While we did not conduct the corresponding experiments, we speculate that test performance would improve with high-resolution images, as they contain more information than their downsampled counterparts. However, we believe this change would have only a minor impact on performance, and it is not central to our core contributions. To facilitate easier evaluation of the model, we chose to assess it on low-resolution 1024x512 images.

vivekvjk commented 8 months ago

Thank you for the quick reply! When reporting other methods and comparing to yours, do you evaluate them under the same setting (at downsampled size)?

w1oves commented 8 months ago

If you're still seeking answers or further clarification, we encourage you to explore our latest checkpoints. To enhance real-world applicability and showcase the capabilities of our approach, we carried out two experimental series: synthetic-to-real and Cityscapes-to-ACDC. We've made the corresponding checkpoints available at Cityscapes and UrbanSyn+GTAV+Synthia, both of which have demonstrated strong results. To ensure peak performance, these configurations were trained and tested at their native resolutions. For usage instructions, refer to the discussion here.

w1oves commented 8 months ago

> Thank you for the quick reply! When reporting other methods and comparing to yours, do you evaluate them under the same setting (at downsampled size)?

In our paper, the performance metrics for other methods were sourced directly from their original publications. Since the PEFT and DGSS methods mentioned in Tables 2 and 3 were not adapted for VFMs, we replicated them under the configurations previously described. In other words, every metric presented in each table either comes from its original publication or is obtained under configurations that are strictly identical to those used for our method in the same table.

yasserben commented 8 months ago

> Hi! I noticed that the training pipeline for DG on GTA-->Cityscapes trains on 512x512 crops of a downsampled GTA image (1280, 720). However, during evaluation on Cityscapes, you evaluate on 512x512 crops of a downsampled Cityscapes image (1024, 512).
>
> Was this intended, given that evaluation should occur at the original Cityscapes image size (2048, 1024)?

Hi! I have two questions regarding the question above:

Thank you for your help

w1oves commented 8 months ago
yasserben commented 8 months ago

Thank you very much for all your answers! This is really helpful.

You said that you provided validation for DINOv2 on 1024x2048 resolution images, but in the link you provided, it's for 1024x1024. Am I missing something, or is it just a typo?

Thank you!

w1oves commented 8 months ago

During training, images are resized to 1024x2048 and then cropped to 1024x1024. For validation, a 1024x1024 sliding window is applied to the 1024x2048 images. Hence, I refer to it as a checkpoint at 1024x2048.
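In MMSegmentation terms, that setup would look roughly like the sketch below. The full-resolution resize, the 1024x1024 crop, and the sliding-window validation come from the comment above; the exact transform names, config keys, and stride value are assumptions rather than the repo's actual files.

```python
# Sketch of the high-resolution checkpoint setup (MMSegmentation-style).
# Values come from the comment above; stride and keys are illustrative.

crop_size = (1024, 1024)

# Training: resize to full Cityscapes resolution, then take 1024x1024 crops.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', scale=(2048, 1024)),       # (w, h), i.e. the "1024x2048" above
    dict(type='RandomCrop', crop_size=crop_size),  # 1024x1024 training crops
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackSegInputs'),
]

# Validation: no downsampling; slide a 1024x1024 window over the full
# 2048x1024 frames. The stride here is a placeholder, not the repo's value.
model = dict(
    test_cfg=dict(mode='slide', crop_size=crop_size, stride=(683, 683)),
)
```

With slide mode, overlapping window predictions are aggregated over the whole frame, so the model only ever sees 1024x1024 inputs even though evaluation covers the full 2048x1024 image.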

yasserben commented 8 months ago

Thank you very much for your help!