wangf3014 / SCLIP

Official implementation of SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Question about performance and hyperparameters #5

Open changhaonan opened 10 months ago

changhaonan commented 10 months ago

Hello,

I am testing the performance on some sample images from ADE20k (e.g. ADE_test_00000002.jpg, ADE_test_00000005.jpg). I am using the vocabulary from "cls_ade20k.txt" for query.

I first checked the raw quality of the dense features without smoothing by setting pamr_steps = 0 and slide_crop = 0. The results seem quite noisy.


After setting pamr_steps = 1 and slide_crop = 227, the results improve, though they are still very different from the demos shown in the paper.

I'm not sure whether you see similar results on your side, or whether I did something wrong during inference. Are there other hyper-parameters I can tune? It would be great if you could provide a script for single-image inference. Thanks!

wangf3014 commented 10 months ago

Hi, thank you for your questions.

In these two cases, your configuration is not sufficient to attain good qualitative results. ADE20k is the most challenging benchmark in our experiments, with only 16.45% mIoU on its validation split (although that number is already the current state of the art for zero-shot open-vocabulary semantic segmentation). ADE20k images involve 150 potential categories but have relatively low resolutions (around the same level as PASCAL VOC). So, if you want favorable visualization results, you can:

1) Scale up the images (SCLIP scales very well with image size), and/or
2) Reduce the number of categories (that's the advantage of open-vocabulary segmentation).

I simply tried the following configs

slide_stride=56,
slide_crop=224,
pamr_steps=10,

and classnames for ADE_test_00000002.jpg:

airstrip
sky
airplane
car
grass

and classnames for ADE_test_00000005.jpg:

room, store, escalator
person
carpet
chair
screen
bag

You will get the following two qualitative results (attached visualizations for ADE_test_00000002.jpg and ADE_test_00000005.jpg).
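In case it helps, here is a tiny sketch of how you could write such a reduced, per-image vocabulary to a text file, assuming the same format as the class lists above (one category per line, with comma-separated names on a line treated as synonyms); the output path is only an example, not a file shipped with the repo:

# Sketch: write a reduced vocabulary file in the one-category-per-line format shown above.
classes = ["room, store, escalator", "person", "carpet", "chair", "screen", "bag"]
with open("./configs/cls_custom.txt", "w") as f:  # example path, choose your own
    f.write("\n".join(classes) + "\n")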

wangf3014 commented 10 months ago

Here are some more details about the hyper-parameters. By default, we resize ADE20k images to have a shorter side of 336 and then perform slide inference with a 224-sized window and a 112 stride. However, this setup is not necessarily optimal for SCLIP; we use it for a fair comparison with the existing baselines. In fact, SCLIP scales very well with image size, so you can improve both qualitative and quantitative results by scaling up the image and window sizes, and de-noise the segmentation by reducing the sliding stride and applying PAMR.

And don't worry, I will add a single-image inference script and its corresponding configurations later.
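To make the sliding scheme concrete, here is a generic sliding-window inference sketch in the spirit of the description above (224 window, 112 stride, overlapping logits averaged). It is not the repo's actual implementation, and the model interface (a callable returning per-pixel class logits) is an assumption:

import torch
import torch.nn.functional as F

def slide_inference(model, img, crop=224, stride=112):
    # img: (1, 3, H, W) tensor; model(patch) is assumed to return (1, C, h, w) logits.
    _, _, H, W = img.shape
    h_grids = max(H - crop + stride - 1, 0) // stride + 1
    w_grids = max(W - crop + stride - 1, 0) // stride + 1
    logits, count = None, torch.zeros(1, 1, H, W)
    for i in range(h_grids):
        for j in range(w_grids):
            y1 = min(i * stride, max(H - crop, 0))
            x1 = min(j * stride, max(W - crop, 0))
            y2, x2 = min(y1 + crop, H), min(x1 + crop, W)
            out = model(img[:, :, y1:y2, x1:x2])
            out = F.interpolate(out, size=(y2 - y1, x2 - x1),
                                mode='bilinear', align_corners=False)
            if logits is None:
                logits = torch.zeros(1, out.shape[1], H, W)
            logits[:, :, y1:y2, x1:x2] += out  # accumulate overlapping windows
            count[:, :, y1:y2, x1:x2] += 1
    return logits / count  # average overlaps; every pixel is covered at least once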

chenqi1126 commented 10 months ago

Hi, @wangf3014, thanks for your work. I only get poor performance (22% mIoU on VOC21) when removing the sliding windows, which is far from the 59% obtained with them. Many predictions come out as almost all background. Could you please explain this? Thanks.

wangf3014 commented 10 months ago

Hi @chenqi1126, thank you for your question. This poor performance is caused by one of the de-noising strategies. To avoid it, you can simply disable the "area_thd" parameter, and you will then get 58.5% mIoU with non-sliding inference. Specifically, the model configuration is

model = dict(
    name_path='./configs/cls_voc21.txt',
    logit_scale=65,
    prob_thd=0.1,
    slide_crop=0
)

The "area_thd" option is a de-noising strategy for tasks with a background category: it forces any predicted class whose total area falls below the threshold to be reassigned to the background category. Without sliding inference, the results are often noisy and contain many small regions, so "area_thd" is not suitable for this protocol.
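For illustration, here is a rough sketch of that idea (not the repo's exact implementation; the function and argument names are placeholders):

import torch

def apply_area_threshold(pred, area_thd, bg_index=0):
    # pred: (H, W) tensor of hard class labels. Any non-background class whose
    # predicted area fraction is below area_thd is reassigned to the background.
    n_pixels = pred.numel()
    for c in pred.unique().tolist():
        if c == bg_index:
            continue
        mask = pred == c
        if mask.float().sum().item() / n_pixels < area_thd:
            pred[mask] = bg_index
    return pred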

chenqi1126 commented 10 months ago

It works in your configuration. Thank you for your reply.

mc-lan commented 10 months ago

> And don't worry, I will add a single-image inference script and its corresponding configurations later.

Hi, @wangf3014, thank you for your nice work! I wonder if the single-image inference script has been added to this repo?