changhaonan opened this issue 10 months ago
Hi, thank you for your questions.
In these two cases, your configurations are not sufficient to get good qualitative results. ADE20k is the most challenging benchmark in our experiments, with only 16.45% mIoU on its validation split (although this number is already the current state of the art for zero-shot open-vocabulary semantic segmentation). ADE20k images have 150 potential categories but relatively low resolutions (around the same level as PASCAL VOC). So, if you want better visualization results, you can
1) Scale up the images (SCLIP scales very well with image size), and/or 2) Reduce the number of categories (that's the advantage of open-vocabulary segmentation).
I simply tried the following config
slide_stride=56, slide_crop=224, pamr_steps=10,
and classnames for ADE_test_00000002.jpg:
airstrip sky airplane car grass
and classnames for ADE_test_00000005.jpg:
room, store, escalator person carpet chair screen bag
You will get the following two qualitative results:
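If you want to try a reduced vocabulary like the ones above, the class names come from the text file referenced by name_path in the config. Below is a minimal sketch of swapping in a custom list; the one-class-per-line format is only for illustration, so follow the format of the existing cls_*.txt files.

# Minimal sketch: write a reduced vocabulary and point the config at it.
# The one-class-per-line format is an assumption; mirror whatever format
# the existing cls_ade20k.txt / cls_voc21.txt files use.
custom_classes = ['airstrip', 'sky', 'airplane', 'car', 'grass']

with open('./configs/cls_custom.txt', 'w') as f:
    f.write('\n'.join(custom_classes))

model = dict(
    name_path='./configs/cls_custom.txt',  # reduced vocabulary
    slide_stride=56,
    slide_crop=224,
    pamr_steps=10,
)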
Here are some more details about the hyper-parameters. By default, we resize the images in ADE20k to have a shorter side of 336, and then perform slide inference with a 224-size window and a stride of 112. However, this does not mean the setup is optimal for SCLIP; we use it for fair comparison with the existing baselines. In fact, SCLIP scales very well with image size, so we can improve both qualitative and quantitative results by scaling up the image size and window size, and we can de-noise the segmentation results by reducing the sliding stride and applying PAMR.
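For intuition, the sliding inference itself is conceptually simple. Here is a minimal sketch of the idea (not the exact code in this repo); segment_crop stands in for whatever function produces per-pixel class logits for a single crop, and overlapping windows are simply averaged.

# Sketch of sliding-window inference: a crop-sized window moved with a fixed
# stride over the (already resized) image, overlapping logits averaged.
import numpy as np

def slide_inference(image, segment_crop, num_classes, crop=224, stride=112):
    h, w = image.shape[:2]
    logits = np.zeros((num_classes, h, w), dtype=np.float32)
    counts = np.zeros((1, h, w), dtype=np.float32)

    # Window start positions; the last window is clamped to the image border
    # so the bottom/right edges are always covered.
    def starts(size):
        s = list(range(0, max(size - crop, 0) + 1, stride))
        if s[-1] + crop < size:
            s.append(size - crop)
        return s

    for top in starts(h):
        for left in starts(w):
            bottom, right = min(top + crop, h), min(left + crop, w)
            crop_logits = segment_crop(image[top:bottom, left:right])  # (num_classes, ch, cw)
            logits[:, top:bottom, left:right] += crop_logits
            counts[:, top:bottom, left:right] += 1

    logits /= np.maximum(counts, 1.0)   # average overlapping windows
    return logits.argmax(axis=0)        # per-pixel class indices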
And don't worry, I will add a single-image inference script and its corresponding configurations later.
Hi @wangf3014, thanks for your work. I only get poor performance (22% mIoU on VOC21) when removing the sliding window, which is far from the 59% with sliding windows. Many of the predicted images seem to be all background. Could you please explain this problem? Thanks.
Hi @chenqi1126, thank you for your question. This poor performance is caused by one of the de-noising strategies. To avoid this, you can simply disable the "area_thd" parameter and then you will get 58.5% mIoU with non-sliding inference. Specifically, the model configuration is
model = dict(
    name_path='./configs/cls_voc21.txt',
    logit_scale=65,
    prob_thd=0.1,
    slide_crop=0,
)
The "area_thd" mode is a de-noising strategy for tasks with a background category, where it force parts of the prediction results with a total area less than this threshold to be set as the background category. Without sliding inference, the results are often noisy and there will be many small areas, so the "area_thd" mode is not suitable for this protocol.
It works with your configuration. Thank you for your reply.
Hi @wangf3014, thank you for your nice work! I wonder if the single-image inference script has been added to this repo yet?
Hello,
I am testing the performance on some sample images from ADE20k (e.g. ADE_test_00000002.jpg, ADE_test_00000005.jpg). I am using the vocabulary from "cls_ade20k.txt" for query.
I first checked the raw quality of the dense features without smoothing by setting pamr_steps = 0, slide_crop = 0. The results seem quite noisy. After setting pamr_steps = 1, slide_crop = 227, the results improve, though they are still very different from the demos shown in the paper. I'm not sure if you see similar results on your side, or perhaps I did something wrong during inference. Are there other hyper-parameters I can tune? It would be great if you could provide a script for single-image inference. Thanks!
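In case it helps clarify what I'm after, something along these lines is what I had in mind. This is just a sketch assuming the repo follows the standard MMSegmentation 1.x API (init_model / inference_model); the config path and the custom-module import are placeholders, not actual file names in this repo.

# Hypothetical single-image inference sketch (MMSegmentation 1.x API assumed).
from mmseg.apis import init_model, inference_model, show_result_pyplot

# import sclip_modules  # placeholder: whatever module registers the custom model

config_file = './configs/cfg_ade20k.py'   # placeholder config path
image_file = 'ADE_test_00000002.jpg'

model = init_model(config_file, checkpoint=None, device='cuda:0')  # assuming CLIP weights are loaded by the model itself
result = inference_model(model, image_file)
show_result_pyplot(model, image_file, result, show=False, out_file='pred.png')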