lorenmt / reco

The implementation of "Bootstrapping Semantic Segmentation with Regional Contrast" [ICLR 2022].
https://shikun.io/projects/regional-contrast

foreground & background class #34

Closed HuangBugWei closed 1 year ago

HuangBugWei commented 1 year ago

I'm amazed by your work, especially the design of the query and key sampling! But I'm confused about one part. I've tried the DeepLab series of models, and if I remember correctly, they have a class called background to remove the pixels we are not interested in. I couldn't find that part in your code, and I'm wondering how you run inference when you don't have ground-truth data (such as in the code in `visual.py`). I'm also wondering how this affects training / model performance, since the model would have one extra class even though it isn't counted when estimating the mIoU.

HuangBugWei commented 1 year ago

Or did I miss something? I also remember you discussed removing some fairly common objects and treating them as background.

lorenmt commented 1 year ago

Any pixels that do not contribute to the final mIoU are treated as invalid pixels and will not be sampled as queries or keys (provided such pixels come with this information).

Also, please note that the ReCo loss is only required for training. We don't need the representation branch for inference.
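
For illustration, a minimal sketch of that masking (the tensor names and shapes here are hypothetical, not the repo's actual variables):

```python
import torch

# Hypothetical setup: per-pixel labels flattened to (N,), with -1 marking
# pixels that carry no valid semantic class.
labels = torch.tensor([0, 2, -1, 1, -1, 2])
features = torch.randn(6, 256)  # one 256-d pixel representation each

# Only pixels with a valid label are eligible as ReCo queries/keys.
valid = labels != -1
valid_features = features[valid]
valid_labels = labels[valid]
# Query/key sampling would then draw exclusively from these valid pixels.
```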

HuangBugWei commented 1 year ago

Yeah, I know; Section 3.3 emphasises that there's no additional cost at inference time. But then how do you run inference when there's no labelled data, meaning we can't execute `label_sup[label_tensor == -1] = -1` to remove the pixels we are not interested in?

lorenmt commented 1 year ago

I'm sorry, but I still don't follow your question. By definition, inference means we don't update the network weights; there is only a forward pass to compute the labels from the input image.
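
As a sketch of that forward pass (a placeholder network standing in for a trained model; not the repo's code):

```python
import torch
import torch.nn as nn

# Placeholder standing in for a trained segmentation network.
model = nn.Conv2d(3, 19, kernel_size=1)  # e.g. 19 Cityscapes classes
model.eval()

image = torch.randn(1, 3, 512, 512)      # an unlabelled input image
with torch.no_grad():
    logits = model(image)                # (1, 19, H, W) class scores
    pred = logits.argmax(dim=1)          # (1, H, W): one class per pixel
# No labels are needed: the forward pass alone produces the prediction.
```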

HuangBugWei commented 1 year ago

In your work, it seems you remove the background with `label_sup[label_tensor == -1] = -1`, but that can't happen when we run inference on unlabelled data. For example, if I want to train a model that only has to segment person and dog, and, as in this work, the network only has two classes, how does the network identify pixels that belong to neither person nor dog? Or in your case, how do you get rid of the uninteresting pixels when running inference on the Cityscapes dataset (assuming it has no labelled data)?

lorenmt commented 1 year ago

Ok. I understand your question now...

The pixels with `label_tensor == -1` are those with no clear semantic class definition, as specified by the dataset. We remove these pixels purely for visualisation purposes. In the Cityscapes dataset, these pixels are also excluded when computing the final mIoU.
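
Concretely, that masking might look like the following sketch (hypothetical shapes; it only works because the ground-truth `label_tensor` is available):

```python
import torch

pred = torch.randint(0, 19, (512, 512))           # predicted class map
label_tensor = torch.randint(-1, 19, (512, 512))  # -1 = undefined pixels

# Blank out pixels with no semantic definition before visualising;
# this requires the ground-truth labels, so it is a visualisation-only step.
pred_vis = pred.clone()
pred_vis[label_tensor == -1] = -1
```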

As such, given a new image outside the training dataset and with no prior information, every pixel will be assigned to one of the semantic classes defined in the training dataset. If the defined classes do not include a general background class (such as the one in Pascal VOC), then every pixel will always be assigned to "person" or "dog" as in your example. Ideally, we should always include such a "background" class to mitigate this issue.
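
To illustrate the point (a hypothetical two- vs. three-class segmentation head, not code from this repo):

```python
import torch
import torch.nn as nn

feats = torch.randn(1, 256, 64, 64)  # backbone features for one image

# Two-class head (person, dog): argmax is forced to pick one of the two.
head_two = nn.Conv2d(256, 2, kernel_size=1)
pred_two = head_two(feats).argmax(dim=1)  # values are only ever 0 or 1

# An explicit background class gives "everything else" a valid target.
head_three = nn.Conv2d(256, 3, kernel_size=1)  # background, person, dog
pred_three = head_three(feats).argmax(dim=1)   # pixels can now be background
```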

HuangBugWei commented 1 year ago

Yeah! Thanks a lot for the explanation! I was just afraid there were some critical points I hadn't noticed or understood.