Closed HuangBugWei closed 1 year ago
Or did I miss something? I also remember that you discussed removing some pretty common objects and treating them as background?
Any pixels that do not contribute to the final mIoU are treated as invalid pixels and will not be sampled as queries or keys (when such pixels are annotated with this information).
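To make that concrete, here is a minimal sketch (my own illustration, not the repo's actual sampling code) of restricting query/key sampling to valid pixels, assuming invalid pixels carry the label `-1`:

```python
import numpy as np

def sample_valid_indices(label, num_samples, rng=None):
    """Return flat pixel indices whose label is not -1 (invalid)."""
    rng = rng or np.random.default_rng(0)
    valid = np.flatnonzero(label.ravel() != -1)  # candidate pixels only
    num_samples = min(num_samples, valid.size)
    return rng.choice(valid, size=num_samples, replace=False)

# Toy 2x3 label map: two pixels are invalid (-1).
label = np.array([[0, 1, -1],
                  [-1, 2, 1]])
idx = sample_valid_indices(label, num_samples=3)
# Sampled queries/keys never land on an invalid pixel.
assert np.all(label.ravel()[idx] != -1)
```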
And please note that the ReCo loss is only required during training; we don't need the representation branch for inference.
Yeah, I know; Section 3.3 emphasizes that there's no additional cost at inference time.
So how do we run inference when there's no labeled data, which means we can't execute `label_sup[label_tensor == -1] = -1`
to remove uninteresting pixels?
I am sorry, but I still don't follow your question. By definition, inference does not update the network weights; there is only a forward pass to compute the label from the input image.
And in your work, it seems that you remove the background via `label_sup[label_tensor == -1] = -1`,
but this can't happen when we run inference on unlabeled data.
For example, suppose I want to train a model that only has to segment person and dog, and, as in this work, the network has only those two classes. How does the network identify pixels that belong to neither person nor dog?
Or in your case, how do you get rid of those uninteresting pixels when running inference on the Cityscapes dataset (assuming it doesn't have labeled data)?
Ok. I understand your question now...
The `label_tensor == -1` pixels are those with no clear semantic class, as specifically defined in the dataset, so we remove them purely for visualisation purposes. In the CityScapes dataset, these pixels are also excluded when computing the final mIoU.
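A small self-contained sketch of this convention (my own illustration, not the repo's metric code): an mIoU computation that skips every pixel whose ground-truth label is the ignore value `-1`:

```python
import numpy as np

def miou(pred, target, num_classes, ignore_index=-1):
    """Mean IoU over classes, skipping pixels labelled ignore_index."""
    valid = target != ignore_index        # drop undefined pixels entirely
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:                     # only classes that actually appear
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 1, 1, 0])
target = np.array([0, 1, -1, 0])          # third pixel is undefined
# The mismatch at the ignored pixel does not hurt the score.
assert miou(pred, target, num_classes=2) == 1.0
```

This mirrors the standard `ignore_index` behaviour also used by losses such as `torch.nn.CrossEntropyLoss`.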
As such, given a new image from outside the training dataset and no prior information, every pixel will be assigned to one of the semantic classes defined in the training dataset. If the defined classes do not include a general background class (as in Pascal VOC, for example), then every pixel will always be assigned to "person" or "dog" in your example. Ideally, we should always include such a "background" class to mitigate this issue.
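A tiny illustration of why this happens (my own toy example, not code from the repo): with only two output classes, the per-pixel argmax must pick one of them, no matter how low the logits are.

```python
import numpy as np

# Toy per-pixel logits for a 2-class head (0 = person, 1 = dog);
# there is no background class in the output space.
logits = np.array([[ 2.0, -1.0],
                   [-3.0, -2.5],   # low confidence everywhere
                   [ 0.1,  0.2],
                   [-5.0,  4.0]])
pred = logits.argmax(axis=1)
# Even the low-confidence pixel is forced into "person" or "dog";
# nothing can ever be predicted as "background".
assert set(pred.tolist()) <= {0, 1}
```

Adding a third "background" logit would give such uncertain pixels a class to fall into instead.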
Yeah! Thanks a lot for the explanation! I was just afraid there were some critical points I hadn't noticed or understood.
I'm amazed by your work on the design of query and key sampling! But I'm confused about one part. I've tried the DeepLab series of models, and if I remember correctly, they have a class called background to absorb the pixels we're not interested in. I couldn't find that part in your code, and I'm wondering how you run inference when you don't have the ground-truth data (such as in the code in visual.py). I'm also wondering how it influences training / model performance, since the model will have one more class even though that class is not used to estimate the mIoU.