Closed · zengyuy closed this issue 2 years ago
I've replaced LOST's backbone (basically the DINO weights) with the weights from CLIP, and it did not work well. But when switching back to the DINO weights, both the ViT and ResNet50 backbones generate good feature maps. Why does this happen?
Dear @LiFu2001, thank you for your interest. Yes, we have designed LOST based on DINO representations, which have useful properties for localizing objects (for more details please see our paper and the DINO paper): e.g., they allow foreground to be delineated well from background, and background pixels are highly correlated with one another. We have other experiments in the paper on ImageNet pre-trained models (Table 2). We haven't tried LOST with CLIP, so it is hard to know why it doesn't work. Maybe you could try playing with the hyper-parameters.
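As a quick way to see these properties in practice, here is a minimal sketch (not taken from the LOST repository) that loads a DINO ViT-S/16 from torch.hub and inspects patch-feature correlations. Note that LOST itself works on the keys of the last attention layer; this sketch uses the block outputs returned by `get_intermediate_layers` as a simpler stand-in:

```python
import torch
import torch.nn.functional as F

# Load a DINO ViT-S/16 backbone (as published by the DINO authors on torch.hub).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

img = torch.randn(1, 3, 224, 224)  # stand-in for a real, normalized image
with torch.no_grad():
    # get_intermediate_layers returns the token outputs of the last n blocks;
    # drop the CLS token to keep only the 14x14 = 196 patch features.
    feats = model.get_intermediate_layers(img, n=1)[0][:, 1:, :]  # (1, 196, 384)

f = F.normalize(feats[0], dim=-1)
sim = f @ f.t()  # cosine similarity between all patch pairs

# On natural images, background patches tend to correlate positively with many
# other patches (high "degree"), while foreground patches correlate with few.
print((sim >= 0).sum(dim=1))
```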
OK, thanks! I also wonder what led to the design of LOST? From the paper it seems to be purely empirical and to lack theoretical corroboration. Is there a more general approach that applies not only to DINO pre-trained features but also to those from other self-supervised methods?
Another question concerns inconsistencies I've found between the code in object_discovery.py and Eqs. (1), (2), and (3) in the paper. In the paper the elements of the matrices are either 0 or 1, which is not the case in the code (it just sums the elements without setting them to 0 or 1). Why?
Dear @LiFu2001, as mentioned above and in our paper, LOST was designed around the properties of DINO's representations (they allow foreground to be delineated well from background, and background pixels are highly correlated). Depending on the type of backbone and the type of supervision used to train the network (e.g., supervised classification, self-supervised instance discrimination, self-supervised feature reconstruction, rotation estimation, etc.), the properties of the learned representations can vary. To the best of our knowledge, there is currently no clear theory about self-supervised representation statistics that generalizes over pre-text tasks and backbones. We would be interested if you know of any work/direction.
Regarding your question about the binary matrices of Eqs. (1), (2), and (3): we do not store them in a variable. However, they are built on the fly with A > threshold before being used (see l61 and l35 in object_discovery.py).
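To make the correspondence concrete, here is a minimal sketch of the idea (function and variable names are illustrative, not the repository's): the comparison itself produces the 0/1 entries of the paper's matrix as a boolean tensor, and summing that boolean tensor is equivalent to summing the binary matrix, so no separate 0/1 tensor needs to be materialized:

```python
import torch

def select_seed(feats):
    """Sketch of Eqs. (1)-(3); `feats` is an (N, d) tensor of patch features."""
    # Eq. (1): binary similarity matrix, a_pq = 1 iff f_p . f_q >= 0.
    # The comparison yields a boolean tensor, i.e. exactly the 0/1 entries.
    A = (feats @ feats.t()) >= 0
    # Eq. (2): degree of each patch = number of positively correlated patches.
    # Summing the boolean matrix equals summing the 0/1 matrix of the paper.
    d = A.sum(dim=1)
    # Eq. (3): the seed is the patch with the lowest degree, i.e. the one
    # least correlated with the rest of the image (likely foreground).
    return torch.argmin(d)
```

In other words, the binary matrix from the paper does appear in the code, just as the intermediate boolean result of the thresholding, which is consumed immediately by the sum.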