Open BAOOOOOM opened 5 months ago
Thanks for your interest in our study!
As described in the paper, our method consists of two training stages: (1) learning the prototypes and (2) leveraging the (fixed) prototypes for visual reasoning. In the first stage, the prototypes (represented as the weights of linear layers) are randomly initialized and then trained on a multi-label object classification task, where predictions are computed from an adaptive composition of prototypes. In this way, we learn prototypes representing various objects. For more details, please refer to the "proto_learning" folder.
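As a rough illustration of the stage-1 setup, here is a minimal sketch of prototypes stored as a weight matrix and combined adaptively. The softmax-style composition weights and all names/sizes are assumptions for illustration; the actual composition and classifier head are in the `proto_learning` code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proto, feat_dim = 16, 32  # hypothetical sizes

# Prototypes live as the weight matrix of a linear layer,
# randomly initialized before stage-1 training.
prototypes = rng.standard_normal((n_proto, feat_dim))

def compose(obj_feat):
    """Adaptive composition (assumed form): soft weights over prototypes
    from feature-prototype similarities, then a weighted sum of prototypes."""
    sim = prototypes @ obj_feat            # (n_proto,) similarities
    alpha = np.exp(sim - sim.max())
    alpha /= alpha.sum()                   # softmax composition weights
    return alpha @ prototypes              # composed representation, (feat_dim,)

rep = compose(rng.standard_normal(feat_dim))
```

During stage 1 these weights would be updated by the multi-label classification loss, so the random initialization gradually turns into object-like prototypes.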
The results in Table 4 are computed in three steps: (1) converting the object features for the bounding boxes in the GQA dataset (each bounding box is associated with an object label) into probability distributions over the prototypes (the normalized dot product between an object feature and all prototypes), then averaging the distributions over all instances of each object label; (2) applying an unsupervised clustering algorithm (we simply used K-means) to the averaged distributions; (3) investigating the characteristics of the objects in each cluster.
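The three steps above can be sketched as follows. Note the softmax here is only one possible reading of "normalized dot product", and the toy labels, feature sizes, and the tiny inlined K-means loop are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_proto, feat_dim = 8, 16
prototypes = rng.standard_normal((n_proto, feat_dim))

def proto_distribution(obj_feat):
    """Step (1): normalized dot product between one object feature and
    all prototypes -> probability distribution over prototypes."""
    logits = prototypes @ obj_feat
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Average the distribution over all instances of each object label (toy data).
feats_by_label = {"dog": rng.standard_normal((5, feat_dim)),
                  "car": rng.standard_normal((3, feat_dim)),
                  "cat": rng.standard_normal((4, feat_dim))}
avg_dist = {lbl: np.mean([proto_distribution(f) for f in fs], axis=0)
            for lbl, fs in feats_by_label.items()}

# Step (2): K-means on the per-label distributions (minimal 2-cluster loop).
X = np.stack(list(avg_dist.values()))
centers = X[:2].copy()
for _ in range(10):
    assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    for k in range(2):
        if (assign == k).any():
            centers[k] = X[assign == k].mean(axis=0)

# Step (3): inspect which object labels landed in each cluster.
clusters = {k: [lbl for lbl, a in zip(avg_dist, assign) if a == k]
            for k in range(2)}
```

In practice one would use an off-the-shelf K-means (e.g. scikit-learn) rather than this inline loop.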
I am very interested in your research. While going through the code, I encountered some issues. First, in the prototype decomposition part, is there no corresponding design for decomposing prototypes on the language (NLP) side? Another question concerns `proto_learning/dataloader.py`: `self.obj2idx = json.load(open(os.path.join(data_dir, 'obj2idx_gqa.json')))`. I couldn't find the code that generates the 'obj2idx_gqa.json' file.
You are right that there are no prototypes on the language modality, since the study focuses on learning compositional visual representations. The "obj2idx_gqa" file can be downloaded from the preprocessed-annotation link. These annotations are derived from the raw GQA annotations to formulate the multi-label object classification task (i.e., simultaneously predicting all objects within an image); for instance, "obj2idx_gqa" is a dictionary mapping object labels to indices.
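For intuition, such a label-to-index mapping and the multi-hot targets it supports can be reconstructed in a few lines. The label set here is a toy placeholder, not the real GQA vocabulary:

```python
# Hypothetical reconstruction: a dict like obj2idx_gqa maps object labels
# to contiguous indices, enabling multi-hot targets for the
# multi-label object classification task.
labels = ["person", "car", "dog"]  # toy label set, not the GQA vocabulary
obj2idx = {lbl: i for i, lbl in enumerate(sorted(set(labels)))}

def multi_hot(objects_in_image, obj2idx):
    """Target vector: 1 for every object present in the image, else 0."""
    t = [0] * len(obj2idx)
    for o in objects_in_image:
        t[obj2idx[o]] = 1
    return t

print(multi_hot(["dog", "car"], obj2idx))  # → [1, 1, 0]
```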
Thank you very much for your work; it was very interesting! But I'm curious about how the prototypes are learned. The paper says the prototypes are learnable, and they are represented as a linear layer in the code. However, since their initial state is essentially random, do they simply emerge gradually through training? Why does learning in this way yield object representations like those in Table 4?