Hi! Thanks for releasing the code. This is very helpful!
I'm particularly interested in how your model is applied to the public dataset BraTS2019. Could you please provide more details on how the experiments were conducted?
I skimmed through the training code, and it looks like there is a classification layer (`nn.Linear`) whose output size is fixed at training time, so the pre-trained model outputs scores for the 13 categories used in the training data. The input to this layer is the disease-attentive embedding h, as illustrated in Fig. 2 and described in Section 3.4.3. This means the pre-trained model does not generate tokens as a final or intermediate output. The paper, however, states:
> During inference phase, we can directly manipulate the diseases of interest with the proper description in Q, and forward to UniBrain to have the predictions.
Wouldn't these predictions still be bound to the 13 categories used in the pre-trained model? Simply dropping the last layer doesn't seem helpful either, since the output would then just be the embedding h. What needs to be changed in the model to perform zero-shot classification on BraTS2019? My current reading of the classification head is sketched below.
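For reference, this is my (possibly mistaken) reading of how the pre-trained head is wired; the class and attribute names are placeholders of mine, not identifiers from your repository:

```python
import torch.nn as nn

# Placeholder sketch of my understanding of the pre-trained classifier.
# The disease-attentive embedding h (shape [B, D]) is projected by a single
# nn.Linear whose output size is fixed to the 13 training categories.
class FixedDiseaseHead(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int = 13):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, h):
        return self.classifier(h)  # [B, 13] logits, one score per training label
```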
The model requires text input in the Coupled Vision-Language Perception module, but BraTS2019 contains no text data. In that case, what disease descriptions Q were used during inference?
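Purely as my own guess, I imagine the descriptions would have to be hand-written prompts for the classes of interest, something along these lines (the wording and class placeholders here are hypothetical, not from your repo):

```python
# Hypothetical disease descriptions Q for BraTS2019 inference (my guess only;
# the actual class names and wording would come from your setup).
brats_queries = [
    "A brain MRI showing <description of the first class>.",
    "A brain MRI showing <description of the second class>.",
]
```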
Regarding fine-tuning, is it correct that you replaced the 13-class classification layer with a 2-class one (since BraTS2019 has 2 classes) and ran a couple of epochs to fine-tune the weights in all layers except the text encoder? Or was the text encoder updated as well?
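For concreteness, here is a rough sketch of what I imagine that fine-tuning setup to look like; `load_pretrained_unibrain`, `classifier`, and `text_encoder` are placeholder names of mine, not your actual API:

```python
import torch
import torch.nn as nn

# Rough sketch of my guess at the fine-tuning procedure (placeholder names).
model = load_pretrained_unibrain()  # hypothetical loader for the released weights

# Swap the 13-class head for a 2-class one.
model.classifier = nn.Linear(model.classifier.in_features, 2)

# Freeze the text encoder -- or not, which is exactly what I'm asking about.
for p in model.text_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ...then fine-tune all trainable layers on BraTS2019 for a couple of epochs.
```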