Very good work. I have some questions about your work.
1) My understanding is that the region selection part filters the 29 visual features and only keeps those that need to appear in the report, is that correct?
2) How does the abnormality classification work for report generation? I see that in the architecture diagram it only has an input during training and no output.
3) What exactly are the token embeddings in the architecture diagram for report generation, and what is their source?
I would be very grateful to receive your reply!
Exactly, the object detector outputs region visual features of shape [29, 1024] (i.e. a feature vector of dimension 1024 for every detected region) and the region selection module then selects those it deems important. So the output of the region selection module could e.g. be [5, 1024] for 5 selected regions that will be described in the report.
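Conceptually, it behaves like a per-region binary classifier over the detector features. Here is a minimal sketch of that idea (module and variable names are mine for illustration, not the actual implementation):

```python
import torch
import torch.nn as nn

# Illustrative sketch: a per-region "describe this region?" classifier that
# keeps only the region visual features predicted to appear in the report.
class RegionSelector(nn.Module):
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 1)  # one selection logit per region

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: [29, 1024] visual features from the object detector
        probs = torch.sigmoid(self.classifier(region_feats)).squeeze(-1)  # [29]
        keep = probs > 0.5              # boolean mask over the 29 regions
        return region_feats[keep]       # e.g. [5, 1024] if 5 regions are selected

selector = RegionSelector()
selected = selector(torch.randn(29, 1024))
print(selected.shape)  # [k, 1024] for the k selected regions
```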
At training time, the abnormality classification module is trained end-to-end together with the object detector such that the object detector learns to encode “abnormality” or “pathology” information in the region visual features. This is because the abnormality classification module relies on this encoded information for accurate abnormal region identification. At inference time, the region visual features will thus contain this “abnormality” or “pathology” information, which will be used by the region selection module to select these important abnormal regions. So in short, the abnormality classification module is only used to strengthen the “pathology” signal in the region visual features outputted by the object detector.
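To make the auxiliary role concrete, here is a rough sketch of how such a head could look (names and the exact loss are assumptions for illustration): it only contributes a training loss that pushes gradients back into the detector, and it is simply skipped at inference.

```python
import torch
import torch.nn as nn

# Illustrative auxiliary "abnormal?" head on top of the region visual features.
# During training its loss is added to the other losses so the detector learns
# to encode pathology information; at inference the head produces no output.
class AbnormalityClassifier(nn.Module):
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, region_feats: torch.Tensor, abnormal_labels=None):
        logits = self.head(region_feats).squeeze(-1)   # [29] abnormality logits
        if abnormal_labels is None:
            return None                                # inference: head is unused
        return nn.functional.binary_cross_entropy_with_logits(
            logits, abnormal_labels.float())           # training: auxiliary loss
```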
Let's assume we are at training time and the current chest X-ray image we are training with is described by 5 regions in the reference report, i.e. we have 5 ground-truth sentences for 5 regions. To train the language model, we need to tokenize and embed these ground-truth sentences. These are the token embeddings "Y" (of dimension 1024) shown in the yellow box in Fig. 2. Using a technique called pseudo self-attention (see Eq. 2), we train the language model to generate descriptive text considering both the region visual features "X" and their corresponding token embeddings "Y".
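For intuition, a simplified single-head version of pseudo self-attention could look like the sketch below (parameter names are illustrative, and the causal mask used during language modeling is omitted for brevity): the queries come from the token embeddings "Y", while the region visual features "X" are projected and prepended to the keys and values.

```python
import math
import torch
import torch.nn as nn

# Single-head sketch of pseudo self-attention: the visual features X are
# injected into the keys/values of the language model's self-attention,
# while the queries still come only from the token embeddings Y.
class PseudoSelfAttention(nn.Module):
    def __init__(self, d: int = 1024):
        super().__init__()
        self.w_q = nn.Linear(d, d)
        self.w_k = nn.Linear(d, d)
        self.w_v = nn.Linear(d, d)
        self.u_k = nn.Linear(d, d)  # extra key projection for visual features
        self.u_v = nn.Linear(d, d)  # extra value projection for visual features
        self.d = d

    def forward(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        # X: [n_regions, 1024] selected region visual features
        # Y: [n_tokens, 1024] token embeddings of the ground-truth sentence
        q = self.w_q(Y)
        k = torch.cat([self.u_k(X), self.w_k(Y)], dim=0)
        v = torch.cat([self.u_v(X), self.w_v(Y)], dim=0)
        attn = torch.softmax(q @ k.T / math.sqrt(self.d), dim=-1)
        return attn @ v  # [n_tokens, 1024] text features conditioned on X
```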
I hope this helps clarify your queries. Please feel free to ask any more questions you might have!