Hi there,
Your concerns touch on points we anticipated and discuss in detail in the supplementary materials of our paper. I would like to direct you to our ablation study (A.1) and the detailed section on the region selection module (A.3); they can be found on pages 12 and 13 of our arXiv submission, respectively.
If you still have questions after reading these two sections, please don't hesitate to reach out again. I'm more than happy to clarify any lingering doubts or engage in further discussion to ensure a comprehensive understanding.
Best, Tim
Thank you for your response. I went through the supplementary materials, and my concerns have been largely addressed. There are two points I'd like to confirm:
"Precision is 1.0 for abnormal regions since by default abnormal regions are always included in reference reports." I have some doubts about this statement, why are abnormal regions always included in reference reports by default?
2. I completely agree with the statement in the paper, "Thus, we believe that this rather subjective decision cannot be learned by a model and a low precision score for normal regions is expected." I've had this question for a while but haven't come across a clear explanation in previous papers. Given this phenomenon, might the current NLP evaluation metrics be somewhat unsuitable for report generation tasks? In other words, are our current efforts to improve performance mainly focused on classifying abnormal regions?
This is indeed an excellent piece of work, highly interpretable, and it has been very helpful for me. Thank you again!
I now understand question 1; sorry, I had misunderstood that part of your paper earlier. Would you like to share your views on question 2? 😊
Hi, sorry I was in the middle of writing my answer earlier and accidentally clicked on "Close with comment". I will give you my full answer soon.
As you correctly wrote in your original question, the radiologist's decision to describe normal (pathology-free) anatomical regions in a report is highly subjective/random. However, the reporting of abnormal regions is not bound by such subjectivity. Radiologists are ethically and professionally obligated to report any identified abnormalities, to ensure optimal patient care. As such, these abnormal regions are consistently included in reference reports. When assessing the region selection module for only abnormal regions (as illustrated in table 6), we find a precision of 1.0. This is because, in this context, false positives — instances where a normal region is erroneously reported as abnormal — are non-existent (as we only consider abnormal regions).
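To make that concrete, here is a small numeric sketch (illustrative only, not the evaluation code from our repo) of why restricting the precision computation to abnormal regions forces it to 1.0:

```python
# Minimal sketch (not the repo's evaluation code): precision of the region
# selection module, restricted to abnormal regions.
# For each region: is_abnormal, selected (model output), in_reference (ground truth).
regions = [
    {"is_abnormal": True,  "selected": True,  "in_reference": True},   # abnormal regions are
    {"is_abnormal": True,  "selected": False, "in_reference": True},   # always in the reference
    {"is_abnormal": False, "selected": True,  "in_reference": False},  # normal region: subjective
    {"is_abnormal": False, "selected": False, "in_reference": True},
]

abnormal = [r for r in regions if r["is_abnormal"]]
tp = sum(r["selected"] and r["in_reference"] for r in abnormal)
fp = sum(r["selected"] and not r["in_reference"] for r in abnormal)  # always 0 here

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
print(precision)  # 1.0, because "in_reference" is True for every abnormal region
```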
Current NLG metrics might not be the best fit for evaluating medical report generation. As you highlighted, one of the reasons is the subjective choices radiologists make, such as their decision to mention or omit descriptions of normal anatomical regions. Take the ROUGE-L metric, for instance. Originally designed to measure the quality of text summaries, it may penalize reports that correctly describe more healthy anatomical areas than reference reports do, leading to lower scores. Yet from a clinical perspective, this isn't necessarily an issue.
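As a small illustration (using the rouge_score package; the sentences and numbers are made up for demonstration, not taken from our evaluation), a generated report that adds a correct sentence about a normal region scores lower than one that omits it:

```python
# Illustration only: ROUGE-L penalizes extra (but clinically correct) sentences
# about normal regions that the reference report happens to omit.
from rouge_score import rouge_scorer

reference = "There is a left-sided pleural effusion."
minimal   = "There is a left-sided pleural effusion."
verbose   = ("There is a left-sided pleural effusion. "
             "The heart size is normal. The mediastinum is unremarkable.")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(scorer.score(reference, minimal)["rougeL"].fmeasure)   # 1.0
print(scorer.score(reference, verbose)["rougeL"].fmeasure)   # < 1.0: precision drops
```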
I see NLG metrics as "legacy" metrics that were used to evaluate medical report generation mainly due to the lack of better, domain-specific alternatives back then. Essentially, text generation was the closest parallel to medical report generation. However, in my opinion, it's becoming clear that a more specialized approach, like the CE metrics, might be better. These metrics focus on what's truly pivotal in clinical practice: ensuring reports accurately mention pathologies.
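For comparison, the idea behind CE metrics can be sketched like this (the label vectors are made up for illustration; in practice they would come from a labeler such as CheXbert):

```python
# Rough sketch of the idea behind CE metrics: compare pathology labels extracted
# from generated vs. reference reports, instead of measuring raw text overlap.
# The label vectors below are made up (one binary label per pathology).
from sklearn.metrics import precision_recall_fscore_support

labels_reference = [1, 0, 0, 1, 0]  # pathologies present in the reference report
labels_generated = [1, 0, 1, 1, 0]  # pathologies mentioned in the generated report

p, r, f1, _ = precision_recall_fscore_support(
    labels_reference, labels_generated, average="binary", zero_division=0
)
print(f"CE precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
```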
"In other words, are our current efforts to improve performance mainly focused on classifying abnormal regions?"
Could you elaborate on what you mean by this?
What I meant is that, under the current NLP evaluation metrics, focusing on the abnormal regions seems to yield faster improvements in scores. (This is because the descriptions of abnormalities in the ground truth are consistently present and accurate, while descriptions of normal regions are more subjective.)
Thank you for your response, I completely understand! 😊🌹
Sorry to bother you again. Could you please let me know how to obtain the sentences corresponding to the region visual features? Are they from the Chest ImaGenome v1.0.0 dataset?
If you take a look at this script for generating reports for chest X-ray images, you can see in the function get_report_for_image how an image tensor is transformed into output ids (line 109), which are then decoded by the tokenizer into sentences for each selected region (line 118).
The language model was trained on sentences from Chest ImaGenome.
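In case it helps, here is a simplified, self-contained sketch of the decode step; the "gpt2" checkpoint and the example token ids are stand-ins for illustration, not what the script actually loads:

```python
# Simplified sketch of the decode step (line 118): generated token ids ->
# one sentence per selected anatomical region. The "gpt2" checkpoint is only a
# stand-in; the repo uses its own GPT-2 tokenizer/checkpoint.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Stand-in for the `output_ids` produced in get_report_for_image (line 109):
# one row of token ids per selected anatomical region.
output_ids = [
    tokenizer.encode("The cardiac silhouette is enlarged."),
    tokenizer.encode("There is a small left pleural effusion."),
]

sentences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(sentences)  # one decoded sentence per selected region
```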
Hi, this is really nice work! I have a question I'd like to ask.
The Binary Classifier (the green part in Fig. 2) in the Region Selection module is used to identify salient regions and is supervised with the loss function L_select. Is this type of classifier effective? Looking through the reports in the dataset, I've noticed that the anatomical regions mentioned vary from report to report and sometimes seem almost random (e.g., when doctors describe normal findings, their choice of which anatomical regions to mention is highly subjective).
Therefore, I'm unsure about the guiding role of this binary classifier during inference. Do you have any insights on this? I'm not sure whether my understanding is correct.
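For reference, this is roughly how I picture the selection head being trained, just a minimal sketch with made-up shapes and names, not code from your repo:

```python
# Minimal sketch (made-up shapes/names, not the actual repo code) of how I
# picture the binary region selection head: one logit per anatomical region,
# supervised with a binary cross-entropy loss (L_select) against
# "is this region mentioned in the reference report?".
import torch
import torch.nn as nn

num_regions, feat_dim = 29, 1024                              # e.g. 29 regions as in Chest ImaGenome
region_features = torch.randn(8, num_regions, feat_dim)       # batch of region visual features
selection_targets = torch.randint(0, 2, (8, num_regions)).float()  # 1 = mentioned in reference

selection_head = nn.Linear(feat_dim, 1)                       # binary classifier per region
logits = selection_head(region_features).squeeze(-1)          # (batch, num_regions)

loss_select = nn.BCEWithLogitsLoss()(logits, selection_targets)

# At inference, regions with sigmoid(logit) > 0.5 would be selected and a
# sentence generated for each of them.
selected = torch.sigmoid(logits) > 0.5
```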