wang-zhanyu / R2GenGPT

Radiology Report Generation with Frozen LLMs
BSD 3-Clause "New" or "Revised" License

Preprocessed MIMIC-CXR annotation seems to be different from others' #7

Open passby111 opened 8 months ago

passby111 commented 8 months ago

Your preprocessed MIMIC-CXR annotation contains both the impression and findings sections. However, other settings usually use only the findings section of a report, such as the R2Gen model. I ran your model on annotations containing only findings, and the result is lower than others'. Can you explain? I'm not sure if I made a mistake somewhere.

wang-zhanyu commented 8 months ago

Thanks for your interest. We followed the official preprocessing code for parsing the reports, which contain both the impression and findings sections. Since both sections are crucial for a complete report, we retained them in their entirety. Regarding the experimental phenomenon you observed, could you please share the results and specify which "others" you are referring to?
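For intuition, here is a minimal sketch of what "keeping both sections" means. The actual annotations were produced with the official MIMIC-CXR report-parsing code, not this snippet, so the regex and function names below are purely illustrative:

```python
import re

def extract_section(report_text, section_name):
    """Pull one section (e.g. FINDINGS or IMPRESSION) out of a raw report."""
    # Capture everything after "SECTION:" up to the next ALL-CAPS header or end of text
    pattern = rf"{section_name}:(.*?)(?=\n[A-Z ]+:|\Z)"
    match = re.search(pattern, report_text, flags=re.S | re.I)
    return match.group(1).strip() if match else ""

def build_target_report(report_text):
    findings = extract_section(report_text, "FINDINGS")
    impression = extract_section(report_text, "IMPRESSION")
    # Keep both sections, since each carries complementary diagnostic content
    return " ".join(s for s in (findings, impression) if s)
```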

passby111 commented 8 months ago

Thanks for your response! For example, R2Gen is a well-known baseline, and I notice you directly adopted the results reported in their paper. However, they predicted only the findings section rather than findings plus impression. Wouldn't it be unfair to compare directly like this? Also, I tried your model R2GenGPT on reports with only the findings section, and the result is lower than yours based on findings + impression.

passby111 commented 8 months ago

This is the result of using your code to predict only the findings section; the CIDEr is a little lower.

Test result of /home/xxx/R2GenGPT-main/save/mimic_cxr/v1_deep_finding/checkpoints/checkpoint_epoch4_step112827_bleu0.132891_cider0.201495.pth:
{'Bleu_1': 0.40156787140109074, 'Bleu_2': 0.24777515566553326, 'Bleu_3': 0.16587956792259043, 'Bleu_4': 0.11786933490873241, 'ROUGE_L': 0.27735867532555053, 'METEOR': 0.15643617274462107, 'CIDEr': 0.1991703269894305}

And this is the result of using your code to predict both the findings and impression sections, as you did; it is consistent with the results in your paper.

Test result of /home/xxx/R2GenGPT/save/mimic_cxr/v1_test2_deep/checkpoints/checkpoint_epoch8_step135396_bleu0.190921_cider0.370094.pth:
{'Bleu_1': 0.4128636987578053, 'Bleu_2': 0.2695699159478738, 'Bleu_3': 0.18779073357219533, 'Bleu_4': 0.13655461790879653, 'ROUGE_L': 0.2974765397369367, 'METEOR': 0.1622455704235942, 'CIDEr': 0.26411428019543615}
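For reference, scores in this format can be computed with pycocoevalcap. This is a rough sketch of my own setup, not necessarily how this repo's evaluation is implemented:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor  # requires Java on the system
from pycocoevalcap.cider.cider import Cider

def score_reports(hypotheses, references):
    """hypotheses/references: parallel lists of generated and reference report strings."""
    # pycocoevalcap expects dicts mapping an example id to a list of strings
    gts = {i: [ref] for i, ref in enumerate(references)}
    res = {i: [hyp] for i, hyp in enumerate(hypotheses)}

    scores = {}
    bleu, _ = Bleu(4).compute_score(gts, res)
    for n, b in enumerate(bleu, start=1):
        scores[f"Bleu_{n}"] = b
    scores["ROUGE_L"], _ = Rouge().compute_score(gts, res)
    scores["METEOR"], _ = Meteor().compute_score(gts, res)
    scores["CIDEr"], _ = Cider().compute_score(gts, res)
    return scores
```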

wang-zhanyu commented 8 months ago

Thank you for pointing out this issue. As we have mentioned, our work processes the official dataset, which includes both the impression and findings sections. The impression is a crucial part of a diagnostic report, and we should not omit it just because an earlier work did not use it.

Additionally, we would like to clarify that in the results reported in our paper, any method that was not fairly reproduced is marked with a dagger symbol. Many works do not release their code, and in my experience even small differences in data preprocessing can change the outcomes, so it is hard to guarantee absolute fairness when comparing against them. You can refer to the methods without the dagger symbol; these are the ones we replicated ourselves under the same experimental setup, so they can be compared more fairly.

Still, your point is worth noting. We retrained our model using only the findings section and obtained the following results: {'Bleu_1': 0.404, 'Bleu_2': 0.252, 'Bleu_3': 0.169, 'Bleu_4': 0.121, 'ROUGE_L': 0.277, 'METEOR': 0.155, 'CIDEr': 0.209}. If you prefer to use only findings, you are welcome to use this result.

passby111 commented 8 months ago

Thank you for your attention and for retraining the model on the findings section; I am happy to use your result. I also agree that "different data preprocessing can result in variations in outcomes".

In addition, I have another question about your CE results. I used https://github.com/stanfordmlgroup/chexpert-labeler to label the test references and test predictions, and https://github.com/zhjohnchan/R2Gen/blob/main/compute_ce.py to calculate the metrics. The result for predicting findings and impression is:
{'F1_MACRO': 0.23245815170872425, 'F1_MICRO': 0.409047345217558, 'PRECISION_MACRO': 0.33660613529579375, 'PRECISION_MICRO': 0.4855885922330097, 'RECALL_MACRO': 0.22065983503584846, 'RECALL_MICRO': 0.35335025941053094}
This result seems to be wrong. May I ask how you implemented the calculation at the time? I would greatly appreciate your help. Thank you!
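For context, this is roughly how I compute the CE metrics once the CheXpert labeler has produced the two labeled CSVs. The file names and the choice to count only a label of 1 as positive are assumptions on my part, and a different handling of the uncertain (-1) label would change the numbers:

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# The 14 CheXpert observation columns produced by the labeler
LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]

def binarize(csv_path):
    df = pd.read_csv(csv_path)
    # CheXpert labels: 1 (positive), 0 (negative), -1 (uncertain), NaN (not mentioned);
    # here only 1 is treated as positive, everything else as negative.
    return (df[LABELS].fillna(0) == 1).astype(int).values

y_true = binarize("labeled_reference_reports.csv")   # placeholder file names
y_pred = binarize("labeled_generated_reports.csv")

for avg in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```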

passby111 commented 8 months ago

Hi, could you also share the CE results for your model when predicting only the findings section? I expect this result will also differ from the results in your paper, which predict both the impression and findings. Thanks.