sangjoon-park / Medical_X-VL


Report generation results #1

Open PabloMessina opened 1 year ago

PabloMessina commented 1 year ago

Hi,

I have a few questions about the report generation results reported in the paper (https://arxiv.org/ftp/arxiv/papers/2208/2208.05140.pdf), which can be found in Table 2, specifically regarding the clinical evaluation (using the CheXpert labeler).

  1. Are these micro average or macro average results? If micro, do you have the macro average results as well? (A short sketch of what I mean by the distinction is below.)
  2. MedViLL obtained 71.1 (0.5), 58.0 (0.7), 53.3 (0.9) in accuracy, precision, and recall respectively. However, in Table 5 of MedViLL's paper (https://arxiv.org/pdf/2105.11333.pdf) they report 84.1, 69.8, 55.9, 62.1 in accuracy, precision, recall, and F1-score respectively. Do you know what might explain the difference in results?
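
To make concrete what I mean by micro vs. macro in question 1, here is a minimal sketch over 14 binary CheXpert labels (the arrays below are random placeholders, not your results):

```python
# Illustrative only: micro vs. macro averaging over 14 binary CheXpert labels.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_reports, n_labels = 100, 14              # 14 CheXpert observation labels
y_true = rng.integers(0, 2, size=(n_reports, n_labels))
y_pred = rng.integers(0, 2, size=(n_reports, n_labels))

# Micro: pool every (report, label) decision into one confusion matrix.
micro_p = precision_score(y_true, y_pred, average="micro", zero_division=0)
micro_r = recall_score(y_true, y_pred, average="micro", zero_division=0)

# Macro: compute the metric per label, then take the unweighted mean,
# so rare labels count as much as frequent ones.
macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)

# Per-label values (what I mean by "results per each of the 14 labels").
per_label_p = precision_score(y_true, y_pred, average=None, zero_division=0)
print(micro_p, micro_r, macro_p, macro_r, per_label_p)
```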

Thank you very much.

sangjoon-park commented 1 year ago

Thank you for your interest in our work.

  1. The reported results are the micro average results. If you want to see the macro average results, I'll calculate them and let you know.
  2. If you look at MedViLL's GitHub, only the vision-language pre-trained weights are provided, so the downstream model has to be fine-tuned from these VLP weights. We fine-tuned the MedViLL generation model with the default hyperparameters they provide and obtained results similar to those reported in their paper (e.g., 84.1, 69.8, 55.9, 62.1). However, on reviewing the code, we found that MedViLL's inference code for generation feeds the "ground-truth (label)" previous words into the language model (next-word prediction) during inference, instead of the words the model has already generated before the target position. Since we think generation should start from the [BOS] token alone and condition on the previously "generated" words, we changed MedViLL's generation code to follow this greedy decoding strategy. With this implementation, the performances were 73.7 (0.4), 62.3 (0.5), and 57.4 (0.8), as reported in our paper. (A simplified sketch of the two inference modes is below.)
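
To clarify the difference between the two inference modes, here is a simplified sketch (the `decoder_step` function is hypothetical, not the actual MedViLL code; it is assumed to return the id of the most probable next token given the image and a token prefix):

```python
def teacher_forced_inference(image, gt_tokens, decoder_step, eos_id):
    # Problematic setting: at every step the *ground-truth* prefix is fed in,
    # so each prediction is conditioned on label tokens that would not be
    # available at test time.
    preds = []
    for t in range(1, len(gt_tokens)):
        next_id = decoder_step(image, gt_tokens[:t])  # prefix = ground truth
        preds.append(next_id)
        if next_id == eos_id:
            break
    return preds


def greedy_decoding(image, bos_id, eos_id, decoder_step, max_len=128):
    # Fair setting: start from [BOS] only and feed back the model's own
    # previously generated tokens at every step.
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = decoder_step(image, tokens)         # prefix = own outputs
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]
```
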
PabloMessina commented 1 year ago

Thank you for your very prompt reply.

  1. Yes, please. I think having the macro average results, or even better the results for each of the 14 CheXpert labels, would provide a more thorough understanding of the model's strengths and weaknesses. Are these results computed on the MIMIC-CXR test set?
  2. Oh, I see. I didn't know they were doing that. I agree that's kind of "cheating". I originally thought MedViLL was a very powerful method, given the very high metrics they reported compared to other works, but feeding ground-truth tokens during inference instead of doing normal greedy decoding will certainly inflate the metrics and make the model look better than it actually is when no ground truth is available. Did you talk to MedViLL's authors about this?

sangjoon-park commented 1 year ago

I have just checked the MedViLL code again and found that it was updated recently, so I am not sure whether the ground-truth tokens are still used during generation in the current version. I will check, and if they are no longer used, I think we should update the MedViLL results in our paper (and our own results, if needed) so that both are evaluated under fair conditions.

Thank you for letting me know.

PabloMessina commented 1 year ago

Perfect, I look forward to your updated results. Thank you very much.