ttanida / rgrg

Code for the CVPR paper "Interactive and Explainable Region-guided Radiology Report Generation"
MIT License

How are example-based averages calculated? #4

Closed PabloMessina closed 1 year ago

PabloMessina commented 1 year ago

Hey, congratulations on the awesome work.

I have a question about the metrics calculation. How is the example-based average calculated? What is the difference between micro average, macro average, and example-based average?

Lastly, do you have macro-average results for the Clinical Efficacy metrics over the 14 labels?

Thank you very much.

ttanida commented 1 year ago

Hello Pablo,

sure, I'd be happy to explain these concepts!

Example-based Average:

Example-based averaging treats each sample (or example) in the dataset independently and then averages over all samples. In our case, a sample corresponds to a report, and we used all 14 conditions to evaluate a report (hence P_ex-14, R_ex-14, and F1_ex-14 in Table 2 of the main paper). That is, we compared each generated report with the corresponding reference report regarding the presence/absence of the 14 conditions, computed precision, recall, and F1 for each report individually, and then averaged these scores over all reports.

You can see exactly how we do this in our compute_example_based_CE_scores function.
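To make the idea concrete, here is a minimal, simplified sketch (not the actual `compute_example_based_CE_scores` implementation), assuming the CheXbert outputs have already been binarized into `(num_reports, 14)` arrays of 0/1 labels:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def example_based_ce_scores(pred_labels, gt_labels):
    """Simplified sketch of example-based averaging.

    pred_labels / gt_labels: binary arrays of shape (num_reports, 14),
    one row per report, one column per condition (assumed preprocessing).
    """
    precisions, recalls, f1s = [], [], []
    for pred_row, gt_row in zip(pred_labels, gt_labels):
        # precision/recall/F1 are computed per report over its 14 conditions ...
        p, r, f1, _ = precision_recall_fscore_support(
            gt_row, pred_row, average="binary", zero_division=0
        )
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    # ... and then averaged over all reports
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)
```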

Micro Average:

Micro-averaging calculates the metric globally by counting the total true positives, false positives, and false negatives. It aggregates the contributions of all classes (in our case conditions) to compute the average metric. In our case, we computed the micro average over 5 specific conditions (hence P_mic-5, R_mic-5, and F1_mic-5), meaning we aggregated the counts of true positives, false positives, and false negatives across these 5 conditions in all the generated reports, and then computed precision, recall, and F1 from these aggregated counts.

You can check out the function compute_micro_average_CE_scores to see how we implemented this.
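Again as a simplified sketch (not the actual `compute_micro_average_CE_scores` code), assuming the same binarized `(num_reports, 14)` label arrays and that `condition_indices` holds the column indices of the 5 conditions:

```python
import numpy as np

def micro_average_ce_scores(pred_labels, gt_labels, condition_indices):
    """Simplified sketch of micro-averaging over a subset of conditions.

    condition_indices is a placeholder for whichever columns correspond
    to the 5 evaluated conditions.
    """
    pred = pred_labels[:, condition_indices]
    gt = gt_labels[:, condition_indices]

    # pool TP/FP/FN counts across all reports and selected conditions
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))

    # compute the metrics once from the pooled counts
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```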

Macro Average:

Macro-averaging calculates the metric independently for each class/condition and then takes the unweighted mean over all classes, i.e. it treats all classes equally without considering their proportion in the data.


Regarding the macro-average CE results over all 14 conditions, please have a look at Table B.4 in the supplementary material of our paper (see page 17 in https://arxiv.org/pdf/2304.08295.pdf). You can simply average the results of the 14 conditions to get the macro-averaged results.
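As a simplified illustration of macro-averaging over all 14 conditions (using F1 as the example metric, and assuming the same binarized label arrays as above):

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_average_f1(pred_labels, gt_labels):
    """Simplified sketch of macro-averaging.

    F1 is computed separately for each condition (column), then the
    per-condition scores are averaged with equal weight.
    """
    per_condition_f1 = [
        f1_score(gt_labels[:, c], pred_labels[:, c], zero_division=0)
        for c in range(gt_labels.shape[1])
    ]
    return float(np.mean(per_condition_f1))
```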

I hope these explanations help you understand these metrics better. If you have any more questions, feel free to ask!

PabloMessina commented 1 year ago

Thank you @ttanida for the detailed response, that clarifies everything. I'm used to working with macro and micro averages, and we published some decent results at MedNeurips 2022 (paper here) using them. So when I saw the clinical efficacy results in Table 2 I was impressed, but I had macro averaging in mind and got a bit confused when I read that you were using example-based averaging instead; that confusion has now been cleared up. In our case we report macro and micro averages over the 14 observations, but we used the CheXpert labeler instead of CheXbert. Do you think switching from CheXpert to CheXbert would be easy? Would the results change much, or do you expect them to be highly correlated between the two labelers?

Thank you very much again for your assistance.

ttanida commented 1 year ago

I would definitely assume that the results will be different, as CheXbert is the “upgraded” version of CheXpert.

Switching should be relatively straightforward; you can just take the relevant parts of my language model evaluation code.

PabloMessina commented 1 year ago

Thank you. One more question: do you run CheXbert sentence by sentence and then merge the outputs, or do you run it over the whole report at once? My educated guess is that BERT should detect observations more accurately if the task is simplified by providing one sentence at a time and then merging the per-sentence predictions into report-level predictions, instead of providing the whole report at once.

ttanida commented 1 year ago

Take a look at this line of code where I use CheXbert to extract the predicted conditions for the generated reports.

I just use the label function from src/CheXbert/src/label.py, which is basically copied from the original CheXbert repo. It expects the CheXbert model and a CSV path to the reports as arguments and returns a nested list of predicted conditions. I haven't investigated how exactly CheXbert extracts these conditions "under the hood", i.e. whether it processes the whole report at once or goes sentence by sentence (I would assume the former, though).
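Roughly, a call looks like the sketch below; the exact argument names/order in src/CheXbert/src/label.py may differ, and `chexbert_model` / `csv_path` are placeholders here:

```python
# Rough sketch based on the description above; check src/CheXbert/src/label.py
# for the actual import path and signature.
from src.CheXbert.src.label import label

# chexbert_model: the loaded CheXbert model (assumed to be set up beforehand)
# csv_path: path to a CSV file containing the generated reports
preds = label(chexbert_model, csv_path)

# preds: nested list of predicted conditions for the reports in the CSV
```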

However, I think you should use CheXbert as it was intended, i.e. just pass the required arguments to the label function and take the predicted conditions it returns, since "messing" with the evaluation method would make it impossible to compare your results with those of other researchers.