Open reflelia opened 1 month ago
Thank you for pointing this out. In fact, we did not use 'pre' as hard predictions for calculating metrics; we only used the scores to compute the AUC, so this part of the code can be ignored. If we need metrics that require hard predictions in the future, we would prefer prompts such as 'There is [Disease]' and 'There is no [Disease]'. After generating the corresponding scores, we can make a hard prediction for each disease by comparing the two scores.
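For readers following along, here is a minimal sketch of what this could look like. It is not the code from inference.py: the function names `binary_auc` and `hard_predictions`, the array shapes, and all score values are made up for illustration. It computes a per-disease AUC directly from the scores and derives hard predictions by comparing the score of the positive prompt against the score of the negative prompt.

```python
import numpy as np

def binary_auc(y_true, y_score):
    """AUC for one disease via pairwise comparison (ties count as 0.5)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC is undefined without both classes
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def hard_predictions(pos_scores, neg_scores):
    """Mark a disease present when the 'There is [Disease]' score
    exceeds the 'There is no [Disease]' score."""
    return (np.asarray(pos_scores) > np.asarray(neg_scores)).astype(int)

# Hypothetical data: 4 images x 3 diseases (values are invented).
labels = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 0],
                   [0, 1, 0]])
pos_scores = np.array([[0.9, 0.2, 0.7],
                       [0.3, 0.1, 0.8],
                       [0.6, 0.7, 0.4],
                       [0.2, 0.6, 0.1]])
neg_scores = np.array([[0.1, 0.8, 0.3],
                       [0.7, 0.9, 0.2],
                       [0.4, 0.3, 0.6],
                       [0.8, 0.4, 0.9]])

# AUC only needs the scores, one column per disease.
per_class_auc = [binary_auc(labels[:, k], pos_scores[:, k])
                 for k in range(labels.shape[1])]

# Hard predictions are made independently per disease, so an image
# can receive zero, one, or several positive labels.
preds = hard_predictions(pos_scores, neg_scores)
```

Because each disease is decided on its own pair of scores, this stays compatible with multi-label ground truth.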
The process used to generate predicted labels in the inference.py script does not seem to take multi-label classification into account. The ground truth labels contain probabilities for multiple lesions (since this is a multi-label classification task), but the script uses `np.argmax` to select only a single lesion out of 14 possible lesions (ChestX-ray14 dataset).

In summary, is the inference.py script wrong for this task? Or is the task itself not intended to be multi-label classification?
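To illustrate the concern, here is a small sketch with made-up scores (3 lesions instead of 14, and an assumed threshold of 0.5; neither comes from inference.py). `np.argmax` always returns exactly one lesion per image, while a per-lesion threshold can return several or none.

```python
import numpy as np

# Hypothetical per-lesion scores for 2 images x 3 lesions.
scores = np.array([[0.8, 0.7, 0.1],   # two lesions look likely
                   [0.2, 0.1, 0.3]])  # no lesion looks likely

# argmax: one lesion index per image, regardless of how many are present.
single = np.argmax(scores, axis=1)

# Per-lesion threshold (0.5 is an assumed value): multi-label output.
multi = (scores >= 0.5).astype(int)
```

With these numbers, `single` picks lesion 0 for the first image and lesion 2 for the second, whereas `multi` marks lesions 0 and 1 for the first image and nothing for the second.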