mlmed / torchxrayvision

TorchXRayVision: A library of chest X-ray datasets and models. Classifiers, segmentation, and autoencoders.
https://mlmed.org/torchxrayvision
Apache License 2.0

Bad performance when making predictions with the CheXpert model #48

Closed lkourti closed 3 years ago

lkourti commented 3 years ago

Hi, thanks for making your work available! I'm trying to do a fairness analysis, and as a first step I need to obtain the model's predictions. I'm focusing on the CheXpert dataset and the CheXpert model. I reproduced the same split (seed=0) as you do, and then made predictions for the test set using your CheXpert model. Computing AUC and other metrics on the test set results in quite mediocre performance, far worse than what is reported in the paper. So I was wondering if I'm missing something big.

Let me note here that I am using the 'small' version of CheXpert (same as you do) and that I am transforming the test set data when I create the CheX_Dataset object in the following way: [screenshot of the CheX_Dataset/transform setup]
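For reference, a minimal sketch of what that setup typically looks like with the xrv transforms; the paths and the 224-pixel resolution here are assumptions, not the original poster's exact code:

```python
import torchvision
import torchxrayvision as xrv

# Standard xrv preprocessing: center-crop to a square, then resize to the model input size.
transform = torchvision.transforms.Compose([
    xrv.datasets.XRayCenterCrop(),
    xrv.datasets.XRayResizer(224),
])

# CheXpert-small; imgpath/csvpath are placeholders for wherever the dataset lives locally.
d_chex = xrv.datasets.CheX_Dataset(
    imgpath="CheXpert-v1.0-small",
    csvpath="CheXpert-v1.0-small/train.csv",
    transform=transform,
)
```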

Your feedback on what I might be doing wrong would be extremely helpful!

ieee8023 commented 3 years ago

The "chex" model provided is not the "CheXpert" model from their paper. It is a model that I trained on their data.

I am planning to integrate the CheXpert model in baseline_models soon. Here is their code and weights: https://worksheets.codalab.org/bundles/0x391af631e8a24c919bee32622e35ef06

lkourti commented 3 years ago

Thank you for the quick reply! So the 'chex' model isn't the one the authors of CheXpert trained, but it is the one for which you present results in your paper (https://arxiv.org/pdf/2002.02497.pdf), right? I'm not achieving anything close to the performance you report there: [screenshot of the results plot from the paper]

Is there any reason why I shouldn't expect similar performance? Do you remember on which subset of the CheXpert dataset you evaluated the 'chex' model? Also, how did you handle the 'Uncertain' (-1) labels in CheXpert? For example, if your model predicted 1 (positive for a disease) and the true label is -1 (i.e. uncertain), do you consider this correctly or falsely classified? Thank you in advance!

ieee8023 commented 3 years ago

Yes, it is used in that paper. The results in that plot should be similar to the results you achieve. The splits were randomly sampled and I didn't save them. That plot is the average of 3 training/valid/test splits.

The -1 labels are ignored, both during training as well as during evaluation.
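As a concrete illustration of that convention, here is a minimal sketch of a per-pathology AUC that skips the uncertain entries; the function and variable names are hypothetical, and it handles both -1 and NaN encodings of the uncertain label:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_ignoring_uncertain(y_true, y_score):
    """Per-task AUC where uncertain (-1) and missing (NaN) labels are excluded."""
    aucs = {}
    for task in range(y_true.shape[1]):
        labels, scores = y_true[:, task], y_score[:, task]
        keep = (labels != -1) & ~np.isnan(labels)   # drop uncertain / missing labels
        if len(np.unique(labels[keep])) == 2:       # AUC needs both classes present
            aucs[task] = roc_auc_score(labels[keep], scores[keep])
    return aucs
```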

Are you loading the samples with the xrv dataloader? The images need to be normalized using the function xrv.datasets.normalize, as shown in this script: https://github.com/mlmed/torchxrayvision/blob/master/scripts/process_image.py#L30
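For anyone preparing images outside the dataloader, a rough sketch of the preprocessing that script performs on a single image; the input path is a placeholder and the 224 resolution assumes one of the default 224-res models:

```python
import skimage.io
import torchvision
import torchxrayvision as xrv

img = skimage.io.imread("example_cxr.jpg")   # placeholder path
img = xrv.datasets.normalize(img, 255)       # map 8-bit pixel values into the range the xrv models expect
if img.ndim == 3:
    img = img.mean(2)                        # collapse RGB to a single grey channel
img = img[None, ...]                         # channel-first: (1, H, W)

transform = torchvision.transforms.Compose([
    xrv.datasets.XRayCenterCrop(),
    xrv.datasets.XRayResizer(224),
])
img = transform(img)                         # ready to be batched and passed to the model
```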

Can you post your evaluation script and I can check it?

lkourti commented 3 years ago

I am using the xrv dataloader, so the normalization should be done correctly. Your comment on handling -1 labels was very helpful, thank you! Some other papers that work with the CheXpert dataset map -1 to the negative class. I did the same, and that led to an inaccurate performance evaluation. By ignoring -1 during evaluation, I'm now getting performance similar to yours.

One last question: your model outputs scores, so how should I set the thresholds for the test set to be consistent with what you are doing?

ieee8023 commented 3 years ago

Great!

The outputs are calibrated using the operating point of the AUC, as discussed in these papers (https://arxiv.org/abs/2002.02497 and https://arxiv.org/abs/1901.11210), so 0.5 should be the estimated decision boundary.

For evaluation, the AUC takes all thresholds into account, so it doesn't matter. But when using the model, I use >0.6 for positive and <0.4 for negative as thresholds, just to have a good margin.
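Read concretely, that convention looks something like the sketch below; the ternary output (1 = positive, 0 = negative, -1 = abstain) is just one way to represent the margin, not part of the library:

```python
import numpy as np

def decide(scores, pos_thr=0.6, neg_thr=0.4):
    """Map calibrated scores to 1 (positive), 0 (negative), or -1 (abstain) using a margin around 0.5."""
    decisions = np.full(scores.shape, -1, dtype=int)
    decisions[scores > pos_thr] = 1
    decisions[scores < neg_thr] = 0
    return decisions
```

Scores falling in the 0.4-0.6 band are the ones discussed below as candidates for deferral to an expert reader.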

romanlutz commented 3 years ago

@ieee8023 Thanks a ton for your help on this!

I have a small follow-up question: If you apply two thresholds, would the cases in between 0.4 and 0.6 be left for review by a radiologist?

ieee8023 commented 3 years ago

First off, this model is not for medical use and is just for research.

I think it depends on your use case. I think they should all be reviewed by an expert user. I wrote about my perspective on how we should be using these models in this paper: https://openreview.net/forum?id=rnunjvgxAMt

romanlutz commented 3 years ago

I completely agree, especially after reading https://lukeoakdenrayner.wordpress.com/2019/02/25/half-a-million-x-rays-first-impressions-of-the-stanford-and-mit-chest-x-ray-datasets/. Providing richer output to the expert user (like you're doing in Gifsplanation, for example) is very much needed.

I was just curious why you'd leave the margin between 0.4 and 0.6, and the only reason I could come up with was this. My inquiry is purely for research purposes on fairness, as @lkourti noted, and we'll make sure to put similar disclaimers on anything resulting from it.

ieee8023 commented 3 years ago

I picked those numbers because there are a lot of errors around 0.5. 0.25 and 0.75 could also be good thresholds. I would prefer to pick thresholds using a desired PPV or NPV, but the performance isn't good enough: if you try to target a PPV of 80%, the threshold ends up around 0.99.
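For illustration, picking a threshold to hit a target PPV on held-out scores could look like the sketch below; the helper name and the 80% target are hypothetical, and with a model in this performance range the returned threshold can indeed land near 0.99:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_target_ppv(y_true, y_score, target_ppv=0.80):
    """Smallest score threshold whose precision (PPV) on this data meets the target, or None."""
    precision, _, thresholds = precision_recall_curve(y_true, y_score)
    # precision has one more entry than thresholds; drop the trailing sentinel value.
    meets_target = np.where(precision[:-1] >= target_ppv)[0]
    return thresholds[meets_target[0]] if len(meets_target) else None
```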

ieee8023 commented 3 years ago

FYI: The true CheXpert model is now available! https://twitter.com/torchxrayvision/status/1410255900862976007

Also, I ran benchmarks on the models that are available in the repo now: https://github.com/mlmed/torchxrayvision/blob/master/BENCHMARKS.md
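For completeness, a minimal sketch of loading the xrv 'chex' classifier discussed in this thread and reading off its calibrated scores; the dummy batch is a placeholder for properly preprocessed images, and the weight tag follows the current naming scheme, which may differ in older releases:

```python
import torch
import torchxrayvision as xrv

# The "chex" model trained by the xrv authors (not the Stanford CheXpert model,
# which is integrated separately under baseline_models as announced above).
model = xrv.models.DenseNet(weights="densenet121-res224-chex")
model.eval()

# Placeholder batch; in practice this comes from the xrv dataloader or the
# preprocessing sketch shown earlier in the thread.
batch = torch.zeros(1, 1, 224, 224)

with torch.no_grad():
    scores = model(batch)

# model.pathologies gives the label order for the output columns.
for name, score in zip(model.pathologies, scores[0].tolist()):
    print(f"{name}: {score:.3f}")
```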