Discussion: Frontal/Lateral images and mislabel #90

Open · kdg1993 opened this issue 1 year ago

kdg1993 commented 1 year ago

What

Fixing mislabels in frontal/lateral information for better performance

Why

Mislabels in the frontal/lateral information might affect performance more strongly than we expected.

We've conventionally trained our model only on frontal-view data. Using only the frontal view is not unusual; Pham, H. H. et al. (2021), one of our preferred papers, also mentions this restriction of the data during training.

However, it has so far been hard to push the validation ROC-AUC above 0.9. Since a ROC-AUC of 0.9 indicates that the easy samples are already classified well while some difficult samples remain, it could be a sign that we should look at the data again.

I suspect mislabeling of the frontal/lateral view is one of the reasons for this ceiling on the validation score. Restricting the data to the frontal view only could make us more vulnerable to mislabels in the frontal/lateral information than using both views. In the asymmetric loss paper, Ridnik, T. et al. (2021) emphasized the need to reject ground-truth mislabels because they heavily affect the loss value and drive the training process the wrong way. Furthermore, in my personal opinion, this could be worse than target-class mislabeling because there is no clear way to handle it other than data preprocessing.

Let's dive into the four cases of mislabels existing in train and test. ('Test' below is interchangeable with 'validation'.)

1. Mislabels in both train and test. The most likely of the four cases, and the one I want us to find a way to handle. On the positive side, train and test share a similar distribution, but it is hard to expect the model to learn the pattern of lateral-view images from such a small number of samples. So the mislabels disturb the optimizer during training and also distort evaluation by unfairly lowering the test scores.
2. Mislabels in train but not in test. This case would be better than cases 1 and 3. Mislabeled data might drive training the wrong way, but if the number of mislabels in train is small, the test process will behave as we want. However, given the data structure of MIMIC, this is hard to expect.
3. Mislabels in test but not in train. Probably the worst case. The model has no way to learn to handle lateral images; they are unseen, out-of-distribution samples, and the scores might look like overfitting. However, given how large train is compared to test, the probability of this case seems very low.
4. No mislabels in either train or test. In this case there is no problem, but I think it is hard to guarantee. Even if CheXpert is clean in this respect, there is no such guarantee for other datasets such as MIMIC, BRAX, etc.


How

I suggest adding a preprocessing Python script to handle this problem.
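One possible shape for that script, as a minimal sketch: it assumes the CheXpert CSV convention (a 'Path' column whose file names end in `_frontal.jpg` or `_lateral.jpg` and a 'Frontal/Lateral' column); the paths and the drop-vs-overwrite policy below are placeholders, not the final design.

```python
import pandas as pd


def clean_view_labels(csv_path: str, out_path: str) -> None:
    """Drop (or optionally fix) rows whose 'Frontal/Lateral' label disagrees with the file name."""
    df = pd.read_csv(csv_path)

    # Derive the view from the file name, e.g. '.../view1_frontal.jpg' -> 'Frontal'
    view_from_name = (
        df["Path"]
        .str.extract(r"_(frontal|lateral)\.jpg$", expand=False)
        .str.capitalize()
    )

    # Rows where the file name and the column disagree (or the name is unparsable)
    mismatch = view_from_name != df["Frontal/Lateral"]
    print(f"{csv_path}: {mismatch.sum()} suspicious rows out of {len(df)}")

    # Alternative policy: trust the file name and overwrite the column instead of dropping
    # df.loc[mismatch, "Frontal/Lateral"] = view_from_name[mismatch]
    df[~mismatch].to_csv(out_path, index=False)


if __name__ == "__main__":
    clean_view_labels("CheXpert-v1.0/train.csv", "train_cleaned.csv")  # placeholder paths
```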

seoulsky-field commented 1 year ago

Wow, what sincere content! I can see your deep consideration in this issue.

Anyway, let's talk about it more. First, in my opinion, the biggest reason we cannot push the validation AUROC past 0.9 (or more) is the "-1" (uncertain) labels. But that is just my guess about the biggest factor; I agree that frontal/lateral mislabeling is something we have to check.

Second, I guess the test set might not have mislabels, because both the validation and test sets of CheXpert were checked by radiologists, while the other datasets could have them. Of course, to be more certain, the best way is for us to check the validation and test images ourselves.

So, I suggest a new method:

1. Double-check the CheXpert valid and test sets. Because of their small size and their reliability, this should be easy. (If we think we need more data, we can also pull train images!)
2. Fix any mislabels that exist, then train a binary classification model (Frontal vs. Lateral; see the sketch below). I think the difference between the two views is much clearer than classifying the 14 CheXpert labels.
3. Check the accuracy score and inspect the errors.
4. Check MIMIC-CXR's valid or test sets, or BRAX's, the same way as in step 1.
5. Run inference with the model from step 3 on the datasets from step 4.
6. Check the results.
7. Apply the same process to the other datasets' train/valid/test splits and double-check the results.

If we use this method, it may take more time than other approaches; however, we can give users a way to automatically clean up mislabels. (We can provide the .pth weights and notebooks!)
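For step 2, the Frontal-vs-Lateral classifier could be as small as a fine-tuned ResNet-18. A minimal sketch, assuming the CheXpert CSV layout described above ('Path' and 'Frontal/Lateral' columns) and images resolvable under a data root; paths and hyperparameters are placeholders, not a final training setup:

```python
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms
from PIL import Image

CSV_PATH = "CheXpert-v1.0/train.csv"   # placeholder path
DATA_ROOT = "."                        # directory the 'Path' entries are relative to


class ViewDataset(Dataset):
    """Yields (image, label) pairs with label 0 for Frontal and 1 for Lateral."""

    def __init__(self, df, root, transform):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.root}/{row['Path']}").convert("RGB")
        label = 0 if row["Frontal/Lateral"] == "Frontal" else 1
        return self.transform(image), label


transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

df = pd.read_csv(CSV_PATH)
loader = DataLoader(ViewDataset(df, DATA_ROOT, transform), batch_size=32, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # 2-way head: Frontal vs Lateral
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(1):  # a single pass may already be enough; the two views look very different
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The trained weights could then be exported with `torch.save(model.state_dict(), "frontal_lateral.pth")` and shared together with a notebook, matching the idea of shipping .pth files above.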

seoulsky-field commented 1 year ago

Today, I checked test_labels.csv for the CheXpert test set from the CheXlocalize dataset, which was downloaded via Azure Storage Explorer. Unlike train.csv and valid.csv, it doesn't have a 'Frontal/Lateral' column, so I got the view position values from the file names. (All of the split CSV files carry the view position in the 'Path' column.) After that, I checked one by one that these view position values match the images. Fortunately, no mislabeled values exist.

Using the same methodology, I then checked train.csv and valid.csv in the CheXpert dataset. Here I couldn't check the match between the view positions from the 'Path' column and the images themselves, but that isn't necessary either, because both train.csv and valid.csv have a 'Frontal/Lateral' column. In other words, I only needed to compare the view position derived from the file names against the 'Frontal/Lateral' column in each CSV file.
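For reference, this comparison can be done with a short pandas cross-tabulation; the following is a sketch of roughly what is described above, with placeholder CSV paths. (test_labels.csv would only get the file-name side, since it has no 'Frontal/Lateral' column.)

```python
import pandas as pd

# Cross-tabulate the view encoded in the file name against the 'Frontal/Lateral'
# column; any off-diagonal count would point to a mislabel.
for csv_path in ["CheXpert-v1.0/train.csv", "CheXpert-v1.0/valid.csv"]:  # placeholders
    df = pd.read_csv(csv_path)
    view_from_name = (
        df["Path"]
        .str.extract(r"_(frontal|lateral)\.jpg$", expand=False)
        .str.capitalize()
        .rename("view_from_filename")
    )
    print(csv_path)
    print(pd.crosstab(view_from_name, df["Frontal/Lateral"]), end="\n\n")
```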

The two images below show the results: the top one for train.csv and the bottom one for valid.csv. As you can see, fortunately, there are no mislabeled values in the CheXpert dataset! (Of course, it would be even better to double-check every image against its column value one by one, but that isn't easy for us because we don't have enough time.)

[Screenshots: check results for train.csv (top) and valid.csv (bottom)]

And thanks for the great discussion! Also, we must run the same check on the MIMIC dataset!