Discussion: Frontal/Lateral images and mislabel #90

Open · kdg1993 opened this issue 1 year ago

kdg1993 commented 1 year ago

What

Fixing mislabels in frontal/lateral information for better performance

Why

Mislabels in the frontal/lateral information might affect performance more strongly than we expected.

We've conventionally trained our model only on frontal-view data. Using only the frontal view is not unusual; Pham, H. H. et al. (2021), one of our preferred papers, also mentions this restriction of the data during training.

However, it has so far been hard to push the validation ROC-AUC above 0.9. Since a ROC-AUC of 0.9 indicates that the easy samples are already classified well while some difficult samples remain, it could be a sign that we should look at the data again.

I suspect mislabeling of the frontal/lateral view is one of the reasons for this ceiling on the validation score. Restricting the data to the frontal view only could make us more vulnerable to mislabels in the frontal/lateral information than using both views. In the asymmetric loss paper, Ridnik, T. et al. (2021) emphasized the need to reject ground-truth mislabels because they heavily affect the loss value and drive the training process the wrong way. Furthermore, in my personal opinion, this could be worse than target-class mislabeling because there is no clear way to handle it other than data preprocessing.

Let's dive into the four cases of mislabels existing in train and test. ('Test' below is interchangeable with 'validation'.)

1. Mislabels in both train and test. The most likely of the four cases, and the one I want us to find a way to handle. On the positive side, train and test share a similar distribution, but it is hard to expect the model to learn the pattern of lateral-view images from such a small number of samples. So the mislabels disturb the optimizer during training and also distort evaluation by unfairly lowering the test scores.
2. Mislabels in train but not in test. This case would be better than cases 1 and 3. Mislabeled data might drive training the wrong way, but if the number of mislabels in train is small, the test process will behave as we want. However, given the data structure of MIMIC, this is hard to expect.
3. Mislabels in test but not in train. Probably the worst case. The model has no way to learn to handle lateral images; they are unseen, out-of-distribution samples, and the scores might look like overfitting. However, given how large train is compared to test, the probability of this case seems very low.
4. No mislabels in either train or test. In this case there is no problem, but I think it is hard to guarantee. Even if CheXpert is clean in this respect, there is no such guarantee for other datasets such as MIMIC, BRAX, etc.


How

I suggest adding a preprocessing Python script to handle this problem.
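One possible shape for that script, as a minimal sketch: it assumes the CheXpert CSV convention (a 'Path' column whose file names end in `_frontal.jpg` or `_lateral.jpg` and a 'Frontal/Lateral' column); the paths and the drop-vs-overwrite policy below are placeholders, not the final design.

```python
import pandas as pd


def clean_view_labels(csv_path: str, out_path: str) -> None:
    """Drop (or optionally fix) rows whose 'Frontal/Lateral' label disagrees with the file name."""
    df = pd.read_csv(csv_path)

    # Derive the view from the file name, e.g. '.../view1_frontal.jpg' -> 'Frontal'
    view_from_name = (
        df["Path"]
        .str.extract(r"_(frontal|lateral)\.jpg$", expand=False)
        .str.capitalize()
    )

    # Rows where the file name and the column disagree (or the name is unparsable)
    mismatch = view_from_name != df["Frontal/Lateral"]
    print(f"{csv_path}: {mismatch.sum()} suspicious rows out of {len(df)}")

    # Alternative policy: trust the file name and overwrite the column instead of dropping
    # df.loc[mismatch, "Frontal/Lateral"] = view_from_name[mismatch]
    df[~mismatch].to_csv(out_path, index=False)


if __name__ == "__main__":
    clean_view_labels("CheXpert-v1.0/train.csv", "train_cleaned.csv")  # placeholder paths
```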

seoulsky-field commented 1 year ago

Wow, what sincere content! I can see your deep consideration in this issue.

Anyway, let's talk about it more. First, in my opinion, the biggest reason we cannot push the validation AUROC past 0.9 (or more) is the "-1" (uncertain) labels. But that is just my guess about the biggest factor; I agree that frontal/lateral mislabeling is something we have to check.

Second, I guess the test set might not have mislabels, because both the validation and test sets of CheXpert were checked by radiologists, while the other datasets could have them. Of course, to be more certain, the best way is for us to check the validation and test images ourselves.

So, I suggest a new method:

1. Double-check the CheXpert valid and test sets. Because of their small size and their reliability, this should be easy. (If we think we need more data, we can also pull train images!)
2. Fix any mislabels that exist, then train a binary classification model (Frontal vs. Lateral; see the sketch below). I think the difference between the two views is much clearer than classifying the 14 CheXpert labels.
3. Check the accuracy score and inspect the errors.
4. Check MIMIC-CXR's valid or test sets, or BRAX's, the same way as in step 1.
5. Run inference with the model from step 3 on the datasets from step 4.
6. Check the results.
7. Apply the same process to the other datasets' train/valid/test splits and double-check the results.

If we use this method, it may take more time than other approaches; however, we can give users a way to automatically clean up mislabels. (We can provide the .pth weights and notebooks!)
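For step 2, the Frontal-vs-Lateral classifier could be as small as a fine-tuned ResNet-18. A minimal sketch, assuming the CheXpert CSV layout described above ('Path' and 'Frontal/Lateral' columns) and images resolvable under a data root; paths and hyperparameters are placeholders, not a final training setup:

```python
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms
from PIL import Image

CSV_PATH = "CheXpert-v1.0/train.csv"   # placeholder path
DATA_ROOT = "."                        # directory the 'Path' entries are relative to


class ViewDataset(Dataset):
    """Yields (image, label) pairs with label 0 for Frontal and 1 for Lateral."""

    def __init__(self, df, root, transform):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.root}/{row['Path']}").convert("RGB")
        label = 0 if row["Frontal/Lateral"] == "Frontal" else 1
        return self.transform(image), label


transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

df = pd.read_csv(CSV_PATH)
loader = DataLoader(ViewDataset(df, DATA_ROOT, transform), batch_size=32, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # 2-way head: Frontal vs Lateral
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(1):  # a single pass may already be enough; the two views look very different
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The trained weights could then be exported with `torch.save(model.state_dict(), "frontal_lateral.pth")` and shared together with a notebook, matching the idea of shipping .pth files above.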

seoulsky-field commented 1 year ago

Today, I checked test_labels.csv for the CheXpert test set from the CheXlocalize dataset, which was downloaded via Azure Storage Explorer. Unlike train.csv and valid.csv, it doesn't have a 'Frontal/Lateral' column, so I got the view position values from the file names. (All of the split CSV files carry the view position in the 'Path' column.) After that, I checked one by one that these view position values match the images. Fortunately, no mislabeled values exist.

Using the same methodology, I then checked train.csv and valid.csv in the CheXpert dataset. Here I couldn't check the match between the view positions from the 'Path' column and the images themselves, but that isn't necessary either, because both train.csv and valid.csv have a 'Frontal/Lateral' column. In other words, I only needed to compare the view position derived from the file names against the 'Frontal/Lateral' column in each CSV file.
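For reference, this comparison can be done with a short pandas cross-tabulation; the following is a sketch of roughly what is described above, with placeholder CSV paths. (test_labels.csv would only get the file-name side, since it has no 'Frontal/Lateral' column.)

```python
import pandas as pd

# Cross-tabulate the view encoded in the file name against the 'Frontal/Lateral'
# column; any off-diagonal count would point to a mislabel.
for csv_path in ["CheXpert-v1.0/train.csv", "CheXpert-v1.0/valid.csv"]:  # placeholders
    df = pd.read_csv(csv_path)
    view_from_name = (
        df["Path"]
        .str.extract(r"_(frontal|lateral)\.jpg$", expand=False)
        .str.capitalize()
        .rename("view_from_filename")
    )
    print(csv_path)
    print(pd.crosstab(view_from_name, df["Frontal/Lateral"]), end="\n\n")
```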

The two images below show the results: the top one for train.csv and the bottom one for valid.csv. As you can see, fortunately, there are no mislabeled values in the CheXpert dataset! (Of course, it would be even better to double-check every image against its column value one by one, but that isn't easy for us because we don't have enough time.)

[Screenshots: check results for train.csv (top) and valid.csv (bottom)]

And thanks for the great discussion! Also, we must run the same check on the MIMIC dataset!