Open kdg1993 opened 1 year ago
In the paper that ranked 2nd on the CheXpert benchmark, I found that various experiments on uncertain labels were conducted (https://arxiv.org/abs/1911.06475).
A brief summary of the experimental setting is as follows, and the results table for the experiment is attached below.
- default setting: U-Ignore, U-Ones, U-Zeros
- additional policy:
  - CT (Conditional Training): takes the hierarchical structure between labels into account
  - LSR (Label Smoothing Regularization): softens the converted uncertain labels into soft targets instead of hard 0/1 (a rough code sketch of these policies is below)
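To make the policies concrete, here is a minimal sketch of how I understand them. The function name and the LSR ranges (roughly U(0.55, 0.85) / U(0, 0.3)) are my reading of the paper, not code from our repo, so please correct me if I got them wrong:

```python
import numpy as np

def apply_uncertainty_policy(labels, policy="U-Ones", rng=None):
    """Map CheXpert uncertain labels (-1) according to a policy.

    labels: float array containing {1, 0, -1}; returns (labels, mask), where
    mask marks which entries should contribute to the loss.
    """
    if rng is None:
        rng = np.random.default_rng()
    labels = labels.astype(np.float32).copy()
    uncertain = labels == -1
    mask = np.ones_like(labels, dtype=bool)

    if policy == "U-Ignore":
        mask[uncertain] = False                 # drop uncertain entries from the loss
    elif policy == "U-Ones":
        labels[uncertain] = 1.0
    elif policy == "U-Zeros":
        labels[uncertain] = 0.0
    elif policy == "U-Ones+LSR":
        # smoothed positive targets for uncertain entries
        labels[uncertain] = rng.uniform(0.55, 0.85, size=uncertain.sum())
    elif policy == "U-Zeros+LSR":
        # smoothed negative targets for uncertain entries
        labels[uncertain] = rng.uniform(0.0, 0.3, size=uncertain.sum())
    else:
        raise ValueError(f"unknown policy: {policy}")
    return labels, mask
```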
Presumably, the fact that LibAUC applies a different policy to only two of the columns means they ran several experiments themselves and chose the most accurate option per column.
Thank you so much for resolving my question about the reason for converting the labels, @jieonh 👍
If I understand what you shared correctly, the table justifies the conversion by validation score. I couldn't agree more that a score-based method is a concrete and well-supported way to choose an experimental setting.
However, in terms of giving the users of our testbed convincing options for their experiments, I still think my suggestion to expand the data-conversion options is worth doing.
So first, I want to ask whether this work is worth doing. Second, if it is, I'd like to know whether anyone is interested in doing it. If it is worth doing but everyone is busy, then I think it's on me to do it 😄 Please let me know your opinion; even a single word of reply is totally fine and appreciated.
When I looked it up a little bit more, it seemed that there is a lot of ongoing research on uncertainty quantification. I guess that's because the importance of data-centric AI is emerging these days, so I agree that further investigating the data itself is worthwhile.
I'm not sure I can fully concentrate on that task for now, but I can assist you or do some research to catch up (if that would be any help!).
+) Does anyone know the exact difference between uncertain labels (-1) and missing values (NaN)? I'm a little bit confused.
In addition, I found a detailed datasheet for CheXpert for those of you who may be interested: https://arxiv.org/pdf/2105.03020.pdf
You can refer to pp. 3-6 for the info we are looking for (the labeling protocol)! In summary, the labeling section of this sheet explains how the labels are assigned based on the keywords found in the report, and how the 'No Finding' label is assigned (which fully addressed my concern about normal data).
+) I think this sheet might also explain the difference between the -1 and NaN labels that @jieonh just asked about. You can refer to Table 3 of the sheet, which describes the label definitions.
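If I read Table 3 correctly, -1 means the labeler found an explicitly uncertain mention in the report, while NaN means the observation was simply not mentioned at all. If anyone wants to see both cases in the data, here is a quick pandas check; the file path and the column slicing are assumptions based on the original CheXpert-v1.0-small CSV layout:

```python
import pandas as pd

# Assumed local path; adjust to wherever the CheXpert CSV lives
df = pd.read_csv("CheXpert-v1.0-small/train.csv")
label_cols = df.columns[5:]  # the 14 observation columns in the original layout

summary = pd.DataFrame({
    "uncertain (-1)": (df[label_cols] == -1).sum(),
    "not mentioned (NaN)": df[label_cols].isna().sum(),
})
print(summary)
```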
What a nice reference you shared, @chrstnkgn! :satisfied:
I haven't fully read the whole paper yet, but what you shared already resolved some of my questions! In particular, pp. 2-5, Fig. 1, and Table 3 are thoroughly informative and convinced me that converting the uncertain labels (-1) is worth doing for the sake of similarity between the train and validation sets.
What
Why
While looking around the target class distribution of the CheXpert CSV data, I found an interesting possibility for data handling. The figure below is a snapshot of the target distribution from my personal exploration of CheXpert.
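For anyone who wants to reproduce the snapshot, it roughly came from counting the label values per observation column, something like the sketch below (the path and the column slicing are assumptions about the original CSV layout):

```python
import pandas as pd

df = pd.read_csv("CheXpert-v1.0-small/train.csv")  # assumed path
label_cols = df.columns[5:]                         # the 14 observation columns

# Count 1 / 0 / -1 / NaN per observation column
dist = df[label_cols].apply(lambda s: s.value_counts(dropna=False)).fillna(0).astype(int)
print(dist)
```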
Meanwhile, our current custom Dataset class converts these values in a fixed way (not sure, but I guess this way of converting was chosen based on score).
Likewise, I think there are many statistical or intuitive ways of handling missing values from the traditional ML field that could be applied here. So I want to discuss it, and I would carefully like to ask for help making this idea usable in our custom code.
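As a starting point for the discussion, here is a hypothetical sketch of what an expanded conversion option could look like; the function name, the `default`/`overrides` arguments, and the policy names are all made up for illustration, not something that exists in our code or in LibAUC:

```python
import numpy as np

def convert_uncertain(labels, columns, default="zeros", overrides=None):
    """labels: 2D float array with values in {1, 0, -1, nan}; columns: column names."""
    overrides = overrides or {}
    out = np.nan_to_num(labels.astype(np.float32), nan=0.0)  # blank (not mentioned) -> 0, my assumption
    for j, col in enumerate(columns):
        policy = overrides.get(col, default)
        uncertain = labels[:, j] == -1
        if policy == "ones":
            out[uncertain, j] = 1.0
        elif policy == "zeros":
            out[uncertain, j] = 0.0
        elif policy == "lsr-ones":  # soft labels, borrowing the LSR idea from the paper above
            out[uncertain, j] = np.random.uniform(0.55, 0.85, uncertain.sum())
        else:
            raise ValueError(f"unknown policy: {policy}")
    return out

# e.g. zeros by default, U-Ones only for the classes the paper reports it works better on
# (if I read it right):
# y = convert_uncertain(y_raw, cols, default="zeros",
#                       overrides={"Atelectasis": "ones", "Edema": "ones"})
```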
FYI, I included the distribution of the valid set just for sharing knowledge, but I'm afraid that taking the validation-set distribution into account might lead to a data leakage issue. Probably everyone knows this already, but I mention it just as a reminder 😄
How