seoulsky-field / CXRAIL-dev

MIT License

Hotfix: Change the order of train_size in the preprocessing sequence #73

Closed kdg1993 closed 1 year ago

kdg1993 commented 1 year ago

What

Change the order of the preprocessing sequence

from: restrict train_size by sampling -> frontal or lateral restriction -> enhancement

to: frontal or lateral restriction -> enhancement -> restrict train_size by sampling

Why

So far, the training data size restriction has been applied at an early stage of data preprocessing. However, this order can return fewer samples than the given train_size, whether it is an integer or a float (thanks for noticing, @seoulsky-field). For example, if you set train_size to 100 and use_frontal to True, the code first samples 100 rows and then keeps only the frontal images, so it returns <= 100 images. To avoid this, I checked which dataset options affect the number of samples and found that use_frontal can reduce it and enhancement (upsampling) can increase it.
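The effect of the ordering can be illustrated with a minimal sketch. This is not the project's code: the function names, the "view" field, and the toy data are assumptions made up for illustration, and only integer train_size is handled.

```python
import random

def sample_then_filter(rows, train_size, seed=0):
    """Old order (hypothetical sketch): sample first, then keep frontal images."""
    rng = random.Random(seed)
    sampled = rng.sample(rows, train_size)
    return [r for r in sampled if r["view"] == "frontal"]

def filter_then_sample(rows, train_size, seed=0):
    """New order (hypothetical sketch): keep frontal images first, then sample."""
    rng = random.Random(seed)
    frontal = [r for r in rows if r["view"] == "frontal"]
    # Guard against asking for more rows than remain after filtering.
    return rng.sample(frontal, min(train_size, len(frontal)))

# Toy data (made-up numbers): 300 frontal and 200 lateral images.
rows = [
    {"id": i, "view": "frontal" if i % 5 < 3 else "lateral"}
    for i in range(500)
]

old = sample_then_filter(rows, 100)
new = filter_then_sample(rows, 100)

# The old order can only return at most train_size rows;
# the new order returns exactly train_size when enough frontal rows exist.
print(len(old), len(new))
```

With the old order, the result size depends on how many lateral images happen to land in the sample. With the new order, it equals train_size whenever enough rows survive the filter.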

While analyzing the effects of these options, I found that enhancement is quite complicated and might return a result that is far from what the user expects. Currently, enhancement accepts multiple target columns and n_times (the upsampling factor). Since it processes each target column independently (that is, it does not consider co-occurrence between labels), it duplicates more than the given n_times because of the inherent nature of the multi-label problem.

Here is a really simple example of the enhancing sequence in our codeset.

original (3A, 4B) -> Enhancing 'A' 2-times (6A, 6B) -> Enhancing 'B' 2-times (8A, 10B) <- more than 2-times of 'A' and 'B'

A B           A B               A B
1 0           1 0               1 0
1 1           1 1               1 1
1 1           1 1               1 1
0 1           0 1               0 1
0 1           0 1               0 1

              1 0               1 0
              1 1               1 1
              1 1               1 1

                                1 1
                                1 1
                                0 1
                                0 1


It is difficult to say which way of enhancing (upsampling) is right, but we should definitely be aware of this behavior.
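The counts in the example above can be reproduced with a small sketch. It assumes that, for each target, enhancement appends the original rows where that target is 1 another (n_times - 1) times. That assumption matches the numbers in the example but is not taken from the actual implementation, and the names enhance and count are hypothetical.

```python
def enhance(rows, targets, n_times):
    """Upsample each target column independently (hypothetical sketch).

    For every target, the ORIGINAL rows where that target is 1 are appended
    (n_times - 1) more times, without looking at the other labels.
    """
    result = list(rows)
    for target in targets:
        positives = [r for r in rows if r[target] == 1]
        result.extend(positives * (n_times - 1))
    return result

def count(rows, target):
    """Number of rows in which the given target label is 1."""
    return sum(r[target] for r in rows)

# The five rows from the example: 3 positives for A, 4 positives for B.
original = [
    {"A": 1, "B": 0},
    {"A": 1, "B": 1},
    {"A": 1, "B": 1},
    {"A": 0, "B": 1},
    {"A": 0, "B": 1},
]

after_a = enhance(original, ["A"], n_times=2)
after_ab = enhance(original, ["A", "B"], n_times=2)

print(count(original, "A"), count(original, "B"))   # 3 4
print(count(after_a, "A"), count(after_a, "B"))     # 6 6
print(count(after_ab, "A"), count(after_ab, "B"))   # 8 10
```

Rows with both A = 1 and B = 1 are duplicated once per target, which is why both labels end up above 2 times their original count.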

How