Changing the order of preprocessing steps
from: restrict train_size by sampling -> frontal or lateral restriction -> enhancement
to: frontal or lateral restriction -> enhancement -> restrict train_size by sampling
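Here is a minimal sketch of the reordered pipeline. The function name prepare_df, the view column, and the float-as-fraction handling of train_size are illustrative assumptions, not the actual codeset API:

```python
import pandas as pd

def prepare_df(df: pd.DataFrame, use_frontal: bool,
               enhance_cols, n_times: int, train_size) -> pd.DataFrame:
    # 1) frontal or lateral restriction
    if use_frontal:
        df = df[df["view"] == "frontal"]

    # 2) enhancement: upsample each target column independently
    extras = []
    for col in enhance_cols:
        extras += [df[df[col] == 1]] * (n_times - 1)
    df = pd.concat([df] + extras, ignore_index=True)

    # 3) restrict train_size by sampling, now applied to the final pool,
    #    so the returned length matches the request exactly
    if isinstance(train_size, float):  # assumed: a float means a fraction
        train_size = round(len(df) * train_size)
    return df.sample(n=train_size).reset_index(drop=True)
```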
Why
So far, the training data size restriction has been applied at an early stage of data preprocessing.
However, the current process can return fewer samples than the requested train_size, whether given as an integer count or a float fraction (thanks for noticing @seoulsky-field).
For example, if you set train_size to 100 and use_frontal to True, the codeset first samples 100 rows and then selects the frontal images among them.
Thus, it returns <= 100 images.
To avoid this, I checked which dataset options affect the number of samples and
found that use_frontal and enhancement (upsampling) can decrease or increase it.
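The problem is easy to reproduce with plain pandas (the view column and the 50/50 frontal split are made up for the illustration):

```python
import pandas as pd

# toy metadata: 200 rows, half frontal and half lateral
meta = pd.DataFrame({"view": ["frontal", "lateral"] * 100})

# current order: sample first, filter afterwards
sampled = meta.sample(n=100, random_state=0)
print(len(sampled[sampled["view"] == "frontal"]))  # ~50, not 100

# proposed order: filter first, sample afterwards
frontal = meta[meta["view"] == "frontal"]
print(len(frontal.sample(n=100, random_state=0)))  # exactly 100
```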
While analyzing the effects of these options,
I found that enhancement is quite complicated and may return a result far from what the user expects.
Currently, enhancement accepts multiple target columns and n_times (the upsampling factor).
Since the enhancement handles each target column independently (i.e., it does not consider the co-effect between columns),
it duplicates labels more than the given n_times due to the inherent trait of multi-label problems: a row that is positive for several target columns gets duplicated once per column.
Here is a simple example of the enhancement sequence in our codeset:
original (3A, 4B) -> enhancing 'A' 2-times (6A, 6B) -> enhancing 'B' 2-times (8A, 10B) <- more than 2-times for both 'A' and 'B'

```
original    after 'A' 2x    after 'B' 2x
A B         A B             A B
1 0         1 0             1 0
1 1         1 1             1 1
1 1         1 1             1 1
0 1         0 1             0 1
0 1         0 1             0 1
            1 0             1 0
            1 1             1 1
            1 1             1 1
                            1 1
                            1 1
                            0 1
                            0 1
```
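The counts in the table can be reproduced with a few lines of pandas. The enhance helper below is a simplified stand-in for the codeset's upsampling, written only to match the duplication pattern shown above:

```python
import pandas as pd

# original: 3 positives for 'A', 4 positives for 'B'
df = pd.DataFrame({"A": [1, 1, 1, 0, 0],
                   "B": [0, 1, 1, 1, 1]})

def enhance(df: pd.DataFrame, cols, n_times: int) -> pd.DataFrame:
    # Each target column is handled independently: rows positive for a
    # column are appended (n_times - 1) extra times. A row positive for
    # several columns is therefore appended once per column (the co-effect).
    extras = []
    for col in cols:
        extras += [df[df[col] == 1]] * (n_times - 1)
    return pd.concat([df] + extras, ignore_index=True)

out = enhance(df, ["A", "B"], n_times=2)
print(out["A"].sum(), out["B"].sum())  # 8 10, not the expected 6 and 8
```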
It is difficult to determine which way of enhancing (upsampling) is right, but we should definitely be aware of this behavior.
How
[x] Code change
[x] Test the length of the returned dataset (the length of self.df)
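The length test amounts to checking that, with the new order, sampling is exact. A self-contained sketch on toy metadata (column names are placeholders, not the actual codeset schema):

```python
import pandas as pd

def test_returned_length_matches_train_size():
    # toy metadata: alternating views plus one target label 'A'
    df = pd.DataFrame({"view": ["frontal", "lateral"] * 50,
                       "A": [1, 1, 0, 0] * 25})
    df = df[df["view"] == "frontal"]                           # 50 rows
    df = pd.concat([df, df[df["A"] == 1]], ignore_index=True)  # 'A' 2-times -> 75 rows
    df = df.sample(n=40, random_state=0)                       # train_size = 40
    assert len(df) == 40
```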