kyungheee / 2024-Samsung-AI-Challenge-Black-box-Optimization

2024 Samsung AI Challenge : Black-box Optimization

oversampling, undersampling. #29

Open rhycha opened 1 month ago

rhycha commented 1 month ago

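Strategy: undersample where records are dense and oversample where they are sparse. After that, undersample where y is low and oversample where y is high. Apply min-max scaling before doing any of this.

  1. Undersampling where records are dense or y is low, and oversampling where records are sparse or y is high, should let us match the test data distribution accurately.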
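
A minimal sketch of the density step only, assuming "dense" and "sparse" are judged by equal-width bins over the min-max scaled y. The helper names (`minmax_scale`, `rebalance_by_bins`), the bin count, and the per-bin target are hypothetical choices for illustration, not something fixed in this thread:

```python
import numpy as np

def minmax_scale(a):
    """Scale a 1-D array (or each column of a 2-D array) to [0, 1]."""
    a = np.asarray(a, dtype=float)
    lo, hi = a.min(axis=0), a.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant input maps to 0
    return (a - lo) / span

def rebalance_by_bins(y, n_bins=10, per_bin=None, seed=0):
    """Return row indices so every non-empty y-bin contributes `per_bin` rows.

    Crowded bins are under-sampled (without replacement); sparse bins are
    over-sampled (with replacement).
    """
    rng = np.random.default_rng(seed)
    y_scaled = minmax_scale(y)
    # Bin index in 0..n_bins-1; the maximum value falls into the last bin.
    bins = np.minimum((y_scaled * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    if per_bin is None:
        # Assumption: aim for the median size of the non-empty bins.
        per_bin = int(np.median(counts[counts > 0]))
    picked = []
    for b in np.flatnonzero(counts):
        members = np.flatnonzero(bins == b)
        replace = len(members) < per_bin  # over-sample only when too few rows
        picked.append(rng.choice(members, size=per_bin, replace=replace))
    return np.concatenate(picked)
```

Usage would look like `idx = rebalance_by_bins(y); X_res, y_res = X[idx], y[idx]`. The second step (favoring high y over low y) is not covered here.
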
rhycha commented 1 month ago

https://imbalanced-ensemble.readthedocs.io/en/latest/api/sampler/under-samplers.html

rhycha commented 1 month ago

Oversamplers. These terms refer to different techniques for handling imbalanced datasets through oversampling. Here’s a detailed explanation of each:

1. ADASYN (Adaptive Synthetic Sampling):

2. RandomOverSampler:

3. KMeansSMOTE:

4. SMOTE (Synthetic Minority Over-sampling Technique):

5. BorderlineSMOTE:

6. SVMSMOTE (Support Vector Machine SMOTE):

Summary:

Each technique has its strengths and is chosen based on the characteristics of the dataset and the nature of the classification problem.

rhycha commented 1 month ago

I guess ADASYN, RandomOverSampler, and SMOTE are enough, especially the interpolating ones.
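
The library samplers take class labels, so for a continuous y the interpolation idea has to be written by hand. Below is a rough sketch of SMOTE-style interpolation applied to both X and y. It is not the library's SMOTE or ADASYN; the function name `interpolate_oversample` and its parameters are made up for illustration, and interpolating y linearly along with X is an assumption:

```python
import numpy as np

def interpolate_oversample(X, y, n_new, k=5, seed=0):
    """Create `n_new` synthetic rows by interpolating between neighbours.

    For each synthetic row: pick a random base row, pick one of its k nearest
    neighbours (Euclidean distance in X), and take a random point on the
    segment between them. y is interpolated with the same mixing factor.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(X)
    k = min(k, n - 1)
    if k < 1:
        raise ValueError("need at least two rows to interpolate")
    # Pairwise squared distances; fine for small n, O(n^2) memory.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a row is never its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random(n_new)  # mixing factor in [0, 1)
    X_new = X[base] + lam[:, None] * (X[nb] - X[base])
    y_new = y[base] + lam * (y[nb] - y[base])
    return X_new, y_new
```

To oversample only the high-y region, one could pass just those rows, e.g. `interpolate_oversample(X[mask], y[mask], n_new=200)` with a hypothetical boolean `mask`.
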

rhycha commented 1 month ago

The imbens.sampler submodule provides various under-sampling techniques to handle imbalanced datasets. Under-sampling reduces the number of samples in the majority class to balance the class distribution. Below is an explanation of the different under-sampling samplers:

1. ClusterCentroids:

2. RandomUnderSampler:

3. InstanceHardnessThreshold:

4. NearMiss:

5. TomekLinks:

6. EditedNearestNeighbours:

7. RepeatedEditedNearestNeighbours:

8. AllKNN:

9. OneSidedSelection:

10. CondensedNearestNeighbour:

11. NeighbourhoodCleaningRule:

12. BalanceCascadeUnderSampler:

13. SelfPacedUnderSampler:

Summary:

Each method offers a different approach to handling imbalanced datasets, and the choice of method depends on the specific characteristics of your data and the goals of your machine learning task.

rhycha commented 1 month ago

These don't seem easy to use, since they usually need to know which class is the majority class.

rhycha commented 1 month ago

That said, we can't just blindly increase the number of clusters either. The majority/minority question needs more thought.
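
One possible way to get a majority/minority split without clusters is a percentile cut on y. The sketch below assumes that rows below the cut are the "majority" and rows at or above it are the "minority"; the function name `resample_by_y_cut` and the parameters `cut_pct`, `keep_low`, and `boost_high` are hypothetical:

```python
import numpy as np

def resample_by_y_cut(y, cut_pct=75, keep_low=0.5, boost_high=2.0, seed=0):
    """Return (row indices, threshold) after a percentile cut on y.

    Rows with y below the `cut_pct` percentile are randomly under-sampled to a
    `keep_low` fraction (0..1, without replacement); rows at or above it are
    over-sampled by a factor of `boost_high` (with replacement).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    threshold = np.percentile(y, cut_pct)
    low = np.flatnonzero(y < threshold)
    high = np.flatnonzero(y >= threshold)
    n_low = int(round(len(low) * keep_low))
    n_high = int(round(len(high) * boost_high))
    low_idx = rng.choice(low, size=n_low, replace=False)
    high_idx = rng.choice(high, size=n_high, replace=True)
    return np.concatenate([low_idx, high_idx]), threshold
```

The returned indices can be applied to both arrays, e.g. `idx, thr = resample_by_y_cut(y); X_res, y_res = X[idx], y[idx]`. A binary label such as `y >= thr` could also be handed to a class-based sampler.
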

rhycha commented 1 month ago

output_ycut

75 undersample, 25 oversample. It doesn't look any different, but

rhycha commented 1 month ago

output_y90cut

This is 90 oversample and 90 undersample.

rhycha commented 1 month ago

This looks very similar to the test distribution (though it seems like too much).

output_test