oversampling, undersampling.

rhycha commented 3 months ago

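Strategy: undersample where records are dense, and oversample where they are sparse. After that, undersample where y is low and oversample where y is high. Apply min-max scaling before doing this.

If we undersample regions that are dense or have low y, and oversample regions that are sparse or have high y, we should be able to reproduce the test data distribution accurately.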
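A minimal sketch of the y-based half of this strategy, using only the standard library. It assumes each row is an `(x, y)` pair; the names `cut`, `keep_frac`, and `n_copies` are hypothetical knobs introduced for illustration and are not from the thread.

```python
import random

def minmax_scale(values):
    """Scale values linearly to [0, 1]; a constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def resample_by_y(rows, cut=0.5, keep_frac=0.5, n_copies=2, seed=0):
    """Undersample low-y rows and oversample high-y rows.

    rows is a list of (x, y) pairs. cut, keep_frac and n_copies are
    hypothetical parameters chosen only for this illustration.
    """
    rng = random.Random(seed)
    scaled = minmax_scale([y for _, y in rows])
    low = [r for r, s in zip(rows, scaled) if s < cut]
    high = [r for r, s in zip(rows, scaled) if s >= cut]
    # Undersample: keep a random fraction of the low-y rows.
    kept_low = rng.sample(low, int(len(low) * keep_frac))
    # Oversample: repeat every high-y row n_copies times.
    return kept_low + high * n_copies

rows = [((i,), float(i)) for i in range(10)]
out = resample_by_y(rows)
print(len(out))  # 2 low-y rows kept + 5 high-y rows duplicated = 12
```

The density-based half (dense vs. sparse regions of x) would need a density estimate or clustering step on top of this, which is left out here.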

rhycha commented 3 months ago

https://imbalanced-ensemble.readthedocs.io/en/latest/api/sampler/under-samplers.html

rhycha commented 3 months ago

Oversamplers. The names below are different oversampling techniques for handling imbalanced datasets. Here is a detailed explanation of each:

1. ADASYN (Adaptive Synthetic Sampling):

Purpose: ADASYN is an adaptive extension of SMOTE that generates synthetic samples for the minority class. The key idea is to generate more synthetic data around the minority samples that are harder to learn.
How it Works: For each minority sample, ADASYN computes the fraction of its k nearest neighbors that belong to the majority class, normalizes these fractions into a density distribution, and generates proportionally more synthetic samples around minority points with many majority-class neighbors. This concentrates the new samples in difficult regions instead of oversampling every minority sample equally.
Parameters:
- sampling_strategy: Specifies the ratio or strategy for sampling.
- random_state: Controls the random seed for reproducibility.
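
The weighting step that makes ADASYN adaptive can be sketched as follows. This is a simplified standard-library illustration, not a library implementation; the function name `adasyn_allocation` and its arguments are made up for this example.

```python
import math

def adasyn_allocation(X, y, minority=1, k=3, n_new=10):
    """Split n_new synthetic samples across minority points.

    Each minority point gets a share proportional to the fraction of
    majority-class points among its k nearest neighbours.
    """
    ratios = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Distances from xi to every other sample, closest first.
        others = sorted(
            (math.dist(xi, xj), yj)
            for j, (xj, yj) in enumerate(zip(X, y))
            if j != i
        )
        ratios.append(sum(1 for _, yj in others[:k] if yj != minority) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(ratios)
    return [round(n_new * r / total) for r in ratios]

X = [(0.0,), (0.1,), (0.2,), (0.3,), (2.0,), (2.1,), (2.2,), (2.3,), (2.15,)]
y = [1, 1, 1, 1, 0, 0, 0, 0, 1]
print(adasyn_allocation(X, y))  # [0, 0, 0, 0, 10]
```

Only the minority point sitting inside the majority cluster receives synthetic samples here; the four minority points surrounded by their own class receive none.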

2. RandomOverSampler:

Purpose: This is the simplest oversampling method. It randomly duplicates samples from the minority class to balance the class distribution.
How it Works: The algorithm randomly picks samples from the minority class and adds them to the dataset until the desired class balance is achieved. This can be a quick way to handle imbalance but may lead to overfitting, as it doesn't introduce any new information (just duplicates existing samples).
Parameters:
- sampling_strategy: Defines the desired ratio of the classes after resampling.
- random_state: Controls the random seed for reproducibility.
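
As a rough standard-library sketch of the duplication idea (the function name `random_oversample` is hypothetical, and this is not the library's code):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Sampling with replacement: the same row may be copied several times.
        X_out.extend(rng.choices(pool, k=target - n))
        y_out.extend([label] * (target - n))
    return X_out, y_out

X = [(0,), (1,), (2,), (3,), (4,)]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 4 samples
```

Every added row is an exact copy of an existing one, which is why this method adds no new information.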

3. KMeansSMOTE:

Purpose: This technique combines clustering with SMOTE to generate synthetic samples in a more informed manner.
How it Works: First, the algorithm applies KMeans clustering to the whole dataset. It then keeps the clusters that contain a high proportion of minority samples and applies SMOTE within each of them, assigning more synthetic samples to clusters where the minority samples are sparse. This avoids generating samples in regions dominated by the majority class.
Parameters:
- sampling_strategy: Defines the desired sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.

4. SMOTE (Synthetic Minority Over-sampling Technique):

Purpose: SMOTE is a popular oversampling technique that generates synthetic samples for the minority class by interpolating between existing minority samples.
How it Works: SMOTE selects a sample from the minority class and finds its k-nearest neighbors. It then generates new samples by randomly interpolating between the selected sample and its neighbors, effectively creating new points along the line segments that connect them.
Parameters:
- sampling_strategy: Defines the sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.
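
The interpolation step can be sketched with the standard library alone. The function name `smote` and the toy data are assumptions made for this example; a real implementation also handles class labels and sampling ratios.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # The k nearest minority neighbours of the chosen base point.
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> nb
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for point in smote(square, n_new=5):
    print(point)
```

With the four corners of a unit square and k=2, each corner's nearest neighbours are its two adjacent corners, so every synthetic point lands on an edge of the square.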

5. BorderlineSMOTE:

Purpose: A variant of SMOTE that focuses on samples that lie near the decision boundary between classes, which are the most likely to be misclassified.
How it Works: BorderlineSMOTE identifies samples in the minority class that are close to the boundary with the majority class (i.e., "borderline" samples) and generates synthetic samples near these points. The idea is to make the decision boundary clearer by providing more data points in these critical regions.
Parameters:
- sampling_strategy: Defines the sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.
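
A sketch of how the "borderline" minority samples might be identified, following the usual rule that a sample is in danger when at least half, but not all, of its m nearest neighbours are majority samples. The function name and the parameter `m` are illustrative assumptions.

```python
import math

def borderline_minority(X, y, minority=1, m=3):
    """Return indices of minority samples near the class boundary."""
    danger = []
    for i, xi in enumerate(X):
        if y[i] != minority:
            continue
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        n_major = sum(1 for j in order[:m] if y[j] != minority)
        # All-majority neighbourhoods are treated as noise, not borderline.
        if m / 2 <= n_major < m:
            danger.append(i)
    return danger

X = [(0.0,), (0.05,), (0.5,), (0.65,), (0.8,), (0.9,), (1.0,), (1.1,)]
y = [1, 1, 1, 1, 0, 0, 0, 0]
print(borderline_minority(X, y))  # [2, 3]
```

Synthetic samples would then be interpolated only from the returned points, using the same interpolation as SMOTE.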

6. SVMSMOTE (Support Vector Machine SMOTE):

Purpose: Another variant of SMOTE, SVMSMOTE uses a support vector machine (SVM) to identify the support vectors (critical points that define the decision boundary) and generates synthetic samples in the region near these vectors.
How it Works: SVMSMOTE uses an SVM classifier to find the support vectors of the minority class and then generates synthetic samples near these vectors. This method focuses on the hardest-to-classify samples, enhancing the minority class representation in regions that are crucial for classification.
Parameters:
- sampling_strategy: Defines the sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.

Summary:

ADASYN: Focuses on regions where the minority class is underrepresented.
RandomOverSampler: Randomly duplicates minority samples to balance the dataset.
KMeansSMOTE: Combines clustering with SMOTE to generate synthetic samples in a more structured way.
SMOTE: Generates synthetic samples by interpolating between existing minority samples.
BorderlineSMOTE: Focuses on generating samples near the decision boundary.
SVMSMOTE: Uses SVM to find support vectors and generates samples near the decision boundary.

Each technique has its strengths and is chosen based on the characteristics of the dataset and the nature of the classification problem.

rhycha commented 3 months ago

I guess ADASYN, RandomOverSampler, and SMOTE are enough, especially the interpolating ones.

rhycha commented 3 months ago

The imbens.sampler submodule provides various under-sampling techniques to handle imbalanced datasets. Under-sampling reduces the number of samples in the majority class to balance the class distribution. Below is an explanation of the different under-sampling samplers:

1. ClusterCentroids:

Purpose: This method generates centroids for the majority class samples using clustering methods (e.g., KMeans) and then replaces the original samples with these centroids.
How it Works: It clusters the majority class data points and reduces the number of samples by keeping only the cluster centroids, effectively condensing the data while maintaining the overall distribution.
Use Case: Useful when you want to reduce the size of the majority class without losing its distribution characteristics.

2. RandomUnderSampler:

Purpose: The simplest form of under-sampling, where samples from the majority class are randomly removed until a desired class balance is achieved.
How it Works: This method randomly selects and removes samples from the majority class. It can be effective but may result in the loss of important data points.
Use Case: Suitable for a quick and straightforward reduction in class imbalance, but may lead to information loss if not used carefully.
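
A standard-library sketch of random under-sampling (the function name `random_undersample` is hypothetical, not the `imbens` API):

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sampling without replacement: each kept row is an original row.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

X = [(i,) for i in range(8)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # both classes now have 2 samples
```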

3. InstanceHardnessThreshold:

Purpose: This method under-samples the majority class based on the "instance hardness" of the samples.
How it Works: A classifier is trained with cross-validation, and each sample receives a predicted probability for its true class. A low probability means high instance hardness. The method removes the majority class samples with the lowest probabilities (the hardest ones, which are likely noisy or overlapping with the minority class) and retains the easier ones.
Use Case: Useful for removing majority class samples that overlap with the minority class or look like noise.

4. NearMiss:

Purpose: This method selects majority class samples that are closest to the minority class samples (based on different strategies).
How it Works: There are different versions of NearMiss:
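- NearMiss-1: Selects majority samples whose average distance to the k closest minority samples is the smallest (k is commonly 3).
- NearMiss-2: Selects majority samples whose average distance to the k farthest minority samples is the smallest.
- NearMiss-3: A two-step procedure. First, for each minority sample, its nearest majority neighbors are kept as candidates. Then, among these candidates, the majority samples with the largest average distance to their k closest minority samples are selected.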
Use Case: Effective when you want to focus on majority class samples that are near the decision boundary, helping to refine the model's ability to distinguish between classes.
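
NearMiss-1 can be sketched in a few lines of standard-library Python. The function name `nearmiss1` and the default of keeping as many majority samples as there are minority samples are assumptions for this example.

```python
import math

def nearmiss1(X, y, majority=0, minority=1, k=3, n_keep=None):
    """Return indices of majority samples kept by a NearMiss-1 style rule."""
    min_pts = [x for x, lab in zip(X, y) if lab == minority]
    maj_idx = [i for i, lab in enumerate(y) if lab == majority]
    if n_keep is None:
        n_keep = len(min_pts)

    def avg_dist(i):
        # Average distance to the k closest minority samples.
        closest = sorted(math.dist(X[i], p) for p in min_pts)[:k]
        return sum(closest) / len(closest)

    return sorted(sorted(maj_idx, key=avg_dist)[:n_keep])

X = [(0.0,), (0.1,), (0.2,), (0.5,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0]
print(nearmiss1(X, y))  # [3, 4, 5]
```

The three majority points nearest to the minority cluster are kept, and the three far ones are dropped.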

5. TomekLinks:

Purpose: This method removes samples that form "Tomek links," which are pairs of samples from different classes that are nearest neighbors to each other.
How it Works: Tomek links indicate that the samples are close to the decision boundary, and by removing the majority class sample in the pair, you effectively clean up the boundary, making it clearer.
Use Case: Useful for cleaning up the dataset by removing ambiguous samples that might confuse the classifier.
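
A brute-force sketch of Tomek link detection using only the standard library (function names are illustrative):

```python
import math

def nearest(X, i):
    """Index of the sample closest to X[i], excluding i itself."""
    return min(
        (j for j in range(len(X)) if j != i),
        key=lambda j: math.dist(X[i], X[j]),
    )

def tomek_links(X, y):
    """Return pairs (i, j) of opposite-class samples that are mutual nearest neighbours."""
    links = []
    for i in range(len(X)):
        j = nearest(X, i)
        if i < j and y[i] != y[j] and nearest(X, j) == i:
            links.append((i, j))
    return links

X = [(0.0,), (0.3,), (1.0,), (1.1,), (2.0,), (2.4,)]
y = [0, 0, 0, 1, 1, 1]
print(tomek_links(X, y))  # [(2, 3)]
```

If class 0 is the majority class here, under-sampling would drop sample 2 and keep sample 3.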

6. EditedNearestNeighbours:

Purpose: This method removes samples from the majority class that are misclassified by their k-nearest neighbors.
How it Works: After identifying the nearest neighbors for each sample, majority class samples that are incorrectly classified by their neighbors are removed. This helps in refining the decision boundary.
Use Case: Effective in reducing noise in the majority class, especially when dealing with complex decision boundaries.
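
A sketch of the editing rule, assuming a simple majority vote among the k nearest neighbours (library implementations may also offer a stricter rule that requires all neighbours to agree). The function name `enn_remove` is hypothetical.

```python
import math
from collections import Counter

def enn_remove(X, y, majority=0, k=3):
    """Return indices of majority samples outvoted by their k nearest neighbours."""
    removed = []
    for i in range(len(X)):
        if y[i] != majority:
            continue
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        vote = Counter(y[j] for j in order).most_common(1)[0][0]
        if vote != y[i]:
            removed.append(i)
    return removed

X = [(0.0,), (0.1,), (0.2,), (0.3,), (1.0,), (1.05,), (1.1,), (1.2,), (1.3,)]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1]
print(enn_remove(X, y))  # [5]
```

Only the majority sample sitting inside the minority cluster is flagged for removal.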

7. RepeatedEditedNearestNeighbours:

Purpose: A variation of Edited Nearest Neighbours that applies the editing process multiple times.
How it Works: This method repeatedly removes misclassified majority samples until no further samples can be removed. This iterative process ensures a cleaner dataset.
Use Case: Useful when a single pass of Edited Nearest Neighbours isn't sufficient to achieve the desired class balance or noise reduction.

8. AllKNN:

Purpose: A more aggressive under-sampling technique that applies the Edited Nearest Neighbours method repeatedly while increasing k (the number of neighbors).
How it Works: AllKNN runs Edited Nearest Neighbours several times, increasing k at each pass, and removes the majority class samples that disagree with their neighbors in any pass. Only the majority samples that survive every pass are retained.
Use Case: Ideal for ensuring that only the most informative and least noisy majority class samples are retained.

9. OneSidedSelection:

Purpose: This method combines Condensed Nearest Neighbour and Tomek Links to perform under-sampling.
How it Works: First, a 1-nearest-neighbor condensing step keeps all minority samples plus the majority samples needed to classify the data correctly, which discards redundant majority samples far from the boundary. Then, the remaining majority samples that take part in Tomek Links are removed to clean up noisy and borderline points.
Use Case: Effective for datasets with noisy majority class samples and unclear decision boundaries.

10. CondensedNearestNeighbour:

Purpose: This method aims to retain only a subset of the majority class that is sufficient to classify the data correctly.
How it Works: Condensed Nearest Neighbour starts with an initial subset of the data and iteratively adds samples that are misclassified by the current subset. The goal is to condense the majority class to the minimum number of samples needed to maintain classification accuracy.
Use Case: Useful when you want to drastically reduce the size of the majority class while maintaining classification performance.
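
A sketch of the condensing loop, assuming the store starts with all minority samples plus the first majority sample (implementations typically pick that seed at random). The function name `condensed_nn` is hypothetical.

```python
import math

def condensed_nn(X, y, minority=1):
    """Return indices kept by a condensed nearest neighbour style pass."""
    store = [i for i, lab in enumerate(y) if lab == minority]
    maj = [i for i, lab in enumerate(y) if lab != minority]
    store.append(maj[0])  # deterministic seed, for illustration only
    changed = True
    while changed:
        changed = False
        for i in maj:
            if i in store:
                continue
            # Classify sample i with 1-NN against the current store.
            nn = min(store, key=lambda j: math.dist(X[i], X[j]))
            if y[nn] != y[i]:
                store.append(i)  # misclassified, so it is needed
                changed = True
    return sorted(store)

X = [(0.0,), (0.1,), (0.2,), (0.9,), (1.5,), (1.6,)]
y = [0, 0, 0, 0, 1, 1]
print(condensed_nn(X, y))  # [0, 3, 4, 5]
```

Majority samples 1 and 2 are dropped because sample 0 already classifies them correctly, while sample 3 is kept because it lies closer to the minority cluster.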

11. NeighbourhoodCleaningRule:

Purpose: This method cleans the neighborhood of the decision boundary by removing noisy majority class samples, with the emphasis on cleaning the data rather than condensing it.
How it Works: NeighbourhoodCleaningRule first uses Edited Nearest Neighbours to find majority samples that disagree with their nearest neighbors. It then finds minority samples that are misclassified by their nearest neighbors and marks the majority class neighbors responsible. The union of both sets is removed; minority samples are kept.
Use Case: Suited to datasets with noisy or overlapping boundaries where the minority class must be preserved.

12. BalanceCascadeUnderSampler:

Purpose: This method is used in ensemble learning to perform sequential under-sampling.
How it Works: In each iteration, a model is trained on a balanced subset of the data (all minority samples plus a random subset of the majority class). Majority class samples that the current ensemble already classifies correctly are then dropped, so later iterations concentrate on the harder majority samples.
Use Case: Effective in ensemble methods, especially when sequentially reducing the size of the majority class.

13. SelfPacedUnderSampler:

Purpose: This method under-samples the majority class according to classification hardness, shifting its focus in a self-paced manner as the ensemble is trained.
How it Works: SelfPacedUnderSampler estimates how hard each majority class sample is for the current ensemble, groups the samples into bins by hardness, and draws a subset from the bins with weights controlled by a self-paced factor that changes across iterations.
Use Case: Suitable for situations where you want to progressively reduce the class imbalance while retaining the most informative samples.

Summary:

Clustering-based Methods (e.g., ClusterCentroids): Focus on reducing the majority class by condensing it into representative centroids.
Random Methods (e.g., RandomUnderSampler): Randomly remove samples from the majority class.
Nearest Neighbor Methods (e.g., EditedNearestNeighbours): Remove samples based on their neighbors to refine the decision boundary.
Noise and Boundary Cleaning Methods (e.g., TomekLinks, NeighbourhoodCleaningRule): Focus on removing ambiguous or noisy samples that may confuse the classifier.
Ensemble and Sequential Methods (e.g., BalanceCascadeUnderSampler): Use iterative or ensemble-based approaches to under-sample the majority class in a more structured way.

Each method offers a different approach to handling imbalanced datasets, and the choice of method depends on the specific characteristics of your data and the goals of your machine learning task.

rhycha commented 3 months ago

These don't seem easy to use here, since they usually need to know which class is the majority class.

rhycha commented 3 months ago

That said, we can't just blindly increase the number of clusters either. The majority/minority definition needs more thought.

rhycha commented 3 months ago

output_ycut

75 undersample, 25 oversample. It doesn't seem much different, but

rhycha commented 3 months ago

output_y90cut

this is 90 oversample and 90 undersample

rhycha commented 3 months ago

Seems very similar to the test distribution (but perhaps too much). output_test

kyungheee / 2024-Samsung-AI-Challenge-Black-box-Optimization