Open rhycha opened 3 months ago
oversampler. These terms refer to different techniques for handling imbalanced datasets, particularly through oversampling methods. Here’s a detailed explanation of each:
- sampling_strategy: Specifies the ratio or strategy for sampling.
- random_state: Controls the random seed for reproducibility.

- sampling_strategy: Defines the desired ratio of the classes after resampling.
- random_state: Controls the random seed for reproducibility.

- sampling_strategy: Defines the desired sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.

- sampling_strategy: Defines the sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.

- sampling_strategy: Defines the sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.

- sampling_strategy: Defines the sampling strategy.
- random_state: Controls the random seed for reproducibility.
- k_neighbors: Number of nearest neighbors to use for generating synthetic samples.

Each technique has its strengths and is chosen based on the characteristics of the dataset and the nature of the classification problem.
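To make the sampling_strategy and random_state parameters above concrete, here is a minimal pure-Python sketch of random over-sampling for two classes. It is not the library's implementation: the function name `random_oversample` is hypothetical, and it assumes a float sampling_strategy means the desired minority-to-majority ratio after resampling.

```python
import random
from collections import Counter


def random_oversample(X, y, sampling_strategy=1.0, random_state=None):
    """Duplicate random minority rows until n_minority / n_majority reaches sampling_strategy.

    Hypothetical helper for illustration; handles exactly two classes.
    """
    rng = random.Random(random_state)  # fixed seed -> reproducible draws
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("this sketch handles exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)

    # Assumption: sampling_strategy is the minority/majority ratio after resampling.
    target_minor = int(sampling_strategy * n_major)
    minority_rows = [row for row, label in zip(X, y) if label == minority]

    # Draw existing minority rows with replacement; nothing new is synthesized.
    n_extra = max(0, target_minor - n_minor)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return list(X) + extra, list(y) + [minority] * n_extra
```

Calling it twice with the same random_state gives the same duplicated rows, which is the whole point of that parameter.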
I guess ADASYN, RandomOverSampler, and SMOTE are enough, especially the interpolating ones.
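For reference, the "interpolating" idea can be sketched in a few lines of plain Python. This is a simplified SMOTE-style generator, not the library code: the name `smote_like` is hypothetical, it only looks at minority samples, and it uses Euclidean distance.

```python
import math
import random


def smote_like(minority, n_new, k_neighbors=5, random_state=None):
    """Create n_new synthetic points between minority samples and their neighbours.

    Hypothetical helper for illustration; `minority` is a list of numeric feature lists.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(random_state)
    k = min(k_neighbors, len(minority) - 1)  # cannot have more neighbours than other points

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding the point itself
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # New point lies on the segment between base and the chosen neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Unlike random over-sampling, every returned point is new, but it always stays on a line segment between two existing minority samples.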
The imbens.sampler submodule provides various under-sampling techniques to handle imbalanced datasets. Under-sampling reduces the number of samples in the majority class to balance the class distribution. Below is an explanation of the different under-sampling samplers:

k (number of neighbors).

k, AllKNN ensures that only the most robust samples from the majority class are retained. It combines the results of different Edited Nearest Neighbours passes.

Each method offers a different approach to handling imbalanced datasets, and the choice of method depends on the specific characteristics of your data and the goals of your machine learning task.
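As a rough illustration of the Edited Nearest Neighbours and AllKNN idea mentioned above, here is a pure-Python sketch. It is an assumption-laden simplification, not the imbens code: the names `enn_pass` and `all_knn` are hypothetical, the majority label must be passed in explicitly, and a majority sample is kept only when at least half of its k nearest neighbours share its label.

```python
import math
from collections import Counter


def enn_pass(X, y, majority_label, k):
    """One Edited Nearest Neighbours pass that only ever removes majority samples."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority samples are never removed
            continue
        # Indices of the k nearest other samples (Euclidean distance)
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbours)
        # Assumed rule: keep if at least half of the neighbours agree with the label
        if votes[majority_label] * 2 >= len(neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def all_knn(X, y, majority_label, max_k=3):
    """Repeat the ENN pass with k = 1..max_k, each pass working on the cleaned data."""
    for k in range(1, max_k + 1):
        X, y = enn_pass(X, y, majority_label, k)
    return X, y
```

Note that the sketch needs `majority_label` as an input, which is exactly the practical difficulty with these samplers.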
These don't seem easy to use, since they usually need to know which class is the majority class.
Still, I can't just blindly increase the number of clusters; majority vs. minority needs more thought.
With 75 undersample and 25 oversample, there seems to be no difference, but
this is 90 oversample and 90 undersample:
it looks very similar to the test distribution (but it seems like too much).
Strategy: undersample where there are many records, oversample where there are few. After that, undersample where the y value is low and oversample where it is high. Apply min-max scaling before doing this.
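A rough sketch of the first part of this strategy (min-max scaling, then under-sampling crowded regions and over-sampling sparse ones) could look like the following. Everything here is an assumption for illustration: the names `min_max` and `rebalance_by_bins` are hypothetical, y is treated as a continuous value split into equal-width bins, and the default target is the median bin size. The second step (weighting by low vs. high y) is not covered.

```python
import random
from collections import defaultdict


def min_max(values):
    """Scale values linearly into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rebalance_by_bins(rows, y, n_bins=4, target=None, random_state=None):
    """Resample so every non-empty y-bin ends up with `target` records."""
    rng = random.Random(random_state)
    bins = defaultdict(list)
    for idx, s in enumerate(min_max(y)):
        # Equal-width bins on the scaled value; s == 1.0 goes into the last bin
        bins[min(int(s * n_bins), n_bins - 1)].append(idx)

    if target is None:
        sizes = sorted(len(members) for members in bins.values())
        target = sizes[len(sizes) // 2]  # assumed default: median bin size

    chosen = []
    for b in sorted(bins):
        members = bins[b]
        if len(members) > target:
            chosen.extend(rng.sample(members, target))  # under-sample crowded bin
        else:
            chosen.extend(members)
            # over-sample sparse bin by drawing with replacement
            chosen.extend(rng.choices(members, k=target - len(members)))
    return [rows[i] for i in chosen], [y[i] for i in chosen]
```

Setting `target` explicitly (for example to a percentile of the bin sizes) would be one way to try different under/over-sampling levels.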