Closed by akatav 4 years ago
Yes, using UBL you can generate synthetic data in bins where the data distribution is scarce or even non-existent!
You will not be able to do this with the RandOverRegress or ImpSampRegress functions, because in these cases only replicas of existing cases are added to the new data set.
The SmoteRegress and GaussNoiseRegress functions will be able to do what you want!
I built an example to show you how it works.
Consider the Boston data set.
library(MASS)
data(Boston)
plot(sort(Boston$medv))
Let us remove all the cases whose medv value is between 35 and 40.
sp <- which(Boston$medv<40 & Boston$medv>35)
nBoston <- Boston[-sp,]
We just removed 17 values.
The medv variable now looks as follows:
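The removal step can be checked with a short sketch like the one below. It only uses MASS and base R; n_removed and n_in_gap are hypothetical names introduced for illustration.

```r
library(MASS)   # provides the Boston data set
data(Boston)

# Same filter as above: cases with medv strictly between 35 and 40
sp <- which(Boston$medv < 40 & Boston$medv > 35)
nBoston <- Boston[-sp, ]

# How many cases were dropped, and how many are left inside the gap
n_removed <- length(sp)
n_in_gap  <- sum(nBoston$medv > 35 & nBoston$medv < 40)
print(c(removed = n_removed, left_in_gap = n_in_gap))

# The sorted values should now jump over the 35-40 range
plot(sort(nBoston$medv), ylab = "medv")
abline(h = c(35, 40), lty = 2)   # dashed lines mark the emptied range
```

The dashed lines make the empty band easy to spot in the plot.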
To generate synthetic data in a specific range of your data set you need to define the relevance.
Defining the relevance function
The relevance varies between 0 and 1. You should assign a high relevance value to the important bins and a low relevance value to the unimportant bins.
In our example, because there are no cases between 35 and 40, this will be a highly relevant bin, while the other ranges will not be important.
Variable medv varies between 5 and 50; thus, I will assign a relevance of zero to the two extremes (5 and 50) and a relevance of 1 to the value of 37.5.
The relevance function (parameter rel in the UBL package), when provided by the user (which is what you want!), should be defined as a matrix where the first column has the variable values, the second column has the relevance values, and the third column has the derivative of the relevance function.
I will do the following:
for the value 5 of the medv variable I'll assign a relevance of zero and a derivative of zero; for the value 37.5 I'll assign a relevance of 1 and a derivative of zero; and for the value 50 I'll assign a relevance of zero and a derivative of zero.
library(UBL)
myrel <- matrix(c(5, 0, 0, 37.5, 1,0, 50, 0, 0),
byrow = TRUE, ncol = 3, nrow = 3)
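To get a feel for what these three control points imply in between, the sketch below interpolates them with base R's splinefunH. It assumes that the relevance between control points behaves like a cubic Hermite spline through the given values and derivatives; that is an assumption about UBL's internals, not something stated in this thread. rel_fun and medv_grid are hypothetical names.

```r
# Control points: medv value, relevance, derivative of the relevance
myrel <- matrix(c(5,    0, 0,
                  37.5, 1, 0,
                  50,   0, 0),
                byrow = TRUE, ncol = 3)

# Assumption: a cubic Hermite spline through the control points
# approximates the relevance curve built from this matrix
rel_fun <- splinefunH(x = myrel[, 1], y = myrel[, 2], m = myrel[, 3])

# Draw the approximate relevance over the range of medv
medv_grid <- seq(5, 50, by = 0.5)
plot(medv_grid, rel_fun(medv_grid), type = "l",
     xlab = "medv", ylab = "relevance (approximation)")

# Relevance at the control points and halfway up the left slope
print(rel_fun(c(5, 21.25, 37.5, 50)))
```

Under this assumption the curve rises smoothly from 0 at 5 to 1 at 37.5 and falls back to 0 at 50, so values well outside 35-40 also receive a non-trivial relevance.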
Define the over/under sampling percentages
Now, I've built 3 bins for the medv variable. Because I don't want to do anything on the first and the last bin (neither under- nor over-sampling), I'll define a sampling percentage of 1 for them, which means "leave the bin unchanged". For the bin in the middle, I'll assign a sampling percentage above 1, which means that I want to oversample that bin.
To do this I'll define the C.perc parameter as follows:
myCperc <- list(1,5,1)
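To see roughly which cases each of the three percentages applies to, the sketch below estimates where the relevance crosses a threshold and counts the cases on each side. It assumes that bins are split where the relevance crosses thr.rel (0.5 is the documented default of SmoteRegress) and it reuses a Hermite-spline approximation of the relevance, so the boundaries are approximate. thr, lower, upper and bin are hypothetical names.

```r
library(MASS)
data(Boston)
nBoston <- Boston[-which(Boston$medv < 40 & Boston$medv > 35), ]

myrel <- matrix(c(5, 0, 0, 37.5, 1, 0, 50, 0, 0), byrow = TRUE, ncol = 3)

# Assumption: Hermite spline as a stand-in for the relevance function
rel_fun <- splinefunH(x = myrel[, 1], y = myrel[, 2], m = myrel[, 3])

thr <- 0.5   # relevance threshold (default thr.rel of SmoteRegress)

# Values of medv where the approximate relevance crosses the threshold
lower <- uniroot(function(v) rel_fun(v) - thr, c(5, 37.5), tol = 1e-9)$root
upper <- uniroot(function(v) rel_fun(v) - thr, c(37.5, 50), tol = 1e-9)$root
print(c(lower = lower, upper = upper))

# Count the cases falling in each of the three resulting bins
bin <- cut(nBoston$medv, breaks = c(-Inf, lower, upper, Inf),
           labels = c("low (C.perc 1)", "middle (C.perc 5)", "high (C.perc 1)"))
print(table(bin))
```

If this assumption holds, the middle bin is much wider than 35-40, so a higher thr.rel (for example 0.8, as in the GaussNoiseRegress call) narrows the range that gets oversampled.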
Now, you just need to use the SmoteRegress function with these parameters:
smBoston <- SmoteRegress(medv~., nBoston,
rel = myrel,
C.perc = myCperc)
This is the resulting medv:
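A quick way to check the result is to count how many cases now fall in the range that was emptied and to plot the sorted values again. This is a minimal sketch that repeats the steps above in one place; the seed is an arbitrary choice added only so that repeated runs give the same synthetic cases, and n_after is a hypothetical name.

```r
library(MASS)
library(UBL)
data(Boston)

# Rebuild the data set with the 35-40 range removed
sp <- which(Boston$medv < 40 & Boston$medv > 35)
nBoston <- Boston[-sp, ]

myrel <- matrix(c(5, 0, 0, 37.5, 1, 0, 50, 0, 0), byrow = TRUE, ncol = 3)
myCperc <- list(1, 5, 1)

set.seed(42)   # arbitrary seed, only for repeatability of this sketch
smBoston <- SmoteRegress(medv ~ ., nBoston, rel = myrel, C.perc = myCperc)

# Size of the data set before and after, and cases now inside 35-40
n_after <- sum(smBoston$medv > 35 & smBoston$medv < 40)
print(c(rows_before = nrow(nBoston), rows_after = nrow(smBoston),
        in_gap_after = n_after))

plot(sort(smBoston$medv), ylab = "medv")
```

Because the new cases are synthetic, the exact counts depend on the seed.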
If you have some scarce data in a bin
If you don't have any data in the bin, it will be more difficult to use the GaussNoiseRegress function, because this function adds new cases by introducing small perturbations into existing cases. Thus, you will be able to generate new cases near the border of the bin, but generating them in the middle of the bin will be more difficult and riskier, because you will need to increase the perturbation allowed in the cases.
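The effect of the perturbation size can be illustrated with base R alone. The sketch below is not UBL code: it assumes the noise is Gaussian with a standard deviation equal to pert times the standard deviation of the variable, which is how the pert parameter is described in the UBL documentation, and it perturbs only medv for a hypothetical border case at 35.

```r
library(MASS)
data(Boston)

set.seed(1)                    # arbitrary seed for repeatability
border_medv <- 35              # hypothetical case sitting on the border of the bin
medv_sd <- sd(Boston$medv)

# Share of perturbed values landing near the middle of the bin (37-38)
# for increasing perturbation levels
perts <- c(0.1, 0.3, 0.5)
shares <- sapply(perts, function(p) {
  new_medv <- border_medv + rnorm(10000, mean = 0, sd = p * medv_sd)
  mean(new_medv > 37 & new_medv < 38)
})
names(shares) <- paste0("pert=", perts)
print(shares)
```

With a small perturbation almost no synthetic value reaches the middle of the bin. Larger perturbations reach it more often, but the same noise is applied to the predictors too, which is why this gets riskier.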
If you do have some cases but they are rare, then the GaussNoiseRegress function will be a good option too. See below an example of how to use it.
# define a new data set with only 3 cases with a medv value between 35 and 40
nsp <- sample(sp, 14)
n2Boston <- Boston[-nsp,]
# define the relevance and the percentages of over/under sampling
myrel <- matrix(c(5, 0, 0, 37.5, 1,0, 50, 0,0), byrow = TRUE, ncol = 3, nrow = 3)
myCperc <- list(1,5,1)
# generate the data set with the synthetic cases
gnBoston <- GaussNoiseRegress(medv~., n2Boston,
rel = myrel, thr.rel = 0.8, pert=0.3,
C.perc = myCperc)
The result on the medv variable is:
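As with SmoteRegress, the outcome can be checked by counting the cases inside the 35-40 range before and after. This is a minimal sketch that repeats the steps above; the seed is an arbitrary choice for repeatability, and in_gap is a hypothetical helper.

```r
library(MASS)
library(UBL)
data(Boston)

sp <- which(Boston$medv < 40 & Boston$medv > 35)

set.seed(42)                 # arbitrary seed, only for repeatability
nsp <- sample(sp, 14)        # drop 14 of the cases, keeping a few rare ones
n2Boston <- Boston[-nsp, ]

myrel <- matrix(c(5, 0, 0, 37.5, 1, 0, 50, 0, 0), byrow = TRUE, ncol = 3)
myCperc <- list(1, 5, 1)

gnBoston <- GaussNoiseRegress(medv ~ ., n2Boston,
                              rel = myrel, thr.rel = 0.8, pert = 0.3,
                              C.perc = myCperc)

# Hypothetical helper: number of cases with medv strictly inside 35-40
in_gap <- function(d) sum(d$medv > 35 & d$medv < 40)
print(c(in_gap_before = in_gap(n2Boston), in_gap_after = in_gap(gnBoston)))

plot(sort(gnBoston$medv), ylab = "medv")
```

The counts vary with the seed and with pert, so it is worth trying a few values of pert and comparing the plots.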
Thank you for your interest in UBL!
Hi Paula, thank you so much for the detailed reply! I really appreciate it.
I have tried the approach above (with both SmoteRegress and GaussNoiseRegress) on my dataset, and I am pleasantly surprised at how much less sparse the plot looks now. Thank you!
Hello Paula, thanks for this very helpful piece of software. I have been looking for quite some time for software to generate synthetic data for regression problems. I have one question. Suppose I have a highly imbalanced, high-dimensional data set and I am trying to predict a continuous value. How can I achieve this with UBL? More details here in my question.
I see that using the R functions for Gaussian noise, random over-sampling and under-sampling, for example:
mygn.alg=GaussNoiseRegress(clean.algae$a7~., clean.algae, C.perc="balance")
generates a subset of the original data whose values are the same as in the original data sample. I was hoping that the functions would generate synthetic data in bins where the data is scarce or even absent. Is this achievable using UBL? Can you please give an example?
Thank you.