paobranco / UBL

An R package for utility-based learning

Generating synthetic data for high dimensional dataset #8

Closed akatav closed 4 years ago

akatav commented 4 years ago

Hello Paula, thanks for this very helpful piece of software. I have been looking for quite some time for software to generate synthetic data for regression problems. I have one question. Suppose I have a highly imbalanced, high-dimensional data set and I am trying to predict a continuous value: how can I achieve this with UBL? More details are here in my question.

I see that calling functions such as GaussNoiseRegress, RandOverRegress and RandUnderRegress, for example:

mygn.alg=GaussNoiseRegress(clean.algae$a7~., clean.algae, C.perc="balance")

generates a subset of the original data whose values are the same as in the original sample. I was hoping that the functions would generate synthetic data in bins where the data is sparse or even absent. Is this achievable using UBL? Can you please give an example?

Thank you.

paobranco commented 4 years ago

Yes, using UBL you can generate synthetic data in bins where the data distribution is scarce or even non-existent!

You will not be able to do this with the RandOverRegress or ImpSampRegress functions, because in these cases only replicas of existing cases are added to the new data set.

The functions SmoteRegress and GaussNoiseRegress will be able to do what you want! I built an example to show you how it works.

Consider the Boston data set.

library(MASS)
data(Boston)
plot(sort(Boston$medv))

[Figure Rplot0: sorted medv values of the Boston data set]

Let us remove all the cases whose medv value is strictly between 35 and 40.

sp <- which(Boston$medv<40 & Boston$medv>35)
nBoston <- Boston[-sp,]
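A quick sanity check of this step, as a minimal sketch using only MASS and base R. It counts the removed cases and confirms that no medv value remains strictly between 35 and 40. The names removed_n and gap_left are hypothetical, introduced only for illustration.

```r
library(MASS)
data(Boston)

# indices of the cases with medv strictly between 35 and 40
sp <- which(Boston$medv < 40 & Boston$medv > 35)
nBoston <- Boston[-sp, ]

# hypothetical helper variables, for illustration only
removed_n <- nrow(Boston) - nrow(nBoston)                 # cases dropped
gap_left  <- sum(nBoston$medv > 35 & nBoston$medv < 40)   # cases still in the gap

cat("cases removed:", removed_n, "\n")
cat("cases left between 35 and 40:", gap_left, "\n")
```

Note that cases with medv exactly equal to 35 or 40 are kept, because both comparisons are strict.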

We just removed 17 values. The medv variable now looks as follows:

[Figure Rplot1: sorted medv values after removing the cases between 35 and 40]

To generate synthetic data in a specific range of your data set, you need to define the relevance.

Defining the relevance function

The relevance varies between 0 and 1. You should assign a high relevance value to the important bins and a low relevance value to the unimportant bins.

In our example, because there are no cases between 35 and 40, this will be a highly relevant bin, while the other ranges will not be important. The medv variable ranges from 5 to 50, so I will assign a relevance of zero to the two extremes (5 and 50) and a relevance of 1 to the value 37.5, the midpoint of the empty range.

The relevance function (parameter rel in the UBL package), when provided by the user (which is what you want!), should be defined as a matrix with one row per control point, where the first column has the variable values, the second column has the relevance values and the third column has the derivative of the relevance function. I will do the following: for the value 5 of medv I'll assign a relevance of zero and a derivative of zero; for the value 37.5 I'll assign a relevance of 1 and a derivative of zero; and for the value 50 I'll assign a relevance of zero and a derivative of zero.

library(UBL)
myrel <- matrix(c(5, 0, 0, 37.5, 1,0, 50, 0, 0),
         byrow = TRUE, ncol = 3, nrow = 3)
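To get a feel for the shape these three control points describe, here is a minimal base-R sketch. It does not call UBL: it assumes the relevance between control points can be approximated by a cubic Hermite spline through the given values and derivatives (stats::splinefunH), which may differ from the interpolation UBL performs internally. The column names and rel_sketch are hypothetical names used only here.

```r
# control points: medv value, relevance, derivative (same layout as myrel)
myrel <- matrix(c(5, 0, 0, 37.5, 1, 0, 50, 0, 0),
                byrow = TRUE, ncol = 3, nrow = 3)
colnames(myrel) <- c("value", "relevance", "derivative")  # illustrative names only

# ASSUMPTION: a cubic Hermite spline stands in for UBL's own interpolation
rel_sketch <- splinefunH(myrel[, "value"],
                         myrel[, "relevance"],
                         myrel[, "derivative"])

# tabulate the approximate relevance over the range of medv
grid <- seq(5, 50, by = 2.5)
print(round(data.frame(medv = grid, relevance = rel_sketch(grid)), 3))
```

The curve rises smoothly from 0 at medv = 5 to 1 at medv = 37.5 and falls back to 0 at medv = 50. Calling plot(grid, rel_sketch(grid), type = "l") draws it.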

Define the over/under sampling percentages

Now, I've built 3 bins for the medv variable. Because I don't want to do anything on the first and the last bin (neither under- nor over-sampling), I'll give them a sampling percentage of 1, which means "leave the bin as it is". For the bin in the middle, I'll assign a sampling percentage above 1, which means that I want to oversample that bin. To do this I'll define the C.perc parameter as follows:

myCperc <- list(1,5,1)
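To choose the length of C.perc it helps to know how many bins the relevance function produces. The sketch below is an approximation under two assumptions that this thread does not confirm: bins are taken to be maximal runs of sorted medv values lying on the same side of a relevance threshold (0.5 here, assumed to match the default thr.rel), and the relevance curve is approximated with a base-R Hermite spline rather than UBL's own interpolation. The names rel_sketch, thr and bins are hypothetical.

```r
library(MASS)
data(Boston)
nBoston <- Boston[-which(Boston$medv < 40 & Boston$medv > 35), ]

myrel <- matrix(c(5, 0, 0, 37.5, 1, 0, 50, 0, 0), byrow = TRUE, ncol = 3)

# ASSUMPTION: Hermite spline as a stand-in for UBL's relevance interpolation
rel_sketch <- splinefunH(myrel[, 1], myrel[, 2], myrel[, 3])
thr <- 0.5   # ASSUMPTION: threshold separating relevant from non-relevant values

# consecutive runs of sorted target values above/below the threshold
y <- sort(nBoston$medv)
runs <- rle(rel_sketch(y) >= thr)

bins <- data.frame(relevant = runs$values, n_cases = runs$lengths)
print(bins)

# one percentage per bin, in increasing order of medv
myCperc <- list(1, 5, 1)
stopifnot(length(myCperc) == nrow(bins))
```

Under these assumptions the middle bin is wider than the empty 35 to 40 range. A higher relevance threshold, such as the thr.rel = 0.8 used with GaussNoiseRegress, would narrow it.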

Now, you just need to use the SmoteRegress function with these parameters:

smBoston <- SmoteRegress(medv~., nBoston,
                              rel = myrel,
                              C.perc = myCperc)

This is the resulting medv:

[Figure Rplot2: sorted medv values after applying SmoteRegress]
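A numeric check can complement the plot. The sketch below repeats the steps above in one self-contained script and counts how many cases fall strictly between 35 and 40 before and after oversampling, so you can judge the effect on your own run. The helper count_in_range and the seed are my additions, not part of UBL, and the UBL call is skipped if the package is not installed.

```r
library(MASS)
data(Boston)
nBoston <- Boston[-which(Boston$medv < 40 & Boston$medv > 35), ]

myrel <- matrix(c(5, 0, 0, 37.5, 1, 0, 50, 0, 0), byrow = TRUE, ncol = 3)
myCperc <- list(1, 5, 1)

# hypothetical helper: number of cases with medv strictly between lo and hi
count_in_range <- function(d, lo = 35, hi = 40) sum(d$medv > lo & d$medv < hi)

if (requireNamespace("UBL", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed, only to make the run repeatable
  smBoston <- UBL::SmoteRegress(medv ~ ., nBoston,
                                rel = myrel, C.perc = myCperc)
  cat("rows before/after:", nrow(nBoston), nrow(smBoston), "\n")
  cat("cases in (35, 40) before/after:",
      count_in_range(nBoston), count_in_range(smBoston), "\n")
}
```

The synthetic cases are random, so the counts change from run to run unless a seed is set.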

If you have some scarce data in a bin

If you don't have any data in the bin, it will be more difficult to use the GaussNoiseRegress function, because this function adds new cases by introducing small perturbations into existing cases. Thus, you will be able to generate new cases near the border of the bin, but reaching the middle of the bin will be more difficult and more risky, because you will need to increase the perturbation allowed in the cases.
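To make the point about perturbations concrete, here is a base-R sketch. It is not UBL's implementation: it simply adds zero-mean Gaussian noise to the medv values sitting just below the empty range, with a standard deviation of pert times the standard deviation of medv (that scaling is my assumption about what a perturbation level means), and reports the share of perturbed values that reach the middle of the range. The helper reach_middle and the cut-offs 33, 36.5 and 38.5 are hypothetical choices made for illustration.

```r
library(MASS)
data(Boston)
nBoston <- Boston[-which(Boston$medv < 40 & Boston$medv > 35), ]

# cases sitting just below the empty 35-40 range
border <- nBoston$medv[nBoston$medv >= 33 & nBoston$medv <= 35]

# hypothetical helper: share of perturbed values landing in the middle (36.5-38.5)
# ASSUMPTION: noise sd = pert * sd(medv); illustration only, not UBL's code
reach_middle <- function(pert, n_rep = 2000) {
  noisy <- rep(border, n_rep) +
    rnorm(length(border) * n_rep, mean = 0, sd = pert * sd(nBoston$medv))
  mean(noisy > 36.5 & noisy < 38.5)
}

set.seed(42)  # arbitrary seed, only to make the run repeatable
for (p in c(0.05, 0.1, 0.3)) {
  cat("pert =", p, " share reaching the middle of the gap:",
      round(reach_middle(p), 3), "\n")
}
```

With a small perturbation almost no synthetic value gets far from the border. A larger one reaches the middle, at the cost of synthetic cases that sit further from any real case.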

If you do have some cases but they are rare, then the GaussNoiseRegress function will be a good option too. See below an example of how to use it.

# define a new data set with only 3 cases with a medv value between 35 and 40 
nsp <- sample(sp, 14)
n2Boston <- Boston[-nsp,]
# define the relevance and the percentages of over/under sampling
myrel <- matrix(c(5, 0, 0, 37.5, 1,0, 50, 0,0), byrow = TRUE, ncol = 3, nrow = 3)
myCperc <- list(1,5,1)
# generate the data set with the synthetic cases
gnBoston <- GaussNoiseRegress(medv~., n2Boston,
                              rel = myrel, thr.rel = 0.8, pert=0.3,
                              C.perc = myCperc)

The result on the medv variable is:

[Figure Rplot3: sorted medv values after applying GaussNoiseRegress]

Thank you for your interest in UBL!

akatav commented 4 years ago

Hi Paula, thank you so much for the detailed reply! I really appreciate it.

I have tried the approach above (with both SmoteRegress and GaussNoiseRegress) on my dataset, and I am surprised at how much less sparse the plot looks now. Thank you!