paobranco / UBL

An R package for utility-based learning

SmoteRegress "All the points have relevance 0" #6

Closed zym604 closed 5 years ago

zym604 commented 5 years ago

The example code doesn't work if you change the target from Sepal.Width to any other variable. For example, changing it to Sepal.Length:

smoteBalan.iris <- SmoteRegress(Sepal.Length~., ir, dist = "HEOM",
                                C.perc = "balance")

It returns:

>   smoteBalan.iris <- SmoteRegress(Sepal.Length~., ir, dist = "HEOM",
+                                 C.perc = "balance")
Error in SmoteRegress(Sepal.Length ~ ., ir, dist = "HEOM", C.perc = "balance") : 
  All the points have relevance 0. 
         Please, redefine your relevance function!
Execution halted

I think it may be a problem with the probability distribution, because Sepal.Length is the only variable that looks continuous over the whole domain: [images: four plots]

But I'm not sure what exactly happens. Could you give me a hint?

paobranco commented 5 years ago

Hi,

The problem is related to the relevance function. The resampling methods implemented in UBL use a relevance function that tells them which values of the target variable are important and which are unimportant. This lets the resampling methods know where to apply oversampling and where to apply undersampling.

The relevance function assigns a higher score (near 1) to the most important values and a lower score (near zero) to the less interesting values of the target variable.

In the UBL package, the default is an automatic method for estimating the relevance function. This automatic method makes assumptions regarding, for instance, the location of the important values (they are expected to be in the extremes of the target variable distribution).

However, the automatic method sometimes yields a constant relevance, i.e., it assigns the same relevance/importance to all values. In your case, the automatic method assigned a relevance of zero to all values.

The resampling methods need to differentiate between important and unimportant cases, and this is impossible with a constant relevance. This is why the message also asks you to redefine the relevance function.

There are alternatives to the automatic method in the UBL package. For instance, you can provide your own relevance function through a 3-column matrix. The example code that you mention also includes an example with a user-defined relevance function.

Essentially, it works as follows:

For example, the row (2,1,0) means that a target variable value of 2 is assigned a relevance score of 1, and the derivative of the relevance function at that point is zero.
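A minimal sketch of such a matrix in base R may make the layout concrete. Only the row (2,1,0) comes from the explanation above; the other two rows and the target range [2, 8] are hypothetical values chosen for illustration.

```r
# Each row is one control point: (target value, relevance, derivative).
# Only the first row is taken from the discussion; the others are made up.
rel <- matrix(
  c(2, 1, 0,   # y = 2: relevance 1, derivative 0
    5, 0, 0,   # y = 5: relevance 0, derivative 0 (hypothetical)
    8, 1, 0),  # y = 8: relevance 1, derivative 0 (hypothetical)
  ncol = 3, byrow = TRUE
)
print(rel)
```

With these points, values near 2 and 8 would be treated as important and values near 5 as unimportant, so the relevance is no longer constant.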

You can find more about the relevance function here, here, here and here.

Let me know if you have any doubts. Thank you for your interest in the UBL package!

zym604 commented 5 years ago

Thank you @paobranco !

I have read the PhD thesis of Dr. Rita P. Ribeiro, but didn't understand how to implement my own relevance function. Your reply helps a lot!

But I still have some questions:

  1. Does the current SMOTER only work when the target variable is nearly normally distributed? I ask because of the following statement from Dr. Ribeiro's thesis: [image: excerpt from the thesis] So maybe that's the reason why Sepal.Width is the only target variable that works, as the other variables are not normally distributed. Am I right?

  2. What is the mathematical form of the automatic function? A paper states that the function is inversely proportional to the target variable pdf. [image: excerpt from the paper] But I think the real function must be a little different, right? Otherwise, it couldn't explain the "all zero" error. I tried to find the function in the code, but it seems to be written in a Fortran object file, which I couldn't open. So the only thing I can do is ask you directly for the mathematical form. [image]

  3. Does the current SMOTER not support the Windows platform? I tried to install UBL on Windows, but it failed because of some gcc issue.

Anyway, thank you and your group for making such a contribution!

IrisOren commented 5 years ago

Hi, I was having gcc error issues installing SMOTER on Windows.

I finally got it installed by first installing Rtools on C: (https://cran.r-project.org/bin/windows/Rtools/), and then within R, install.packages("devtools"). After that, the smoteR could be built and installed.

Maybe this will help you?

rpribeiro commented 5 years ago

First of all, thank you for your interest in UBL and in the relevance function phi, in particular.

The automatic method for obtaining the relevance is based on the assumption that the most important values are the ones considered outliers by the boxplot. It is on this premise that we say that "the function is inversely proportional to the target variable pdf". This holds when our focus is on rare extreme values and such values exist. The method thus assumes a normal distribution for the continuous variable and uses the following points to interpolate the relevance function: the median has relevance 0, and the upper and lower whiskers have relevance 1, provided that there are outliers.
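The idea can be sketched in base R as follows. This is only an illustration of the description above, not UBL's internal code: the helper `auto_control_points` is hypothetical, and treating the lower and upper whiskers independently is an assumption.

```r
# Hypothetical helper: derive control points from the boxplot statistics.
# A whisker gets relevance 1 only if there are outliers beyond it
# (per-side handling is an assumption, not confirmed by the package).
auto_control_points <- function(y) {
  s <- boxplot.stats(y)
  low_out  <- any(s$out < s$stats[1])   # outliers below the lower whisker?
  high_out <- any(s$out > s$stats[5])   # outliers above the upper whisker?
  rbind(
    c(s$stats[1], as.numeric(low_out),  0),  # lower whisker
    c(s$stats[3], 0,                    0),  # median
    c(s$stats[5], as.numeric(high_out), 0)   # upper whisker
  )
}

cp_length <- auto_control_points(iris$Sepal.Length)
cp_width  <- auto_control_points(iris$Sepal.Width)
print(cp_length)
print(cp_width)
```

Under these assumptions, Sepal.Length yields relevance 0 at every control point because it has no boxplot outliers, while Sepal.Width has outliers and therefore gets non-zero relevance at its whiskers.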

So, in your case, if you do

y <- iris$Sepal.Length
phiF.args1 <- phi.control(y, method = "extremes", extr.type = "both")
y.phi1 <- phi(y, control.parms = phiF.args1)
plot(y, y.phi1)

you get a constant value function because there are no outliers according to the boxplot.

boxplot.stats(y)

$stats
[1] 4.3 5.1 5.8 6.4 7.9

$n
[1] 150

$conf
[1] 5.632292 5.967708

$out
numeric(0)

From that perspective, the automatic method assumes that nothing is relevant. (We will probably change that.)

Anyway, you can still force the relevance function to interpolate the values you wish.

s <- boxplot.stats(y)$stats
rel <- matrix(0, ncol = 3, nrow = 0)
rel <- rbind(rel, c(s[1], 1, 0))
rel <- rbind(rel, c(s[3], 0, 0))
rel <- rbind(rel, c(s[5], 1, 0))

phiF.args2 <- phi.control(y, method = "range", control.pts = rel)
y.phi2 <- phi(y, control.parms = phiF.args2)
plot(y, y.phi2)
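As a hedged sketch, the same kind of matrix can also be handed to the resampling function through its `rel` argument instead of relying on the automatic method. The snippet below rebuilds the control points so it runs on its own, uses the full `iris` data rather than the `ir` subset from the original report, and only calls UBL if the package is installed. The seed value is arbitrary.

```r
# Control points from the boxplot statistics of the target variable:
# both whiskers are forced to relevance 1, the median to relevance 0.
y <- iris$Sepal.Length
s <- boxplot.stats(y)$stats
rel <- rbind(c(s[1], 1, 0),
             c(s[3], 0, 0),
             c(s[5], 1, 0))

# Run the resampling only when UBL is available on this machine.
if (requireNamespace("UBL", quietly = TRUE)) {
  set.seed(1)  # resampling is random; a seed makes the run repeatable
  smote_iris <- UBL::SmoteRegress(Sepal.Length ~ ., iris, rel = rel,
                                  dist = "HEOM", C.perc = "balance")
  print(dim(smote_iris))
}
```

The size of the resulting data set depends on the relevance threshold and the random sampling, so no particular output is claimed here.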

I hope my answer has helped you in clarifying your issue. Contact me, if you still have doubts.

Best regards, Rita

luna57-lr commented 2 years ago

Thank you for your reply! I ran into the same kind of problem: "Error in RandOverRegress(RPA ~ ., datai, C.perc = "balance") : All the points have relevance 1. Please, redefine your relevance function!" How should I define the relevance function? My boxplot.stats(datai$RPA)$stats are all 0!