Regressors retained when t.pval=0

gsucarrat commented 1 year ago

When t.pval=0, then no regressors should be retained. However, this is not always the case, as the following code illustrates:

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
x4 <- rnorm(100)
y <- 8+2*x1+3*x2+4*x3+rnorm(100)
getsFun(y, cbind(1,x1,x2,x3,x4), t.pval=0) #erroneously retains 1 and 4

Thanks to Santiago Diaz Espitia for notifying us about the issue. The problem may affect all functions that rely on getsFun(). This includes, amongst other, gets.arx(), gets.logitx(), gets.lm() and isat.default().

SantiagoD999 commented 1 year ago

It works perfectly with the getsm function:

set.seed(1) x1<-rnorm(100) x2<-rnorm(100) x3<-rnorm(100) x4<-rnorm(100) y<-8+2x1+3x2+4*x3+rnorm(100) Y<-zoo(cbind(y,x1,x2,x3,x4))

getsm(arx(Y[,1],mxreg=Y[,-1]),t.pval=0,do.pet = FALSE)

jkurle commented 1 year ago

Hi Santiago and thank you for raising this issue. You probably already delivered the solution yourself.

In your original code and Genaro's example, you do the PET. While in your comment above, you turn PET off. If we modify Genaro's code to also turn PET off, we find the expected result that the terminal model is empty.

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
x4 <- rnorm(100)
y <- 8+2*x1+3*x2+4*x3+rnorm(100)
getsFun(y, cbind(1,x1,x2,x3,x4), t.pval=0, do.pet=FALSE) # best terminal model is empty, as expected

Now, what is going on here? When setting t.pval = 0 without specifying wald.pval separately, then also wald.pval = 0 by default. So we would expect that the PET never rejects. However, at least in this example and potentially other cases, the PET yields test statistics that are huge (well above 4500), which is so large that R actually returns a p-value of 0 for this. Therefore, the deleted regressor is added as a keep variable and we end up with terminal models that are not empty. See code snippet below:

https://github.com/gsucarrat/gets/blob/85b73e62e24520548893a6fa15b3bbe4f1fc8824/gets/R/gets-base-source.R#L1613-L1632

So basically, this behaviour is due to rounding/machine imprecision.

How do we deal with this? I'm not sure how relevant this case is in practice since users can set do.pet=FALSE if they want to use t.pval=0. This still does not guarantee that we delete all regressors because in theory, we can also get t-statistics for individual regressors that are so big that their p-value is returned as exactly 0, again due to rounding. @SantiagoD999 Do you mind explaining your use case of this setting as a guidance for how to deal with this?

SantiagoD999 commented 1 year ago

Thanks for your reply professor Kurle, although the code gives the expected results when using the getsm function, if you set the do.PET=FALSE in the isat function this does not work as expected. This is due to machine rounding imprecision? I have been working with the package in my research and writing the code I wanted to check if all of my code functioned correctly, thereby I wanted to compare the model with only a constant to a model with many variables but with t.pval=0 to check if all was correct and if I did any mistakes in any part of my code. But you are right, in terms of practical use it is not relevant.

set.seed(1) x1<-rnorm(100) x2<-rnorm(100) x3<-rnorm(100) x4<-rnorm(100) y<-8+2x1+3x2+4*x3+rnorm(100) Y<-zoo(cbind(y,x1,x2,x3,x4))

isat(Y[,1],uis=Y[,-1],sis=FALSE,t.pval=0,blocks = 1,do.pet=FALSE)

Thank you.

jkurle commented 1 year ago

Ah, interesting, there is a second issue going on. So in addition to my previous comment, which should solve the problem with getsFun(), there is a special problem with isat(). The reason is the argument include.gum in isat().

Currently, the documentation says that the include.gum argument is deprecated. I am not sure when this has been introduced originally but I think it has been there for a while. I also don't know the reason. Effectively, the argument is always set to TRUE:

https://github.com/gsucarrat/gets/blob/85b73e62e24520548893a6fa15b3bbe4f1fc8824/gets/R/gets-isat-source.R#L116-L120

Therefore, by definition, one of the terminal models is the GUM, which in your case is the intercept plus the four potential regressors x1 to x4. Since the GUM significantly improves the constant-only model, it is chosen as the best terminal model (based on the information criterion).

As an experiment, I have created a branch where the include.gum argument is not ignored and here the function behaves as expected. See code below if you want to try it out. It installs gets from the new branch, so make sure to install the released, official version again afterwards:

# install new branch
devtools::install_github(repo="gsucarrat/gets", ref="detective", subdir="gets")

set.seed(1)
x1<-rnorm(100)
x2<-rnorm(100)
x3<-rnorm(100)
x4<-rnorm(100)
y<-8+2*x1+3*x2+4*x3+rnorm(100)
Y<-zoo(cbind(y,x1,x2,x3,x4))

isat(Y[,1],uis=Y[,-1],sis=FALSE,t.pval=0,blocks = 1,do.pet=FALSE, include.gum=TRUE) # gives same result as before
isat(Y[,1],uis=Y[,-1],sis=FALSE,t.pval=0,blocks = 1,do.pet=FALSE, include.gum=FALSE) # selects intercept-only model

SantiagoD999 commented 1 year ago

It worked perfectly, thank you very much.

jkurle commented 1 year ago

Glad to hear!

@gsucarrat: How do we deal with the include.gum argument being ignored/deprecated? Apparently, it was last modified (added?) in the commit 6f424a50b42e9c4c336a82cd08ef9ef167dff3c9 on 2/10/2020.

gsucarrat commented 1 year ago

First of all Jonas, thanks a lot! As for include.gum, I think I was the one who deprecated the argument, but I do not exactly remember why... We could try to simply re-enable the argument in isat() and see what happens. If something breaks, then maybe an idea could be to handle the situation t.pval=0 specifically?

gsucarrat / gets

Regressors retained when t.pval=0 #67