gsucarrat / gets

R Package for General-to-Specific (GETS) modelling and Indicator Saturation (ISAT) methods

Explicit error message for character columns in estimation data #14

Open moritzpschwarz opened 4 years ago

moritzpschwarz commented 4 years ago

I think we should improve the error that occurs when a character or Date column is in the estimation data.

The error that is currently printed does not point to the actual cause:

Error in qr.default(x, tol, LAPACK = LAPACK) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In storage.mode(x) <- "double" : NAs introduced by coercion

I think we should strive towards something like: "Please remove all non-numeric columns from your estimation data" or alternatively just ignore the columns in the estimation. What do you think?
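
The warning in the output above hints at the mechanism: when a data frame containing a character column is converted to a matrix, the whole matrix becomes character, and forcing it to double turns the non-numeric entries into NA. The base-R sketch below reproduces only that coercion step with a small made-up data frame. It does not call the gets package, and the exact internal path taken by arx() is an assumption here.

```r
# Toy data frame mixing a character column with a numeric one (made-up values)
df <- data.frame(Date = c("2008-08-28", "2008-09-02", "2008-09-03"),
                 x = c(0.3, 5.5, -0.3),
                 stringsAsFactors = FALSE)

# as.matrix() on mixed columns yields a character matrix
m <- as.matrix(df)
is.character(m)

# Forcing the storage mode to double turns the date strings into NA;
# this is the step that emits "NAs introduced by coercion"
m_num <- m
suppressWarnings(storage.mode(m_num) <- "double")
anyNA(m_num[, "Date"])
```

A matrix containing NA values passed on to a QR decomposition would then plausibly produce the "NA/NaN/Inf in foreign function call" error, which is why the message never mentions the offending column.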

Below is a quick example that illustrates the error.


y <- c(0, -0.5, -0.3, 0.2, -0.3, -0.2, -1.3, -0.1, 0.3, 0.8, -0.9, 
0.2, 0.5, 0.9, 0.2, 0.2, -0.5, 0, -0.1, -1.3, -0.3, 0.2, -1.1, 
0.1, -0.5, 0.8, -0.6, 1, -0.5, -0.6, -0.2, -0.6, -1.3, 0.2, -1, 
-1.3, -0.2, 0.8, -0.3, 0.2, 0.7, -0.4, -0.4, 0.5, -0.2, -0.6, 
-0.1, -0.8, -0.4, -0.2, -1.5, 0.1, 0.9, -0.1, -0.2, -0.1, -0.2, 
-0.2, -0.4, -1.1, 0.7, -0.2, 0.1, 0.6, -0.1, 0.4, 0, -0.2, 0, 
0.3, -0.3, 0, 0.3, -0.2, 0.3, -0.3, -0.7, -0.3, -0.6, -0.2, -0.5, 
0, -1.6, -0.1, 0.1, 0, 0.5, -0.2, -0.2, 0.1, -0.5, -0.8, -0.2, 
-0.2, -0.1, 0, -0.8, -0.8, -0.3, 0.5, 1)

mX <- structure(list(Date = c("2008-08-28", "2008-09-02", "2008-09-03", 
"2008-09-04", "2008-09-05", "2008-09-08", "2008-09-09", "2008-09-10", 
"2008-09-11", "2008-09-12", "2008-09-15", "2008-09-17", "2008-09-18", 
"2008-09-19", "2008-09-22", "2008-09-23", "2008-09-24", "2008-09-25", 
"2008-09-26", "2008-09-29", "2008-10-01", "2008-10-02", "2008-10-06", 
"2008-10-07", "2008-10-08", "2008-10-09", "2008-10-10", "2008-10-14", 
"2008-10-15", "2008-10-16", "2008-10-20", "2008-10-21", "2008-10-22", 
"2008-10-23", "2008-10-24", "2008-10-27", "2008-10-28", "2008-10-29", 
"2008-10-30", "2008-11-03", "2008-11-04", "2008-11-05", "2008-11-06", 
"2008-11-07", "2008-11-10", "2008-11-12", "2008-11-13", "2008-11-17", 
"2008-11-18", "2008-11-19", "2008-11-20", "2008-11-21", "2008-11-24", 
"2008-11-25", "2008-11-26", "2008-12-01", "2008-12-02", "2008-12-03", 
"2008-12-04", "2008-12-05", "2008-12-08", "2008-12-09", "2008-12-10", 
"2008-12-11", "2008-12-12", "2008-12-15", "2008-12-16", "2008-12-18", 
"2008-12-19", "2008-12-22", "2008-12-23", "2008-12-29", "2009-01-02", 
"2009-01-05", "2009-01-06", "2009-01-07", "2009-01-08", "2009-01-09", 
"2009-01-12", "2009-01-13", "2009-01-14", "2009-01-15", "2009-01-20", 
"2009-01-21", "2009-01-22", "2009-01-23", "2009-01-26", "2009-01-27", 
"2009-01-28", "2009-01-29", "2009-02-02", "2009-02-03", "2009-02-04", 
"2009-02-05", "2009-02-06", "2009-02-09", "2009-02-10", "2009-02-11", 
"2009-02-12", "2009-02-17", "2009-02-18"), x = c(0.3, 5.5, -0.3, 
-0.4, 0.3, -0.4, 0, 0.5, 0.1, 0, -0.3, 0.4, -0.4, 0.1, -0.2, 
-0.2, -0.7, 0.1, 0.2, -0.1, 3.9, -0.1, -0.3, -0.7, -0.4, -1.3, 
-1, 1.3, -2.2, -0.2, 0.3, -0.2, 1, 0.1, 0.7, -1.3, -0.1, -0.7, 
-0.2, 2.1, 0.4, -1.2, -1, -0.4, -1.9, -1.4, -1.9, 0.8, -0.3, 
0, -0.3, 0.4, -1.5, 0.2, 0.1, -0.1, -1.3, -0.1, 0.2, -0.2, -0.3, 
-0.3, 0.6, 1.7, -0.4, -0.9, -0.7, -0.6, 0.6, -0.4, 0.2, 1.3, 
-2.9, 0.6, 2.3, -0.9, -0.5, -0.1, -0.3, 0.4, 1, 2.2, -4.6, -0.8, 
1, 0.6, 0.2, 0.5, 1.1, 0.4, -0.5, -0.6, -0.5, 0.7, 0.4, 0, -0.7, 
-2.4, -1.9, -1.9, 0.3)), class = "data.frame", row.names = c(NA, 
-101L))

library(gets)  # provides arx()
arx(y = y, mxreg = mX)

gsucarrat commented 4 years ago

I agree, a more useful error message is desirable. I suggest handling this in regressorsMean() and regressorsVariance(), which both arx() and isat() call when creating the regressor matrix. One idea is to introduce the check after the if(na.omit) part of the code and before the 'output' part.
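
As a rough illustration of what such a check might look like, here is a minimal sketch. The helper name check_numeric_regressors() and its 'argname' argument are hypothetical and not part of the gets package; the sketch only shows the idea of failing early with a message that names the offending columns.

```r
# Hypothetical helper (not part of gets): stop with a readable message
# if any regressor column is non-numeric.
check_numeric_regressors <- function(x, argname = "mxreg") {
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  bad <- names(x)[!vapply(x, is.numeric, logical(1))]
  if (length(bad) > 0) {
    stop("Non-numeric column(s) in '", argname, "': ",
         paste(bad, collapse = ", "),
         ". Please remove or convert them before estimation.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Usage with made-up data: the character column triggers the message
df <- data.frame(Date = c("2008-08-28", "2008-09-02"), x = c(0.3, 5.5),
                 stringsAsFactors = FALSE)
msg <- tryCatch(check_numeric_regressors(df), error = conditionMessage)
msg
```

Running the check on the data before the regressor matrix is built means the user sees the column name ("Date") instead of a low-level QR error.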

Note that there is a related issue, namely whether character vectors should automatically be converted to dummies (one for each level) as in lm(). Here is an example:

y <- rnorm(20); x <- rnorm(20); z <- letters[1:20]
lm(y ~ x+z)
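
To make the conversion visible: lm() builds its design matrix with model.matrix(), which turns a character variable into a factor and expands it into 0/1 dummy columns. The sketch below uses a made-up variable with three levels (rather than 20 distinct letters) so that the resulting columns are easy to read.

```r
set.seed(1)  # reproducible made-up data
x <- rnorm(6)
z <- rep(c("a", "b", "c"), times = 2)

# With an intercept, the first level ("a") is absorbed as the baseline
X <- model.matrix(~ x + z)
colnames(X)  # "(Intercept)" "x" "zb" "zc"

# Without an intercept, every level of z gets its own dummy
Z <- model.matrix(~ 0 + z)
colnames(Z)  # "za" "zb" "zc"
```

Note that in the example above with z <- letters[1:20], every observation has its own level, so the model has more coefficients than observations and lm() reports NA for the ones it cannot estimate.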

I think this is really neat and convenient. It would be nice if arx()/isat() could handle character vectors in the same way. Note that matrices of class 'zoo' can be a mix of both numeric and character vectors, so the fact that all data-handling in the 'gets' package relies on 'zoo' objects should not be a limitation. Again, I think the right place to introduce this type of functionality is in regressorsMean() and regressorsVariance(). As an idea, we could add (say) an argument 'factors.as.dummies' with default 'FALSE', which converts a character vector to its associated dummies if set to 'TRUE'. If 'FALSE', then a more meaningful error message could be returned if any of the variables in 'mxreg' (or 'vxreg') are non-numeric.
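
The proposed behaviour could be sketched roughly as follows. This is not the gets implementation: the function name expand_regressors() is hypothetical, and only the argument name 'factors.as.dummies' and its default come from the suggestion above. The sketch takes a data frame as input, since a plain R matrix stores a single type; how this would map onto 'zoo' objects is left open.

```r
# Hypothetical sketch (not part of gets) of the proposed
# 'factors.as.dummies' behaviour for a data frame of regressors.
expand_regressors <- function(x, factors.as.dummies = FALSE) {
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  non_num <- !vapply(x, is.numeric, logical(1))
  if (any(non_num) && !factors.as.dummies) {
    stop("Non-numeric variable(s): ",
         paste(names(x)[non_num], collapse = ", "),
         ". Remove them or set factors.as.dummies = TRUE.",
         call. = FALSE)
  }
  parts <- lapply(names(x), function(nm) {
    col <- x[[nm]]
    if (is.numeric(col)) {
      # Numeric variables pass through unchanged as a one-column matrix
      matrix(col, ncol = 1, dimnames = list(NULL, nm))
    } else {
      # One 0/1 column per level, named <variable><level>
      f <- factor(col)
      out <- outer(as.character(f), levels(f), "==") * 1
      colnames(out) <- paste0(nm, levels(f))
      out
    }
  })
  do.call(cbind, parts)
}

# Usage with made-up data
df <- data.frame(g = c("a", "b", "a", "c"), x = c(0.3, 5.5, -0.3, -0.4),
                 stringsAsFactors = FALSE)
X <- expand_regressors(df, factors.as.dummies = TRUE)
colnames(X)  # "ga" "gb" "gc" "x"
```

Unlike lm(), this sketch keeps one dummy for each level, as described above. If the model also includes an intercept, one level would have to be dropped to avoid perfect collinearity.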