greenplum-db / PivotalR-archive

A convenient R tool for manipulating tables in PostgreSQL-type databases and a wrapper for Apache MADlib.
https://pivotalsoftware.github.io/gp-r/

generic.cv.R #34

Closed whizzalan closed 9 years ago

whizzalan commented 9 years ago

When I used generic.cv with madlib.elnet, I ran into the following problem.

x <- matrix(rnorm(100*20),100,20)
y <- rbinom(100,1,0.5)
dat <- data.frame(x, y)
delete("eldata")
z <- as.db.data.frame(dat, "eldata", verbose = FALSE)

g <- generic.cv(
  train = function (data, alpha, lambda) {
    madlib.elnet(y ~ ., data = data, family = "binomial",
                 alpha = alpha, lambda = lambda)
  },
  predict = predict, #function(fit,newdata) {predict(fit,newdata,type="response")} ,
  metric = function (predicted, data) {
    lk(mean((data$y - predicted)^2))
  },
  data = z,
  params = list(alpha=1, lambda=seq(0,0.2,0.1)),
  k = 3, find.min = TRUE)

Computation in-database ...
Cutting the data row-wise into 3 pieces ...
Running on fold 1 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Running on fold 2 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Running on fold 3 now ...
parameters 1, 0 ...
parameters 1, 0.1 ...
parameters 1, 0.2 ...
Done.
Fitting the best model using the whole data set ...
Executing in database connection 1:

select madlib.elastic_net_train('"eldata"', 'madlib_temp_5d38a1b6_056d_a5aba2_90e7cb4f218a', '("y")::boolean', 'array["X1","X2","X3","X4","X5","X6","X7","X8","X9","X10","X11","X12","X13","X14","X15","X16","X17","X18","X19","X20"]', 'binomial', , , TRUE, NULL, 'fista', 'use_active_set = f', NULL, 100, 1e-04)

walkingsparrow commented 9 years ago

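This seems to be a simple bug: the problematic query contains consecutive commas ('binomial', , , TRUE), so the arguments that belong between them (alpha and lambda_value in elastic_net_train) are empty.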
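To illustrate how empty arguments produce that comma pattern, here is a minimal sketch in plain R. It is not taken from the PivotalR source: `build_args` is a hypothetical helper, and the argument values are made up for the example. It only shows that joining a vector containing empty strings yields `, ,`, and how a guard could catch that before the query is sent.

```r
# Hypothetical helper (not part of PivotalR): join SQL arguments and
# refuse to build the call when any argument is an empty string.
build_args <- function(args) {
  empty <- !nzchar(args)
  if (any(empty)) {
    stop("empty SQL argument(s) at position(s): ",
         paste(which(empty), collapse = ", "))
  }
  paste(args, collapse = ", ")
}

# An unchecked join reproduces the pattern seen in the failing query.
broken <- paste(c("'binomial'", "", "", "TRUE"), collapse = ", ")
print(broken)

# The guarded version fails early and reports which slots are empty.
result <- tryCatch(
  build_args(c("'binomial'", "", "", "TRUE")),
  error = function(e) conditionMessage(e)
)
print(result)
```

A check of this kind would surface the problem as an R error naming the empty positions, rather than as a SQL syntax error from the database.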

whizzalan commented 9 years ago

How do I modify the source code?

fmcquillan99 commented 9 years ago

The MADlib dev team will take a look at this and update this thread. Thanks, Frank

whizzalan commented 9 years ago

I think the problem is in the R function generic.cv, not in MADlib. Is there any resource on how to modify the R source code?

sziegler11-zz commented 9 years ago

In the definition of the metric function, "predicted" should be cast from boolean to integer with the "as.integer" function. Without the cast, the database tries to subtract booleans from numeric values and returns NA.
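A minimal sketch of the metric with the suggested cast follows. The names `metric_mse` and `summarise` are assumptions introduced for illustration. The in-memory check uses a plain data.frame, so it runs without a database. With PivotalR one would presumably pass `lk` as `summarise`, as in the original call, but that path has not been run here.

```r
# Metric with the suggested cast: for a binomial model the predictions
# are booleans, so convert them to integers before subtracting.
# `summarise` is a hypothetical hook; it defaults to identity for
# in-memory data (with PivotalR, `lk` would take its place).
metric_mse <- function(predicted, data, summarise = identity) {
  summarise(mean((data$y - as.integer(predicted))^2))
}

# In-memory check with toy values (no database connection needed).
toy <- data.frame(y = c(1, 0, 1, 1))
pred <- c(TRUE, FALSE, FALSE, TRUE)
mse <- metric_mse(pred, toy)
print(mse)
```

In the original generic.cv call, the equivalent change would be to wrap `predicted` in `as.integer()` inside the metric function and leave the rest of the call unchanged.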