refunders / refund

Regression with functional data

naming scheme for functional data in dataframe?? #64

Closed jeff-goldsmith closed 8 years ago

jeff-goldsmith commented 8 years ago

@fabian-s @lxiao16 @philreiss @huangracer

Does anyone know where the naming convention for columns in the ydata argument to fpca.sc comes from? That scheme is .index for the time grid, .id for the subject identifier, and .value for the y value in the curve.

Is that a naming scheme we’d want to keep moving forward? Specifically, I wonder about the “.” that precedes the variable name and the use of “index” rather than “argval”.
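For concreteness, a toy example of that long format as I understand it (made-up numbers, just to show the column names):

# toy example of the long format currently expected by fpca.sc's ydata argument:
# .id = subject identifier, .index = time grid, .value = observed function value
ydata <- data.frame(
  .id    = c(1, 1, 1, 2, 2),
  .index = c(0.0, 0.5, 1.0, 0.2, 0.8),
  .value = c(1.1, 1.9, 2.7, 0.4, 1.3)
)
str(ydata)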

Doesn’t make a big difference for fpca.sc now, but I’m developing some other code that uses data frames (rather than matrices) to store functional data; eventually this may go into refund, so I’d like to settle on a useful naming scheme sooner rather than later.

fabian-s commented 8 years ago

I used this for irregular responses for pffr. It was a terrible decision to do it like this. If I were doing it all over again, I'd do something like this:

# generate some irregular functional observations
functions <- replicate(10, {
  length <- sample(5:20, 1)
  argvals <- sort(sample(1:20, length))
  x <- argvals + rnorm(length)
  structure(x, argvals = argvals, class = "some_new_S3class_for_irregular_functions")
}, simplify = FALSE) # always return a list, even if the lengths happened to match

d <- data.frame(id = factor(1:10), some_covariate = rexp(10))
d$irregular_functions <- functions
str(d[1:3,])
#'data.frame':  3 obs. of  3 variables:
# $ id                           : Factor w/ 10 levels "1","2","3","4",..: 1 2 3
# $ some_covariate               : num  0.0645 0.9887 0.1616
# $ irregular_functions:List of 3
#  ..$ :Class 'some_new_S3class_for_irregular_functions'  atomic [1:17] 1.25 2.6 5.83 3.88 7.27 ...
#  .. .. ..- attr(*, "argvals")= int [1:17] 2 3 4 5 6 7 10 11 12 13 ...
#  ..$ :Class 'some_new_S3class_for_irregular_functions'  atomic [1:7] 1.98 8.2 8.56 13.96 15.34 ...
#  .. .. ..- attr(*, "argvals")= int [1:7] 2 8 9 12 16 18 20
#  ..$ :Class 'some_new_S3class_for_irregular_functions'  atomic [1:8] 2.46 10.37 9.4 9.45 13.03 ...
#  .. .. ..- attr(*, "argvals")= int [1:8] 4 9 10 11 13 17 19 20

This would make it possible to avoid having separate data arguments for the covariates and the irregular functional responses in pffr, with all the bookkeeping nightmares that causes. It might also be useful for a unified interface to the fpca.XXX functions, where the distinction between gridded and irregular inputs could then be made inside the function instead of having the user supply slightly different arguments in the two cases (rough sketch below).
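To make that concrete, a rough sketch of what such a unified entry point could look like (function name and details are made up, this isn't actual refund code):

# purely illustrative: one entry point that branches on how the functional
# response is stored, instead of separate user-facing arguments per format
fit_fpca <- function(Y, ...) {
  if (is.matrix(Y)) {
    # regular grid: one row per curve, one column per grid point
    type <- "gridded"
  } else if (is.list(Y) &&
             all(vapply(Y, function(f) !is.null(attr(f, "argvals")), logical(1)))) {
    # irregular: a list of curves, each carrying its own argvals attribute
    type <- "irregular"
  } else {
    stop("unrecognised functional data format")
  }
  type # a real implementation would dispatch to the appropriate backend here
}
# fit_fpca(matrix(rnorm(50), nrow = 5))  # "gridded"
# fit_fpca(d$irregular_functions)        # "irregular"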

> Is that a naming scheme we’d want to keep moving forward? Specifically, I wonder about the “.” that precedes the variable name and the use of “index” rather than “argval”.

Probably not, see above. The leading periods are there to avoid name conflicts with user-defined variables that occur in the model formula etc. Personally I don't like argvals because using abbreviations in argument names is not best practice (you have to remember how you abbreviated things, which takes cognitive capacity away from more important stuff), but I'm ready to accept argvals as the standard...

EDIT: @ClaraHapp from our group has written a (Github-only, as of now) package funData with S4 data classes for regular and irregular (multivariate) functional data & images -- this might be a nice alternative to use as well, at the cost of another external dependency.

philreiss commented 8 years ago

If we have a better way than the leading periods, etc., AND we can implement it in time for the upcoming CRAN submission, then I'm all for it. I don't have very strong opinions about which naming scheme to use. But regarding "argvals": that comes from the fda package. Back in the day, refund was basically just fosr(), which can take fda-package "fd" objects as input, so it seemed sensible to use "argvals" as Ramsay & co. do.

jeff-goldsmith commented 8 years ago

@philreiss -- i don't know about others, but i wouldn't intend this as a change ahead of the upcoming CRAN submission. i'm working on a separate project and need to figure out a good way to store functional data; this is / will be pertinent to refund and refund.shiny eventually, which is why i'm posting here, but it's not really pressing.

@fabian-s -- thanks for the feedback! it seems like what you propose (and what @ClaraHapp has implemented) could handle the case of multiple functional observations on distinct grids, which is a definite plus for pffr and her work on multivariate functional data on different grids. for the case of a single functional response, though, it seems trickier to interact with than a long-format data.frame that stores argvals as a variable rather than complete functional observations. for example, d above doesn't play nicely with dplyr or ggplot2, and it seems like it would take some effort to extract the argvals across subjects.
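something like the following, i guess (a quick sketch, not tested against anything in refund):

# collect the argvals across subjects from the list column
argvals_list <- lapply(d$irregular_functions, attr, "argvals")
# flattening everything into a long data.frame also has to be done by hand
long <- data.frame(
  id      = rep(d$id, lengths(d$irregular_functions)),
  argvals = unlist(argvals_list),
  value   = unlist(lapply(d$irregular_functions, as.vector))
)
head(long)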

it does seem like it would be possible to write a couple of functions to convert between the format you have here and the existing "long" format (i.e. a data.frame with variables id, argval and value), but i'll have to think more about that.

in the short term, i'm happy to remove the leading periods. fabian, do you prefer index to argvals for the reason you cite? or do you have another suggestion?

fabian-s commented 8 years ago

> for the case of a single functional response, though, it seems trickier to interact with than a long-format data.frame that stores argvals as a variable rather than complete functional observations. for example, d above doesn't play nicely with dplyr or ggplot2, and it seems like it would take some effort to extract the argvals across subjects.

True, you'd definitely need as.data.frame methods. The big advantage of the above is in settings where you have to keep (multiple) irregular functional and scalar covariates that belong together in the same object. If you're in a "univariate" setting where you only care about one functional covariate in isolation, the hassle of non-standard columns in data.frames is probably too high a price to pay....

> it does seem like it would be possible to write a couple of functions to convert between the format you have here and the existing "long" format (i.e. a data.frame with variables id, argval and value), but i'll have to think more about that.

that's pretty easy:

# generate some irregular functional observations
test <- replicate(10, {
  length <- sample(5:20, 1)
  argvals <- sort(sample(1:20, length))
  x <- argvals + rnorm(length)
  structure(x, argvals = argvals, class = "irregular_function")
}, simplify = FALSE)
# define a class for a collection of "irregular_function" objects
class(test) <- c("irregular_functions", class(test))

# single curve --> long data.frame; the id column gets filled in by the collection method below
as.data.frame.irregular_function <- function(x, row.names = NULL, optional = FALSE, ...) {
  data.frame(id = NA, argvals = attr(x, "argvals"), value = as.vector(x))
}

test[[1]]
# [1]  2.152043  3.794492  4.048538  4.390470  5.255414  6.427627
# [7]  7.621017  7.879706 10.995162 13.671161 13.539449 13.746053
#[13] 15.722840 17.559681 19.444414
#attr(,"argvals")
# [1]  1  3  4  5  6  7  8  9 12 13 14 15 16 18 20
#attr(,"class")
#[1] "irregular_function"

as.data.frame(test[[1]])[1:3,]
#  id argvals    value
#1 NA       1 2.152043
#2 NA       3 3.794492
#3 NA       4 4.048538

as.data.frame.irregular_functions <- function(x, row.names = NULL, optional = FALSE, id = NULL, ...) {
  # use <id> argument, else names of <x>, else integer codes for id in return object
  data_list <- lapply(x, as.data.frame)
  if (is.null(id)) {
    if (is.null(names(x))) {
      id <- seq_along(x)
    } else {
      stopifnot(all(!duplicated(names(x))))
      id <- names(x)
    }
  } # else: check the supplied id argument
  data_list <- lapply(seq_along(x), function(i) {
    data_list[[i]]$id <- id[i]
    data_list[[i]]
  })
  do.call(rbind, data_list)
}
as.data.frame(test)[76:80, ]
#   id argvals     value
#76  5      18 20.326934
#77  5      19 17.348421
#78  5      20 19.488374
#79  6       1  2.490033
#80  6       4  3.744802
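going the other way -- long data.frame back to a list of irregular functions -- would be about as short; a rough sketch (function name made up), assuming the id / argvals / value columns used above:

# sketch of the reverse conversion: long data.frame -> "irregular_functions" list
as.irregular_functions <- function(df) {
  stopifnot(all(c("id", "argvals", "value") %in% names(df)))
  out <- lapply(split(df, df$id), function(curve) {
    structure(curve$value, argvals = curve$argvals, class = "irregular_function")
  })
  class(out) <- c("irregular_functions", class(out))
  out
}
str(as.irregular_functions(as.data.frame(test))[[1]])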

> in the short term, i'm happy to remove the leading periods. fabian, do you prefer index to argvals for the reason you cite? or do you have another suggestion?

I'll go along: you do the work, you get to decide. But if you do change it, it would be best to be consistent across the package and also make the necessary changes for pffr and its methods (and not mess up my ugly hairball of pffr code in the process :wink:).

jeff-goldsmith commented 8 years ago

> True, you'd definitely need as.data.frame methods. The big advantage of the above is in settings where you have to keep (multiple) irregular functional and scalar covariates that belong together in the same object. If you're in a "univariate" setting where you only care about one functional covariate in isolation, the hassle of non-standard columns in data.frames is probably too high a price to pay....

Agreed. Since I'm working with univariate data I'll stick with a standard data.frame for now, and keep your as.data.frame function (and one that converts a usual data.frame to an irregular_functions object) in mind for later. We'd need them if you reformat pffr to handle the more complex data structure, and probably for some other cases I'm not thinking of now.

Alright, I'm going to use subj, index, and value (without leading periods) as variable names. If this does make it into refund (or close proximity to refund), I can try to make the changes to pffr without mucking things up.
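For the record, a toy example of what that looks like (made-up numbers again):

# long format with the agreed-upon names: subj / index / value, no leading periods
ydata <- data.frame(
  subj  = c(1, 1, 1, 2, 2),
  index = c(0.0, 0.5, 1.0, 0.2, 0.8),
  value = c(1.1, 1.9, 2.7, 0.4, 1.3)
)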