scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License
455 stars 121 forks source link

Support missing data in response matrix, y #148

Open ghost opened 7 years ago

ghost commented 7 years ago

A recent change allows missing data in the design matrix. Might the same be done for the response matrix?

I am adapting EARTH to the time series case, per TSMARS univariate technique documented in http://calhoun.nps.edu/handle/10945/30014.

I am also interested in a multivariate version of TSMARS, and am experimenting with this in using the earth R package. Of course that package does not support NA data at all, at least at present.

jcrudy commented 7 years ago

@empirical-bayesian I'm interested in this idea, but I want to be sure I understand it. Could you explain in a little more detail what your goal is? In the case of a single response column, you could just remove rows with missing responses. Are you thinking, then, about the case of multiple response columns, where a given row might be missing some but not all responses? If so, that is very doable and a good idea. I just never really thought about it. Essentially, the least-squares loss would just not include those NA entries? Would you want to correct for different frequencies of missingness in the different response columns? I suppose that could be done with weighting by the user, but if it's the typical use case then it maybe should be automatic.

Actually, I think you could accomplish everything you want using weighting as it is currently implemented in py-earth. Just give zero weight to the NA entries in your response. That would be mathematically equivalent to what you want, if I've guessed right about what you're proposing. It would be nice to put some functionality in py-earth to zero-out those weights automatically, though.

ghost commented 7 years ago

In field work, responses can just as readily have missing data as corresponding predictors. The treatment of response missing values is a bit different, in general, than that of predictors. Sure, it is not generally the case that the positions or imputed values contribute to analysis, so the weighting of them as zero is proper, and one approach. However, in general, when, say, treating sequential or time-dependent series with Kalman-Rauch-Tung-Striebel filtering, one byproduct of doing analysis of responses with missing data is to provide an estimate for those missing values. These values are not used in the analysis but come from it.

So in the case of EARTH, after an analysis is done, offering (asterisked!) estimates for the missing responses is a reasonable request.

Of course, in my case, when I'm seeking to do TSMARS, the situation is more complicated. That's because the design matrix is itself a set of lagged version of (one of) the responses. So, if a response matrix is the column matrix of [1,2,3,3,4,5,6,7,8,9], the corresponding design matrix of order 4 (meaning, up to 4-lags) is:

 [1,]    1    2    3    4
 [2,]    2    3    4    5
 [3,]    3    4    5    6
 [4,]    4    5    6    7
 [5,]    5    6    7    8
 [6,]    6    7    8    9

and the last 3 elements of the response are disregarded because there's no uncensored meaningful part of the order 4 predictor which corresponds. Unedited, the full design matrix would be

      [,1] [,2] [,3] [,4]
 [1,]    1    2    3    4
 [2,]    2    3    4    5
 [3,]    3    4    5    6
 [4,]    4    5    6    7
 [5,]    5    6    7    8
 [6,]    6    7    8    9
 [7,]    7    8    9   NA
 [8,]    8    9   NA   NA
 [9,]    9   NA   NA   NA

So, in my case, if there was an estimate made of a missing value in the response, say, hypothetically, a response like [1,2,3,NA,5,6,7,8,9], the design matrix would be initially

      [,1] [,2] [,3] [,4]
 [1,]    1    2    3   NA
 [2,]    2    3   NA    5
 [3,]    3   NA    5    6
 [4,]   NA    5    6    7
 [5,]    5    6    7    8
 [6,]    6    7    8    9

I would proposed wrapping an iterate-until-converged loop around the call to EARTH, and feeding that estimate back into the design, and doing it again, until successive changes in the missing response don't change much.

But I understand that is not your problem, that's mine. I'm content with getting estimates out into the response variables.

One can, probably, go forward as much as trying to make estimates of missing predictors, but I have not worked out how to do that with the mathematics of multivariate adaptive regression splines, not yet.

I also am interested in a capability of smoothing hinge points, but that's a different ticket. After all, that kind of thing is what splines are all about.

Thanks!

jcrudy commented 7 years ago

If I understand correctly, the functionality you need from py-earth to accomplish your goal is:

  1. Support missing data in the design matrix
  2. Support missing data in the response matrix
  3. Predict from a fitted model
  4. Smooth hinge points

Is this correct? If so, then I think, if you use weighting to accomplish 2, you have everything. The smoothing can be accomplished by passing smooth=True to the Earth constructor. Am I understanding correctly?

ghost commented 7 years ago

Okay! I'll try it out, and if it works, I'll close the ticket, including an attachment of an example.

Thanks.

jcrudy commented 7 years ago

@empirical-bayesian Great. I'm interested to see how this works for you.