stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0
1.64k stars 215 forks

Implementing left-censoring? #193

Open driesvr opened 3 years ago

driesvr commented 3 years ago

Hi all,

Thank you for creating a very nice package. Are there any plans to implement support for left-censoring? This could be very useful for applications in drug discovery, where assays can typically only test up to a certain concentration, resulting in a lot of censored data.

alejandroschuler commented 3 years ago

That's a good idea. Don't have any plans to work on it at the moment but I'll leave the issue open as a reminder.

driesvr commented 3 years ago

What would be required to implement this? If I have some time I might take a look at it.

alejandroschuler commented 3 years ago

Excellent! The first step is defining how the score (like a loss) will work for censored data. If you work with the log score (maximum likelihood) and consider right-censoring, the natural approach is to use the censored log-likelihood (eq. 7.7 here). I am certain that equivalents exist that are appropriate for left or interval censoring, but I don't know them off the top of my head. It would be nice to work with as general-purpose a formalism as possible so that the same code could handle right-censoring, interval censoring, etc.

Once the formalism is established, the implementation will likely be straightforward. The first step is to understand how scores are implemented for ngboost distributions; that process is detailed in the user guide. The only difference for censored scores is the form of the data structure for the outcome vector. You can see an example of censored score implementations for the exponential distribution. We're not married to these implementation details, so if we need something else to make it work in greater generality, let's change it.
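For orientation, an ngboost score class pairs a `score` (the loss) with its gradient `d_score`, and a censored score only differs in that the outcome carries both a time and an event indicator. Here is an illustrative standalone sketch of a right-censored log score for the exponential distribution — the class name and the dict-like `Y` layout are assumptions for illustration, not the library's actual implementation:

```python
import numpy as np

class ExponentialCensoredLogScoreSketch:
    """Illustrative right-censored log score for Exp(rate).

    Y is assumed to be dict-like with:
      Y["Time"]  - observed or censoring time
      Y["Event"] - 1 if the event was observed, 0 if right-censored
    """

    def __init__(self, rate):
        self.rate = np.asarray(rate, dtype=float)

    def score(self, Y):
        # negative censored log-likelihood:
        # observed rows contribute -log f(t) = -(log rate - rate * t)
        # censored rows contribute -log S(t) = rate * t
        t, e = Y["Time"], Y["Event"]
        logpdf = np.log(self.rate) - self.rate * t
        logsf = -self.rate * t
        return -(e * logpdf + (1 - e) * logsf)

    def d_score(self, Y):
        # gradient of the score with respect to rate:
        # observed rows: t - 1/rate, censored rows: t
        t, e = Y["Time"], Y["Event"]
        return e * (t - 1.0 / self.rate) + (1 - e) * t
```

The only structural difference from an uncensored log score is that each row switches between the log-density and log-survival terms based on its event indicator.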

I'm sure you will have many questions about all aspects of this so please feel free to comment and ask. I'm also happy to set up a call if you want to discuss in greater detail or walk through open-ended questions. You can reach me at alejandro [dot] schuler (gmail address) if you want to set that up. Open to any level of collaboration- could even write a paper about extending ngboost to censored distributions if you're interested.

avati commented 3 years ago

A good starting point might be https://github.com/stanfordmlgroup/ngboost/blob/master/scripts/run_empirical_survival.sh, which currently works for right-censoring.

alejandroschuler commented 3 years ago

@driesvr thanks for the conversation this morning. To recap (and as a reminder for myself) here's what the plan is going forward:

  1. I'll find/derive the correct form of the likelihood for generic censoring structures (i.e. left/right/interval)
  2. I'll implement this form for the log score in the normal distribution and make the required API changes
  3. You guys will take this and run some evaluations internally
  4. We'll reconnect and decide on future scope of work

lmk if I missed anything!

driesvr commented 3 years ago

Hi Alejandro, that sounds perfect! Looking forward to it :)

alejandroschuler commented 3 years ago

Hey @driesvr, just wanted to give you a quick update. I've derived the form of the likelihood for interval data (left and right censoring being interval censoring with interval of the form (-inf, C] and [C, inf), respectively). Presuming conditional independence between the censoring and survival distributions, we have that the likelihood is

$$L(\theta) = G \cdot \prod_i \left[f_T(t_i)\right]^{d_i}\left[F_T(c_{R,i}) - F_T(c_{L,i})\right]^{1-d_i}$$

Here $T$ is the survival time, $f_T$ the survival distribution PDF, and $F_T$ its CDF. If the observation is not censored we observe $t_i$ and $d_i = 1$. If it is censored we observe $c_{L,i}$ and $c_{R,i}$ and $d_i = 0$. $G$ is a term that does not depend on the parameters of the survival distribution, so maximizing the log-likelihood is the same as maximizing

$$\ell(\theta) = \sum_i d_i \log f_T(t_i) + (1-d_i)\log\left[F_T(c_{R,i}) - F_T(c_{L,i})\right]$$

which has gradient

$$\nabla_\theta \ell(\theta) = \sum_i d_i \,\nabla_\theta \log f_T(t_i) + (1-d_i)\,\frac{\nabla_\theta F_T(c_{R,i}) - \nabla_\theta F_T(c_{L,i})}{F_T(c_{R,i}) - F_T(c_{L,i})}$$

So what I need to do is calculate what this gradient is for the case of the normal distribution (or any other distribution) and implement it. The other thing I have to do is change the way user input is handled so that Y is some kind of data structure that contains a single real-valued measurement of T when it is observed and otherwise contains an interval (cL, cR). For convenience, I'll also make a wrapper that hardcodes cR = inf or cL = -inf for the cases of right or left censoring (i.e. the user does not need to be aware that left/right censoring is being generalized to interval censoring). Does that meet your requirements? @avati, thoughts?
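The gradient specializes cleanly to the normal case: observed rows contribute $(t_i - \mu)/\sigma^2$ and censored rows contribute $-[\phi(z_R) - \phi(z_L)] / (\sigma[\Phi(z_R) - \Phi(z_L)])$ with $z = (c - \mu)/\sigma$. A rough numpy/scipy sketch (the function names here are hypothetical) of the interval-censored log-likelihood and its $\mu$-gradient, which can be checked against finite differences:

```python
import numpy as np
from scipy.stats import norm

def censored_normal_loglik(mu, sigma, t, d, cL, cR):
    # d_i = 1: exact observation t_i contributes log f_T(t_i)
    # d_i = 0: interval (cL_i, cR_i] contributes log[F_T(cR_i) - F_T(cL_i)]
    obs = norm.logpdf(t, loc=mu, scale=sigma)
    cens = np.log(norm.cdf((cR - mu) / sigma) - norm.cdf((cL - mu) / sigma))
    return np.sum(np.where(d == 1, obs, cens))

def censored_normal_grad_mu(mu, sigma, t, d, cL, cR):
    # analytic gradient of the log-likelihood with respect to mu:
    # observed rows: (t_i - mu) / sigma^2
    # censored rows: -[phi(z_R) - phi(z_L)] / (sigma * [Phi(z_R) - Phi(z_L)])
    z_R, z_L = (cR - mu) / sigma, (cL - mu) / sigma
    obs = (t - mu) / sigma**2
    cens = -(norm.pdf(z_R) - norm.pdf(z_L)) / (sigma * (norm.cdf(z_R) - norm.cdf(z_L)))
    return np.sum(np.where(d == 1, obs, cens))
```

Left censoring corresponds to `cL = -inf` and right censoring to `cR = inf`, in which case the CDF/PDF terms at the infinite endpoint vanish and the familiar one-sided expressions fall out.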

driesvr commented 3 years ago

Hi Alejandro, that looks excellent! Looking forward to taking it for a spin once it becomes available.


alejandroschuler commented 3 years ago

@driesvr I think I've got a working implementation for the normal. You can use it by pulling this branch. Example code:

import jax.numpy as np
from ngboost.distns import Normal
from ngboost.scores import LogScore, CRPScore
from ngboost.distns.simple_normal import Normal as SimpleNormal

from jax.ops import index_update, index
from ngboost.censored import CensoredOutcome

# helper function to introduce interval, right, or left censoring
# also demonstrates how to construct a CensoredOutcome object
def censor_admin(y, lower=-np.inf, upper=np.inf):
    observed = np.nan*np.zeros(y.shape)
    ix_obs = (y < lower) | (upper < y)

    observed = index_update(observed, index[ix_obs], y[ix_obs])
    censored = np.array([lower, upper])*np.ones((len(y), 2))

    return CensoredOutcome(
        observed, # contains np.nan where observation was censored 
        censored  # for rows where censored, contains the lower and upper bounds of the censoring interval (e.g. [-inf, b] for a row left-censored at b). Rows where `observed` is not np.nan are ignored.
    )  

from ngboost import NGBRegressor, NGBCensored
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, Y = load_boston(return_X_y = True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# fake left-censoring (i.e. all observations must be > 20)
Y_train_censored = censor_admin(Y_train, upper=20)

ngb = NGBCensored(Dist=Normal, Score=LogScore).fit(X_train, Y_train_censored)
Y_preds = ngb.predict(X_test)
Y_dists = ngb.pred_dist(X_test)

Please play around with this and let me know where the edges are. Obviously this is experimental and built on top of the new jax implementation of ngboost so there are probably sharp bits. It also runs kind of slowly and I'm not sure yet what other optimizations would help that. Lastly, the user-facing interface is of course subject to some change.

driesvr commented 3 years ago

Thank you for looking into this! We'll take it for a spin in the next few days and we'll let you know how it works, and if we would encounter any issues.

JorisTavernier commented 3 years ago

Thank you for making this extension available.

@alejandroschuler, could it be that in your example, you still have to set the values of the non-censored input to nan in the `censored` attribute of the `CensoredOutcome` object inside the `censor_admin` function?

Currently, the non-censored inputs are evaluated twice (once as observed and once as censored) in, for example, the `_score` evaluation function.

alejandroschuler commented 3 years ago

> Thank you for making this extension available.
>
> @alejandroschuler, could it be that in your example, you still have to set the values of the non-censored input to nan in the `censored` attribute of the `CensoredOutcome` object inside the `censor_admin` function?
>
> Currently, the non-censored inputs are evaluated twice (once as observed and once as censored) in, for example, the `_score` evaluation function.

Yes, that's right. Sorry, I copied that example from an older version!
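For anyone landing here later, a plain-numpy sketch of the fix described above — masking the censoring bounds with nan on rows that were actually observed, so no row is scored both as observed and as censored. The function name is hypothetical, and the `CensoredOutcome` wrapper is replaced by a simple tuple to keep the sketch self-contained:

```python
import numpy as np

def censor_admin_fixed(y, lower=-np.inf, upper=np.inf):
    """Split y into observed values and censoring intervals.

    Returns (observed, censored):
      observed - y where the value fell outside [lower, upper], else nan
      censored - per-row [lower, upper] bounds, nan-masked on observed
                 rows so each row is scored exactly once
    """
    ix_obs = (y < lower) | (upper < y)

    observed = np.where(ix_obs, y, np.nan)
    censored = np.broadcast_to([lower, upper], (len(y), 2)).copy()
    censored[ix_obs, :] = np.nan  # the fix: mask bounds on observed rows
    return observed, censored
```

With this masking in place, a score implementation that evaluates the observed and censored parts separately no longer double-counts the uncensored rows.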