stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

minibatch_frac < 1 does not work if X is a dataframe #20

Closed Satvik closed 4 years ago

Satvik commented 4 years ago

If X is a pandas DataFrame and Y is a Series, and minibatch_frac == 1, then ngb.fit(X, Y) works with no issues.

But if minibatch_frac < 1, the sample function in ngboost.py fails on DataFrames:

    def sample(self, X, Y, params):
        if self.minibatch_frac == 1.0:
            return np.arange(len(Y)), X, Y, params
        sample_size = int(self.minibatch_frac * len(Y))
        idxs = np_rnd.choice(np.arange(len(Y)), sample_size, replace=False)
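        # X[idxs, :] assumes numpy-style indexing and raises a TypeError when X is a DataFrame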
        return idxs, X[idxs,:], Y[idxs], params[idxs, :]

Because X[idxs, :] is not valid DataFrame syntax.
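For illustration, a minimal reproduction of the mismatch (arr, df, and idxs here are just example names):

    import numpy as np
    import pandas as pd

    arr = np.arange(12).reshape(4, 3)
    df = pd.DataFrame(arr, columns=["a", "b", "c"])
    idxs = np.array([0, 2])

    arr[idxs, :]      # fine: numpy supports (row_array, slice) indexing
    df.iloc[idxs, :]  # fine: .iloc does positional row indexing on a DataFrame
    df[idxs, :]       # raises TypeError: the tuple key is treated as a column lookup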

A workaround that works on my machine:

    def sample(self, X, Y, params):
        if self.minibatch_frac == 1.0:
            return np.arange(len(Y)), X, Y, params
        sample_size = int(self.minibatch_frac * len(Y))
        idxs = np_rnd.choice(np.arange(len(Y)), sample_size, replace=False)
        try:
            X_batch = X[idxs, :]  # numpy-style indexing
        except TypeError:
            X_batch = X.iloc[idxs, :]  # positional indexing for DataFrames
        return idxs, X_batch, Y[idxs], params[idxs, :]

I'm running version 0.1.3, installed from GitHub via pip today, on macOS 10.14.13 with Python 3.7.4.

wptmdoorn commented 4 years ago

I guess it would probably be best to choose either numpy.array or pandas.DataFrame as our desired datatype for X and Y throughout the NGBoost class. This would also be beneficial for future issues/additions to the library.
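For instance, a rough sketch of the numpy option (coerce_inputs is a hypothetical helper, not part of NGBoost), using sklearn's check_X_y at the fit() boundary:

    from sklearn.utils import check_X_y

    def coerce_inputs(X, Y):
        # Hypothetical helper: validate and coerce once at the fit() boundary,
        # so sample() and everything downstream can assume plain numpy arrays.
        # check_X_y accepts DataFrames/Series and returns validated ndarrays.
        X, Y = check_X_y(X, Y, y_numeric=True)
        return X, Y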

avati commented 4 years ago

Perhaps we should follow sklearn's approach of using sample_weight for the purposes of subsampling. We currently don't support sample_weight in our fit() method, but we ought to. That way subsampling can also be implemented in a numpy/pandas-agnostic way.

If anyone wants to take a stab, here's how sklearn does it (and ideally how we would want to do it as well): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_gb.py#L1239-L1245
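For context, a rough sketch of that idea (minibatch_weights is a hypothetical helper, not NGBoost's actual API): subsampling becomes a per-row weight vector, so X and Y are never row-sliced and the numpy-vs-pandas question disappears:

    import numpy as np

    def minibatch_weights(n_samples, minibatch_frac, sample_weight=None, rng=None):
        # Hypothetical sketch: express subsampling as a per-row weight vector
        # instead of slicing X/Y, so the base learner is fit on the full
        # arrays and X is never indexed (numpy or pandas).
        rng = np.random.default_rng() if rng is None else rng
        if sample_weight is None:
            sample_weight = np.ones(n_samples)
        n_inbag = int(minibatch_frac * n_samples)
        in_batch = np.zeros(n_samples, dtype=bool)
        in_batch[rng.choice(n_samples, n_inbag, replace=False)] = True
        return sample_weight * in_batch  # out-of-batch rows get weight 0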

wptmdoorn commented 4 years ago

I would like to give this a shot in a PR this coming weekend.

avati commented 4 years ago

Sure, please submit a PR!

wptmdoorn commented 4 years ago

Hi, for sure. I made a start at https://github.com/stanfordmlgroup/ngboost/pull/32, but it has some fundamental issues (mainly with the subsampling). If you have time in the coming week, feel free to modify the code or start completely from scratch; otherwise I can get to it in the coming weeks.

On Fri, 22 Nov 2019 at 21:30, YutaYamazaki notifications@github.com wrote:

Can I work on this issue?


alejandroschuler commented 4 years ago

@avati I've now implemented sample weighting, but I think that doing batching via sample weights makes the code a little less modular... I could be convinced otherwise.

I also think we should be careful about supporting DataFrames for the time being, because they bring other issues with them, like non-numeric features. The user should have to think carefully about how any preprocessing is happening, so if we do support DataFrames, we should also add enough input validation that users can't do crazy things. What counts as crazy may also depend on the base learner: I'm not sure, but I assume sklearn's linear models need numeric predictors, whereas the regression trees are perhaps built to handle categorical (and ordinal?) columns in DataFrames. I don't know. It seems much safer overall, imo, to put the onus on the user to one-hot encode/ordinal-encode/etc. their data into a numpy array (see the sketch below).
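As a concrete sketch of the preprocessing being suggested (the data here is made up), the user would encode categoricals themselves and hand NGBoost a purely numeric array:

    import pandas as pd

    # Hypothetical user-side preprocessing: one-hot encode the categorical
    # column and pass NGBoost a purely numeric ndarray.
    df = pd.DataFrame({"size": [1.0, 2.5, 0.7], "color": ["red", "blue", "red"]})
    X = pd.get_dummies(df, columns=["color"]).to_numpy(dtype=float)
    # X is now a (3, 3) float array: size, color_blue, color_red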

Given all that, I don't think it makes sense to explicitly support DataFrames for the time being, which makes this issue moot. I'm going to close it for now, but please feel free to discuss and push back, and I'll reopen if I'm in the minority.