Closed Satvik closed 4 years ago
I guess it would probably be best to either chose numpy.array
or pandas.DataFrame
as our desired datatype for X and Y in the whole NGBoost
class. This would also be benefical in further issues/additions to the library.
Perhaps we should follow sklearn's approach of using sample_weight for the purposes of subsampling. We currently don't support sample_weight in our fit() method, but we ought to. That way we subsampling can also be implemented in a numpy/pandas agnostic way.
If anyone wants to take a stab, here's how sklearn does it (and ideally how we would want to do it as well): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_gb.py#L1239-L1245
I would like to give this PR a shot coming weekend.
Sure, please submit a PR!
Hi, for sure. I made a start at https://github.com/stanfordmlgroup/ngboost/pull/32 but it has some fundamental issues (mainly with subsampling). If you have the time the coming week, feel free to modify the code or start completely from scratch. Otherwise I can do it in the coming weeks.
Op vr 22 nov. 2019 om 21:30 schreef YutaYamazaki notifications@github.com:
Can I work for this issue?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stanfordmlgroup/ngboost/issues/20?email_source=notifications&email_token=AGNRCIQTMR27V6I4CN5PA4DQVA6OBA5CNFSM4JI5ORF2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEE6ZCGY#issuecomment-557682971, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGNRCIQO5FJKWBFLG5MHC3DQVA6OBANCNFSM4JI5ORFQ .
-- Met vriendelijke groet,
William van Doorn https://nl.linkedin.com/pub/william-van-doorn/85/ab3/bba
@avati I've now implemented sample weighting but I think that doing batching via sample weights makes the code a little less modular... could be convinced otherwise.
I also think we should be careful about supporting dataframes for the time being because there are other issues that come with them, like dealing with non-numeric features. The user should have to think carefully about how any preprocessing is happening, so if we're supporting dataframes, we should also put in enough input validation so that users don't do crazy things. What's crazy or not may also depend on the base learner- I'm not sure, but I assume sklearn's linear models need numeric predictors, whereas perhaps the regression trees are built to deal with categories in dataframes? And ordinals? I don't know. Much safer overall imo to put the onus on the user to one-hot encode/ordinal-encode/etc. their data into a numpy array.
Given all that, I don't think it makes sense for the time being to support dataframes explicitly, which makes the issue moot. I'm going to close the issue for now, but please feel free to discuss and push back on me and I'll reopen if I'm in the minority.
If X is a pandas dataframe and Y is a series, and minibatch_frac == 1, then
ngb.fit(X, Y)
works with no issues.But if minibatch_frac < 1, then the function
sample
inngboost.py
fails to work on dataframes:Because
X[idxs, :]
is not valid DataFrame syntax.A workaround that works on my machine:
I'm running version 0.1.3, installed from github via pip today, on Mac OS 10.14.13, python version 3.7.4