Thanks for using LightGBM.
> My best guess is that there is some sorting by covariates happening at each boosting stage that doesn't apply to my hackily tacked on dataset attributes
I'm confident that LightGBM does not shuffle the data. Row `i` in the `Dataset` you create refers to the same observation as row `i` in the provided raw data.
> something is going wrong during fitting
I don't totally understand what you're trying to do here... this is a very large amount of provided code, and it's not clear to me which parts I should be looking at or what "something is going wrong during fitting" means specifically. If you could provide a much smaller example of what you're trying to accomplish, it'd help.
But even without that, I think I can still provide some useful resources.
First, see this example of using a custom objective function in the Python package: https://github.com/microsoft/LightGBM/blob/fe69fa9f43f6fc051730e65008184658c991106e/examples/python-guide/advanced_example.py#L139-L147. That's a classification example, but the interface is identical... you provide a function which, at each iteration, is given two things:
* `preds` = predictions (in terms of the objective function, without transformation) for the ensemble so far
* `train_data` = the `lightgbm.Dataset` object you're training on

and which is responsible for computing gradients and hessians.
It sounds to me like you want to be able to access other data inside such a custom objective function which is not features in the `Dataset`. I can think of a few options for doing that:
# option 1: make objective a callable class, store that other data on the class
from typing import Tuple

import lightgbm
import numpy as np


class CustomObjective:
    def __init__(self, time: np.ndarray):
        self.time = time

    def __call__(self, preds: np.ndarray, train_data: lightgbm.Dataset) -> Tuple[np.ndarray, np.ndarray]:
        # your code that calculates the gradient and hessian
        grad, hess = your_custom_code(preds, train_data, self.time)
        return grad, hess
# option 2: store arbitrary data on the `Dataset` object itself
dtrain = lgb.Dataset(data=X, label=y, ...)
dtrain.time = time

def custom_objective(preds: np.ndarray, train_data: lightgbm.Dataset) -> Tuple[np.ndarray, np.ndarray]:
    # your code that calculates the gradient and hessian
    grad, hess = your_custom_code(preds, train_data, train_data.time)
    return grad, hess

lgb.train(train_set=dtrain, params={"objective": custom_objective}, ...)
See the note on writing custom objective functions at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train.
Hi @jameslamb thanks for the quick response, and apologies for the verbose code blob and ambiguous question... I was on a flight managing my screaming 1.5 yr old at the time, and just wasn't thinking. Very kind of you to answer in spite of that.
Your response is definitely helpful. Seems like I'm generally on the right track.
By "something is going wrong during training," I meant that when I give it the ground truth location params and a well behaved conditional scale function, it produces very poor results.
Given your input and some further investigation, my guess is that it boils down to either the Dataset.construct method somehow wiping out the .loc attribute I attached, or something squirrelly going on with the eval function.
I think my next move is to just drop the sklearn wrapper, which is only complicating things. I'll post more here once I have it either solved or when I hit another block.
In case you're interested, the relevant chunks of the code above are the following. First, the redefinition of `Dataset`, where I set `.loc` and `.log_scale`:
def __init__(self, data, label=None, reference=None,
             weight=None, group=None, init_score=None, silent='warn',
             loc=None, log_scale=None,
             feature_name='auto', categorical_feature='auto', params=None,
             free_raw_data=True):
    ...
    self.loc = loc
    ...
Next, this chunk of `_CustomObjectiveFunctionWrapper` (and something equivalent in the eval function wrapper), where I pass the particular args I'm looking for on to the objective function as kwargs (a sketch of the kind of objective this forwards to is below, after the last chunk):
params = signature(self.func).parameters
argc = len(params)
extra_params = {}
if "loc" in params:
    extra_params["loc"] = dataset.get_loc()
if "log_scale" in params:
    extra_params["log_scale"] = dataset.get_log_scale()
# print("extra_params of wrapper", extra_params)
argc = argc - len(extra_params)
if argc == 2:
    grad, hess = self.func(labels, preds, **extra_params)
elif argc == 3:
    grad, hess = self.func(labels, preds, dataset.get_group(), **extra_params)
else:
    raise TypeError(f"Self-defined objective function should have 2 or 3 arguments, got {argc}")
And finally, the redefinition of `_construct_dataset`, where I include the `loc` and `log_scale` args:
def _construct_dataset(X, y, sample_weight, init_score, group, params,
                       categorical_feature='auto', loc=loc, log_scale=log_scale):
    return Dataset(X, label=y, weight=sample_weight, group=group,
                   init_score=init_score, params=params,
                   categorical_feature=categorical_feature, loc=loc, log_scale=log_scale)
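In case it helps make that wrapper chunk concrete, the kind of objective it forwards `loc` to looks roughly like the following. This is only a sketch for the logistic variance-regression case (known per-row location, learning the log-scale), with the gradient and hessian re-derived here rather than copied from my actual code:

import numpy as np

def logistic_log_scale_objective(labels, preds, loc):
    # model: y | x ~ Logistic(loc, exp(s(x))); preds are the raw scores s(x)
    # per-row negative log-likelihood: z + s + 2 * log(1 + exp(-z)), with z = (y - loc) * exp(-s)
    z = (labels - loc) * np.exp(-preds)
    t = np.tanh(z / 2.0)
    grad = 1.0 - z * t                              # d NLL / d s
    hess = z * t + 0.5 * (z ** 2) * (1.0 - t ** 2)  # d^2 NLL / d s^2, always >= 0
    hess = np.maximum(hess, 1e-6)                   # floor to avoid zero hessians
    return grad, hess

Since this takes only `labels`, `preds`, and the extra `loc` kwarg, the signature inspection above ends up with argc == 2 after subtracting the extras and calls it through the two-argument branch.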
Ha, no problem at all! I'm happy to help... although I personally don't know much about Poisson process regression, so I can only help in the "here's how LightGBM works" sense.
I definitely recommend trying to make this work using the `lightgbm.train()` interface (instead of the scikit-learn estimators) first, as you mentioned. That'll help you narrow things down a lot, and should make it possible to implement what you want using one of the approaches I mentioned with 0 changes to `lightgbm`'s internals... since a custom objective function passed to `lightgbm.train()` will be given the `Dataset` object directly (instead of the extracted predictions, labels, and weights as `numpy` arrays).
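Roughly, that route could look like this. It's an untested sketch with synthetic data and made-up names (`exposure`, `custom_objective`), and I'm only guessing at the Poisson-process likelihood from your description, so treat the math as yours to verify:

import lightgbm as lgb
import numpy as np

# untested sketch: synthetic data, and `exposure` is just an illustrative name
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
exposure = rng.uniform(0.5, 2.0, size=500)          # extra per-row data, NOT a feature
y = rng.poisson(np.exp(0.3 * X[:, 0]) * exposure)   # y | x, t ~ Poisson(mu(x) * t)

dtrain = lgb.Dataset(X, label=y)
dtrain.exposure = exposure                          # attach it, like "option 2" above

def custom_objective(preds, train_data):
    # train_data is the same Dataset object you created, so train_data.exposure
    # is still row-aligned with preds and train_data.get_label()
    mu_t = np.exp(preds) * train_data.exposure      # mu(x) * t with mu(x) = exp(f(x))
    grad = mu_t - train_data.get_label()            # d(-loglik)/df
    hess = mu_t                                     # d2(-loglik)/df2
    return grad, hess

booster = lgb.train(
    params={"objective": custom_objective, "verbosity": -1},
    train_set=dtrain,
    num_boost_round=50,
)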
I'm going to close this as there hasn't been a new comment in the last year. Come back any time if you need more help or have other questions 👋🏻
Summary
Forward columns of your dataset directly to a custom objective function.
Motivation
This is useful for semi-parametric models like Poisson process regression, where y | x, t ~ Poisson(mu(x) * t) and we wish to learn mu(), or a variance regression where the mean is known, e.g. y | x, m ~ Logistic(m, exp(s(x))) and we wish to learn s().
It's possible that lightgbm would pick up on that model if you were to just feed it those parameters as features with no additional context, but consider the case where you're using the Poisson process regression for causal inference and the time intervals are correlated with treatment status. Including them naively could lead to unpredictable biases.
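Concretely, with a raw score f(x) and mu(x) = exp(f(x)), the per-row negative log-likelihood for the Poisson-process case and its derivatives are (standard algebra, written out here just to show where t enters):

$$
\ell(f) = e^{f} t - y\,(f + \log t) + \text{const}, \qquad
\frac{\partial \ell}{\partial f} = e^{f} t - y, \qquad
\frac{\partial^2 \ell}{\partial f^2} = e^{f} t
$$

so the time interval t multiplies into the gradient and hessian row by row and has to stay aligned with the predictions and labels, rather than being handed to the model as just another feature.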
Description
My optimal interface would look something like this
I recognize that this is probably not widely useful enough to add as a feature to main, and might need some custom work for each use case, but I'm really struggling to make it work for my case. I've tried modifying the `Dataset`, the objective function wrapper, the eval function wrapper, and inheriting from `LGBMRegressor` and modifying the fit method to look for `loc` and `log_scale` parameters, as below, but something seems to be going wrong during fitting. My best guess is that there is some sorting by covariates happening at each boosting stage that doesn't apply to my hackily tacked-on dataset attributes, so they end up misaligned with predictions and labels. I'm hoping that doesn't mean it requires digging into the C++... I would REALLY appreciate anyone's advice on how to do this.