vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[FEATURE-REQUEST] Weighted samples with LightGBM #2332

Open · dlewis-esure opened this issue 1 year ago

dlewis-esure commented 1 year ago

Hi there,

I've been using vaex for some ML work and it's been incredibly useful, so thanks for that! I was wondering whether it would be possible to introduce sample weights in the LightGBM wrapper?

By the looks of it, the most straightforward way would be to add it to the dtrain dataset: lightgbm.Dataset accepts a 'weight' parameter that is taken into account during training, but vaex doesn't currently pass it through (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html#lightgbm.Dataset). The vaex.ml.lightgbm.LightGBMModel.fit method could be changed to something like the snippet below, with a weight_col parameter that is forwarded to lightgbm.Dataset:

def fit(self, df, valid_sets=None, valid_names=None, early_stopping_rounds=None,
        evals_result=None, verbose_eval=None, weight_col=None, **kwargs):
    # Forward the optional per-sample weights to the training Dataset
    weight = df[weight_col].to_numpy() if weight_col is not None else None
    dtrain = lightgbm.Dataset(df[self.features].values,
                              df[self.target].to_numpy(),
                              weight=weight)
...
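
For context, calling it would then look something like this; just a sketch, where the weight_col argument and the 'sample_weight' column are the hypothetical parts and the rest follows the existing vaex.ml.lightgbm.LightGBMModel API:

import vaex
import vaex.ml.lightgbm

df = vaex.open('train.hdf5')  # assumed to hold the features, the target and a 'sample_weight' column

model = vaex.ml.lightgbm.LightGBMModel(
    features=['x1', 'x2'],
    target='y',
    params={'objective': 'binary'},
    num_boost_round=100,
)
# weight_col is the proposed new parameter; each row's value would be
# passed through to lightgbm.Dataset(..., weight=...) during training.
model.fit(df, weight_col='sample_weight')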

Or, I guess, the weight column could be defined at the same time as self.features and self.target (roughly as sketched below); not sure which would be best in your opinion! Hope this makes sense. If it's already been discussed or brought up, I apologise; I couldn't find an existing issue for it, but I might just be blind. Thanks very much 😄
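
To illustrate that alternative, here is a rough sketch of a weight_col trait declared next to the existing features and target traits. The WeightedLightGBMModel name and the trait itself are hypothetical, and it assumes the model's existing params, num_boost_round and booster attributes:

import lightgbm
import traitlets
import vaex.ml.lightgbm

class WeightedLightGBMModel(vaex.ml.lightgbm.LightGBMModel):
    # Hypothetical trait: keeping the weight column on the model means it
    # travels with the rest of the model's state, like features and target.
    weight_col = traitlets.Unicode(default_value=None, allow_none=True,
                                   help='Optional column holding per-sample weights.')

    def fit(self, df, **kwargs):
        # Same idea as the snippet above, but the column name comes from
        # the trait instead of a fit() argument.
        weight = df[self.weight_col].to_numpy() if self.weight_col is not None else None
        dtrain = lightgbm.Dataset(df[self.features].values,
                                  df[self.target].to_numpy(),
                                  weight=weight)
        self.booster = lightgbm.train(params=self.params,
                                      train_set=dtrain,
                                      num_boost_round=self.num_boost_round,
                                      **kwargs)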