microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

importance weights and mae/quantile regression #2103

Closed HansN87 closed 5 years ago

HansN87 commented 5 years ago

Environment info

Operating System: pop-os 18.10 (Ubuntu 18.10), kernel 4.18.0-15-generic

CPU/GPU model: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz

C++/Python/R version: gcc (Ubuntu 8.3.0-6ubuntu1~18.10) 8.3.0, Python 2.7.15+

LightGBM version or commit hash: lightgbm.version = 2.2.3

Error message

No error messages; the code runs fine.

The issue: the quantile regression/MAE metric increases (instead of decreasing) during boosting when importance weights for the samples are used during training/testing. The increase of the metric is not smooth: it jumps "discretely" to significantly larger values after several boosting steps that show only marginal changes in the metric.

If no importance weights are used (i.e. all weights are equal to one), the quantile regression/MAE metric decreases during boosting as expected.

I'd be grateful for any advice on how to perform quantile regression using LightGBM with weighted data. Thank you!

Reproducible examples

https://github.com/HansN87/LightGBM_quantiles/blob/master/lgbm_snippet.ipynb

Update: the same example using txt input can be found here: https://github.com/HansN87/LightGBM_quantiles/blob/master/lgbm_snippet_csv.ipynb

Steps to reproduce

Run the code snippet given above.

guolinke commented 5 years ago

It seems to be a potential bug. I will investigate it.

guolinke commented 5 years ago

@HansN87 I tried your code, but met an error when loading the data:

>>> df = pd.read_hdf('minimal_dataset.h5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 394, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 741, in select
    return it.get_result()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1483, in get_result
    results = self.func(self.start, self.stop, where)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 734, in func
    columns=columns)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2928, in read
    ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2523, in read_index
    _, index = self.read_index_node(getattr(self.group, key), **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2653, in read_index_node
    errors=self.errors), **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4563, in _unconvert_index
    errors=errors)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4656, in _unconvert_string_array
    data = libwriters.string_array_replace_from_nan_rep(data, nan_rep)
  File "pandas\_libs\writers.pyx", line 158, in pandas._libs.writers.string_array_replace_from_nan_rep
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

Could you provide the data in a text format, like CSV?

HansN87 commented 5 years ago

@guolinke thanks for your interest and offer to help. I added a txt-based notebook: https://github.com/HansN87/LightGBM_quantiles/blob/master/lgbm_snippet_csv.ipynb

HansN87 commented 5 years ago

The problem appears to be related to the occurrence of very small sample weights. Introducing a lower bound on the weights (i.e. removing such samples from the dataset) seems to stabilize the results. I'd expect samples with small weights not to contribute to the gradient/hessian calculation; a wild guess would be an underflow somewhere.
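The workaround described above (dropping samples below a weight threshold before building the Dataset) can be sketched as follows; the helper name and the threshold value are arbitrary, and the appropriate floor would depend on the data:

```python
import numpy as np

def filter_small_weights(X, y, w, min_weight=1e-6):
    """Remove samples whose importance weight falls below min_weight.

    Returns filtered copies of the feature matrix, labels, and weights,
    so the remaining samples can be passed to lgb.Dataset as usual.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    w = np.asarray(w, dtype=np.float64)
    mask = w >= min_weight
    return X[mask], y[mask], w[mask]
```

An alternative that keeps the sample count unchanged is `np.clip(w, min_weight, None)`, which floors the weights instead of discarding rows; either way, near-zero weights no longer enter the boosting updates.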

guolinke commented 5 years ago

@HansN87 there is a bug in your code: since the train/test split ratio is 0.5, the weight used in evalerror_w is wrong.
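The notebook's evalerror_w isn't shown here, but the mistake described above (pairing the eval metric with the wrong split's weights) can be avoided by reading labels and weights from the Dataset being evaluated instead of capturing an external array. A minimal sketch of such a weighted-MAE custom eval, using the standard feval signature and `Dataset.get_label()` / `Dataset.get_weight()`:

```python
import numpy as np

def evalerror_w(preds, data):
    """Weighted MAE for use as feval in lgb.train.

    Pulls labels and weights from the Dataset being evaluated, so the
    correct weights are used no matter how the data was split.
    """
    y = data.get_label()
    w = data.get_weight()
    if w is None:           # Dataset built without weights
        w = np.ones_like(y)
    loss = np.sum(w * np.abs(y - preds)) / np.sum(w)
    return "weighted_mae", loss, False  # lower is better
```

Passed as `feval=evalerror_w` with `valid_sets` holding the train and test Datasets, each split is scored against its own weights.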

guolinke commented 5 years ago

Fixed in 8ffd8d80e47e40ed6439123de85ab8bacffc1bb6