Closed: HansN87 closed this issue 5 years ago
This looks like a potential bug. I will investigate it.
@HansN87 I tried your code, but hit an error when loading the data:
>>> df = pd.read_hdf('minimal_dataset.h5')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 394, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 741, in select
return it.get_result()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1483, in get_result
results = self.func(self.start, self.stop, where)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 734, in func
columns=columns)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2928, in read
ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2523, in read_index
_, index = self.read_index_node(getattr(self.group, key), **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2653, in read_index_node
errors=self.errors), **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4563, in _unconvert_index
errors=errors)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 4656, in _unconvert_string_array
data = libwriters.string_array_replace_from_nan_rep(data, nan_rep)
File "pandas\_libs\writers.pyx", line 158, in pandas._libs.writers.string_array_replace_from_nan_rep
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'
Could you provide the data in a text-based format, such as CSV?
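As a workaround for the HDF5 dtype mismatch above, the dataset could be round-tripped through CSV on the machine where the original DataFrame still loads. A minimal sketch (the column names here are made up for illustration; the real ones come from `minimal_dataset.h5`):

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame behind minimal_dataset.h5
df = pd.DataFrame({"target": [1.0, 2.0, 3.0], "weight": [0.5, 0.25, 1.0]})

# CSV avoids the pickled-index/dtype issues that read_hdf can hit
df.to_csv("minimal_dataset.csv", index=False)
df2 = pd.read_csv("minimal_dataset.csv")
```

CSV loses dtype metadata (e.g. categoricals), but for plain float columns like sample weights the round trip is lossless.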
@guolinke thanks for your interest and offer to help. I added a txt-based notebook: https://github.com/HansN87/LightGBM_quantiles/blob/master/lgbm_snippet_csv.ipynb
The problem appears to be related to the occurrence of very small sample weights. Introducing a lower bound on the weights (i.e. removing such samples from the dataset) seems to stabilize the results. I'd expect samples with small weights not to contribute to the gradient/hessian calculation; a wild guess would be an underflow somewhere.
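The lower-bound workaround described above can be sketched as a simple mask over the weight array (the threshold value here is an assumption, chosen only for illustration):

```python
import numpy as np

weights = np.array([1e-12, 0.5, 1.0, 3e-15, 0.9])

# Hypothetical lower bound; in practice it would be tuned to the dataset
min_weight = 1e-6

# Drop samples whose weight falls below the bound, applying the same
# mask to features/labels so the arrays stay aligned
mask = weights >= min_weight
weights_kept = weights[mask]
```

The same `mask` would be applied to `X` and `y` before building the LightGBM `Dataset`, so only samples with non-negligible weights enter training.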
@HansN87 there is a bug in your code: since the train/test split ratio is 0.5, the weight array used in evalerror_w does not match the evaluation split, so the reported metric is wrong.
Fixed in 8ffd8d80e47e40ed6439123de85ab8bacffc1bb6
Environment info
Operating System: pop-os 18.10 (Ubuntu 18.10), kernel 4.18.0-15-generic
CPU/GPU model: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
C++/Python/R version: gcc (Ubuntu 8.3.0-6ubuntu1~18.10) 8.3.0, Python 2.7.15+
LightGBM version or commit hash: lightgbm.version = 2.2.3
Error message
No error messages; the code runs fine.
The issue: the quantile regression/MAE metric increases (instead of decreasing) during boosting when importance weights are used for the samples during training/testing. The increase is not smooth: the metric jumps "discretely" to significantly larger values after several boosting steps that otherwise show only marginal changes.
If no importance weights are used (i.e. all weights equal one), the quantile regression/MAE metric decreases during boosting as expected.
I'd be grateful for any advice on how to perform quantile regression with weighted data in LightGBM. Thank you!
Reproducible examples
https://github.com/HansN87/LightGBM_quantiles/blob/master/lgbm_snippet.ipynb
Update: the same example using txt input can be found here: https://github.com/HansN87/LightGBM_quantiles/blob/master/lgbm_snippet_csv.ipynb
Steps to reproduce
Run the code snippet given above.