microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Python gbm.feature_importance() error? #615

Closed vousmevoyez closed 7 years ago

vousmevoyez commented 7 years ago

Environment info

Operating System: Linux
CPU:
Python version: Python 2.7.13

Error Message:

```
ValueError: No JSON object could be decoded
```

Reproducible examples

```python
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    'task': 'train',
    'boosting': 'gbdt',
    'objective': 'binary',
    'metric': {'l2', 'auc'},
    'num_leaves': 62,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 20
}
gbm = lgb.train(params, lgb_train, num_boost_round=250, valid_sets=lgb_eval)

print('Start predicting...')

y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred = np.round(y_pred)

print(gbm.feature_importance())
```

wxchan commented 7 years ago

Tested on both Python 2 and 3, no error. Try the latest code.

vousmevoyez commented 7 years ago

Still getting the error after trying the latest code. Below is the complete traceback:

```
ValueError                                Traceback (most recent call last)
<ipython-input-14-920de1b50449> in <module>()
----> 1 gbm.feature_importance()

/home/admin/anaconda2/lib/python2.7/site-packages/lightgbm-0.2-py2.7.egg/lightgbm/basic.pyc in feature_importance(self, importance_type)
   1662         if importance_type not in ["split", "gain"]:
   1663             raise KeyError("importance_type must be split or gain")
-> 1664         dump_model = self.dump_model()
   1665         ret = [0] * (dump_model["max_feature_idx"] + 1)
   1666 

/home/admin/anaconda2/lib/python2.7/site-packages/lightgbm-0.2-py2.7.egg/lightgbm/basic.pyc in dump_model(self, num_iteration)
   1577                 ctypes.byref(tmp_out_len),
   1578                 ptr_string_buffer))
-> 1579         return json.loads(string_buffer.value.decode())
   1580 
   1581     def predict(self, data, num_iteration=-1, raw_score=False, pred_leaf=False, data_has_header=False, is_reshape=True,

/home/admin/anaconda2/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    337             parse_int is None and parse_float is None and
    338             parse_constant is None and object_pairs_hook is None and not kw):
--> 339         return _default_decoder.decode(s)
    340     if cls is None:
    341         cls = JSONDecoder

/home/admin/anaconda2/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    362 
    363         """
--> 364         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    365         end = _w(s, end).end()
    366         if end != len(s):

/home/admin/anaconda2/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    380             obj, end = self.scan_once(s, idx)
    381         except StopIteration:
--> 382             raise ValueError("No JSON object could be decoded")
    383         return obj, end

ValueError: No JSON object could be decoded
```
wxchan commented 7 years ago

try set num_boost_round=1 to see if it works.

btw, you should quote your error msg with ```

vousmevoyez commented 7 years ago

It works. But why does this happen?

wxchan commented 7 years ago

Feature importances use a string buffer passed from C++ to Python. I guess the string buffer for 250 rounds is too long and gets cut during passing.
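The buffer-truncation hypothesis can be illustrated with a small self-contained sketch. This is only an illustration of the suspected failure mode, not LightGBM's actual code; the JSON payload and buffer size are made up:

```python
import ctypes
import json

# A pretend JSON model dump, and a ctypes buffer deliberately sized too small.
payload = b'{"max_feature_idx": 219, "importance": [3, 0, 7]}'
buffer_len = 16
string_buffer = ctypes.create_string_buffer(buffer_len)

# Copy only as many bytes as fit, the way a C API writing into a
# caller-allocated buffer would; the tail of the payload is lost.
ctypes.memmove(string_buffer, payload, min(len(payload), buffer_len - 1))
truncated = string_buffer.value.decode()

print(truncated)              # prints {"max_feature_i
json.loads(payload.decode())  # the full dump parses fine
try:
    json.loads(truncated)     # the cut dump does not
except ValueError as e:
    print("truncated dump:", e)
```

If the buffer handed back from C++ really were too small, dump_model() would pass json.loads exactly this kind of cut-off string, producing the "No JSON object could be decoded" error above. (As the thread later shows, the real cause turned out to be different.)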

vousmevoyez commented 7 years ago

Sorry. My OS is NOT Windows. It's Linux.

wxchan commented 7 years ago

Oh, sorry, misread it.

wxchan commented 7 years ago

Strange, I set num_boost_round to 1M and still cannot reproduce it. You can change this line to return string_buffer.value.decode(), set num_boost_round to a big number, save gbm.dump_model() to a file, and upload it here. Then we can see if it has been cut.
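A tiny helper along these lines can report whether the raw dump string parses. `check_dump` is a hypothetical diagnostic, not part of LightGBM, and it assumes dump_model() has been patched as described above to return the raw string:

```python
import json

def check_dump(raw):
    # Hypothetical diagnostic helper: report whether a raw model-dump
    # string is valid JSON, or why parsing fails.
    try:
        json.loads(raw)
        return "ok: %d chars of valid JSON" % len(raw)
    except ValueError as e:
        return "broken: %s" % e

print(check_dump('{"max_feature_idx": 219}'))  # a complete dump parses
print(check_dump('{"max_feature_idx": 2'))     # a dump cut mid-stream fails
```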

vousmevoyez commented 7 years ago

I did what you suggested, but I can't call gbm.dump_model(); it raises the error below. How about gbm.save_model() in txt format?

```
<ipython-input-17-cf366c50211c> in <module>()
----> 1 gbm.dump_model()

/home/admin/anaconda2/lib/python2.7/site-packages/lightgbm-0.2-py2.7.egg/lightgbm/basic.pyc in dump_model(self, num_iteration)
   1577                 ctypes.byref(tmp_out_len),
   1578                 ptr_string_buffer))
-> 1579         return string_buffer.value.decode()
   1580 
   1581     def predict(self, data, num_iteration=-1, raw_score=False, pred_leaf=False, data_has_header=False, is_reshape=True,

/home/admin/anaconda2/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    337             parse_int is None and parse_float is None and
    338             parse_constant is None and object_pairs_hook is None and not kw):
--> 339         return _default_decoder.decode(s)
    340     if cls is None:
    341         cls = JSONDecoder

/home/admin/anaconda2/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    362 
    363         """
--> 364         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    365         end = _w(s, end).end()
    366         if end != len(s):

/home/admin/anaconda2/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    380             obj, end = self.scan_once(s, idx)
    381         except StopIteration:
--> 382             raise ValueError("No JSON object could be decoded")
    383         return obj, end

ValueError: No JSON object could be decoded
```
vousmevoyez commented 7 years ago

FYI, my data has 4.6 million rows and 220 columns.

wxchan commented 7 years ago

It's strange: json.loads was already removed, so why does it still show "No JSON object could be decoded"? I'm still trying to reproduce this issue; I need some time.

vousmevoyez commented 7 years ago

I reran my code today. The error is different:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-17-6f3b6c156ac1> in <module>()
----> 1 bst.feature_importance()

/home/admin/anaconda2/lib/python2.7/site-packages/lightgbm-0.2-py2.7.egg/lightgbm/basic.pyc in feature_importance(self, importance_type)
   1663             raise KeyError("importance_type must be split or gain")
   1664         dump_model = self.dump_model()
-> 1665         ret = [0] * (dump_model["max_feature_idx"] + 1)
   1666 
   1667         def dfs(root):

TypeError: string indices must be integers
```
wxchan commented 7 years ago

dump_model() seems to work now. Can you try dump_model() again?

vousmevoyez commented 7 years ago

model.zip

wxchan commented 7 years ago

Strange, model.json does not seem to have been cut. Try this:

```python
import json
json.loads(gbm.dump_model())
```
vousmevoyez commented 7 years ago
```
ValueError                                Traceback (most recent call last)
<ipython-input-61-7d5d098ecca5> in <module>()
      1 import json
----> 2 json.loads(gbm.dump_model())

/home/admin/anaconda2/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    337             parse_int is None and parse_float is None and
    338             parse_constant is None and object_pairs_hook is None and not kw):
--> 339         return _default_decoder.decode(s)
    340     if cls is None:
    341         cls = JSONDecoder

/home/admin/anaconda2/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    362 
    363         """
--> 364         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    365         end = _w(s, end).end()
    366         if end != len(s):

/home/admin/anaconda2/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    380             obj, end = self.scan_once(s, idx)
    381         except StopIteration:
--> 382             raise ValueError("No JSON object could be decoded")
    383         return obj, end

ValueError: No JSON object could be decoded
```
wxchan commented 7 years ago

I think I found the reason. Can you also save_model() and upload it here?

vousmevoyez commented 7 years ago

model_txt.zip

wxchan commented 7 years ago

Thanks for your help. As a temporary fix, you can change this line https://github.com/Microsoft/LightGBM/blob/master/src/io/tree.cpp#L369 to `str_buf << "\"threshold\":" << Common::AvoidInf(threshold_[index]) << "," << std::endl;` and change the python-package back. Infinite numbers cannot be handled by JSON. I will fix this later.
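Why an infinite threshold breaks the dump can be seen directly in Python's json module: a C++ stream renders an infinite double as `inf`, which is not a JSON token, while Python's parser only recognizes the non-standard literal `Infinity`. A minimal sketch (the JSON fragments are made up for illustration):

```python
import json
import math

# "inf", the way a C++ stream prints an infinite double, is not valid JSON.
try:
    json.loads('{"threshold": inf}')
except ValueError as e:
    print("parse failed:", e)

# Python's json module does accept the non-standard literal "Infinity",
# but that is parser-specific; clamping the value on the C++ side
# (Common::AvoidInf) keeps the dump parseable everywhere.
value = json.loads('{"threshold": Infinity}')["threshold"]
print(math.isinf(value))  # True
```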

guolinke commented 7 years ago

@wxchan You can fix this line: https://github.com/Microsoft/LightGBM/blob/master/src/io/tree.cpp#L85

wxchan commented 7 years ago

@guolinke add Common::AvoidInf to threshold_double? I will keep the change on L369 as well; it helps when a user loads an old model.

guolinke commented 7 years ago

@wxchan yes and okay.

vousmevoyez commented 7 years ago

@wxchan Thanks, it works.

wxchan commented 7 years ago

@vousmevoyez you can pip install simplejson; it's more efficient and has better error messages.

vousmevoyez commented 7 years ago

OK.

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.