microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python-package] reset_parameter() segfaults when passing an unrecognized parameter #6479

Open chris-hite-akuna opened 3 months ago

chris-hite-akuna commented 3 months ago

Description

reset_parameter() segfaults when it is passed a parameter key that LightGBM does not recognize.

Reproducible example

m.reset_parameter({'kelp_var1': 123456789})
[LightGBM] [Warning] Unknown parameter: kelp_var1
Segmentation fault (core dumped)

Environment info

LightGBM version or commit hash:

lightgbm 4.3.0 py38h17151c0_0 conda-forge-remote

Command(s) you used to install LightGBM

conda upgrade lightgbm

Additional Comments

I'm looking for a clean way to store some user metadata about the training data in the model file, so that I can avoid models being used in the wrong context. For example, I'm filtering the training data. I've noticed I can modify the model file directly and it still loads and saves.

jameslamb commented 3 months ago

Thanks for using LightGBM.

We need more details than this to help you.

And just to set the right expectation... segfaulting should never happen so that part is a bug, but you cannot use reset_parameter() to track arbitrary custom data about a model. You'll have to do that some other way (for example, write out a JSON file next to wherever you store your model).
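A minimal sketch of that JSON-next-to-the-model approach, using only the standard library. The `.meta.json` suffix, the helper names, and the `training_filter` key are assumptions made for illustration, not LightGBM conventions, and a placeholder file stands in for a model saved with `Booster.save_model()`.

```python
import json
import tempfile
from pathlib import Path


def metadata_path(model_path):
    """Return the sidecar path, e.g. m0.txt -> m0.txt.meta.json (naming is an assumption)."""
    model_path = Path(model_path)
    return model_path.with_name(model_path.name + ".meta.json")


def write_metadata(model_path, metadata):
    """Write user metadata as JSON next to the model file and return the sidecar path."""
    path = metadata_path(model_path)
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return path


def read_metadata(model_path):
    """Load the metadata stored next to the model file."""
    return json.loads(metadata_path(model_path).read_text())


# Demo: a placeholder file stands in for a real LightGBM model file.
workdir = Path(tempfile.mkdtemp())
model_file = workdir / "m0.txt"
model_file.write_text("placeholder for a LightGBM model file\n")

# Hypothetical metadata keys; use whatever describes your training data.
write_metadata(model_file, {"kelp_var1": 123456789, "training_filter": "example"})
loaded = read_metadata(model_file)
```

Code that loads the model can call `read_metadata()` first and refuse to continue if the recorded context does not match.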

chris-hite-akuna commented 3 months ago

$ python --version
Python 3.8.10
$ uname -a
Linux cof-dev-l501 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

>>> import lightgbm as lgb
>>> m = lgb.Booster(model_file='m0.txt')  # you'll need some model
>>> type(m)
<class 'lightgbm.basic.Booster'>
>>> m.reset_parameter({'kelp_var1': 123456789})
[LightGBM] [Warning] Unknown parameter: kelp_var1
Segmentation fault (core dumped)

I realize I'm asking for a feature there. Thanks for making it clear that it doesn't exist. Yeah, we can work around it. If I have a model that should only be used on Tuesdays, I can also add "tuesday" to the filename and encode it that way. It would just be nice to have it inside the file itself. I guess I can always make a feature request.

jameslamb commented 3 months ago

Thanks for that.

I personally would be -1 on the idea of LightGBM supporting storage of arbitrary extra data in model files. That'd add complexity and maintenance burden to this project for, in my opinion, not much value compared to just writing your own data alongside the model.

Write that data to another file and store it alongside the model. If you need to create a single artifact, write multiple files and zip them up in an archive with tar or zip or similar.
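For the single-artifact case, something like the following could work. It is only a sketch using Python's standard `zipfile` module: the member names `model.txt` and `metadata.json`, the helper names, and the `valid_on` key are assumptions, and a placeholder file stands in for a real model file.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def bundle_model(model_path, metadata, archive_path):
    """Pack the model file and a JSON metadata document into one zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(model_path, arcname="model.txt")
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
    return Path(archive_path)


def load_bundle(archive_path, extract_dir):
    """Extract the model file and return (path to extracted model, metadata dict)."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extract("model.txt", path=extract_dir)
        metadata = json.loads(zf.read("metadata.json"))
    return Path(extract_dir) / "model.txt", metadata


# Demo: a placeholder file stands in for a real LightGBM model file.
workdir = Path(tempfile.mkdtemp())
model_file = workdir / "m0.txt"
model_file.write_text("placeholder for a LightGBM model file\n")

archive = bundle_model(model_file, {"valid_on": "tuesday"}, workdir / "m0.zip")
restored_model, restored_meta = load_bundle(archive, workdir / "unpacked")
```

The extracted `model.txt` path can then be passed to `lgb.Booster(model_file=...)` once the metadata has been checked.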