Loading a Python trained XGBoost model

dennismphil commented 6 years ago

I am attempting to load a model that was trained in Python and exported as a JSON dump.

Relevant portion of the Python code that exports to JSON:

# Python snippet
# The model was trained using the Python sklearn API of XGBoost
# (https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py).
import json
import os
...
...
# get_dump returns a list with one JSON string per tree
mdl_json = mdl.get_booster().get_dump(dump_format='json')
with open(os.path.join(PROJECT_PATH, 'model.json'), 'w') as handle:
    handle.write(json.dumps(mdl_json))

Attempt to load this in JavaScript:

/* Javascript snippet */

require('ml-xgboost')
    .then(XGBoost => {
        // and load it
        XGBoost.loadFromModel('./model.json');
    }).catch((error) => {
        console.error(error);
    });

yields

[12:08:29] dmlc-core/include/dmlc/./logging.h:300: 
[12:08:29] src/learner.cc:299: Check failed: fi->Read(&name_obj_[0], len) == len 
(73196083 vs. 1851546400) BoostLearner: wrong model format
5356640 - Exception catching is disabled, this exception cannot be caught. 
Compile with -s DISABLE_EXCEPTION_CATCHING=0 or DISABLE_EXCEPTION_CATCHING=2 to catch.

Is this scenario covered in the loadFromModel method?

vatsan commented 6 years ago

@JeffersonH44 In what format do you expect a model that was trained in another language? Is it just a bytearray of the pickled XGBoost model, or is it a pickle of the Booster? If we've trained the model using the sklearn Python API of XGBoost (https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py), how can we load it in this JavaScript library?

JeffersonH44 commented 6 years ago

Hello @dennismphil @vatsan,

To save the model using the sklearn API:

# clf is the classifier
clf._Booster.save_model('your-file-name')

With that generated file you can load the model in the library. For other languages, you should check which method generates a model file that can be used there (in this case I target the C API).
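A minimal sketch of the export step, for anyone landing here from the sklearn API. `export_booster` is a hypothetical helper name, not part of XGBoost; it only relies on `get_booster()` (already used in the question) and `Booster.save_model()`, and it assumes `clf` is an already fitted estimator.

```python
def export_booster(clf, path):
    """Save the Booster wrapped by a fitted sklearn-style XGBoost estimator.

    Hypothetical helper: `clf` is assumed to be a fitted estimator such as
    xgboost.XGBClassifier, and `path` is where the model file is written.
    """
    # get_booster() is the public accessor for the underlying Booster,
    # the same object the reply above reaches through clf._Booster.
    booster = clf.get_booster()
    # save_model() writes the model itself; get_dump() only returns a
    # per-tree text description meant for inspection.
    booster.save_model(path)
    return path


# Usage, assuming xgboost is installed and X, y are your training data:
#   from xgboost import XGBClassifier
#   clf = XGBClassifier().fit(X, y)
#   export_booster(clf, 'your-file-name')
```

Recent XGBoost releases choose the on-disk format from the file extension, so it is worth checking that the file you write matches what your ml-xgboost build expects; that compatibility is not verified here.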

Another thing to take into account: if you are doing multi-class classification, you should run a label encoder over your labels before training, and you can then pass the resulting translations to our classifier as well. This is because the model is originally trained on the probabilities of each class (in the case of Python, the wrapper uses an internal label encoder that doesn't belong to XGBoost). If you don't do this, the predict method will return the probabilities of each class.
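To illustrate the label-encoding point, here is a small stdlib-only sketch. All function names are hypothetical; scikit-learn's `LabelEncoder` does the same job. It maps labels to integer codes before training and keeps the class list so per-class probabilities can be translated back to labels afterwards.

```python
def fit_label_mapping(labels):
    """Return the sorted distinct labels; a label's index is its integer code."""
    return sorted(set(labels))


def encode(labels, classes):
    """Translate raw labels into integer codes for training."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def decode_probabilities(probabilities, classes):
    """Pick the most probable class and translate it back to its label."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]


raw_labels = ["cat", "dog", "bird", "dog"]
classes = fit_label_mapping(raw_labels)   # ['bird', 'cat', 'dog']
y = encode(raw_labels, classes)           # [1, 2, 0, 2]

# One row of per-class probabilities, as a multi-class predict might return.
print(decode_probabilities([0.1, 0.7, 0.2], classes))  # cat
```

The `classes` list is the "translation" to keep alongside the exported model, so that whatever loads the model can turn class indices back into the original labels.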

dennismphil commented 6 years ago

This works like a charm. Thank you very much! 🥇 The biggest hurdle was understanding which format to export from Python. It might be nice to add these two lines of code to the README; it would be super helpful for people starting to experiment with this library.

mljs / xgboost

Loading a Python trained XGBoost model #5