microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python-package] How do I use lgb.Dataset() with lgb.Predict() without using pandas df or np array? #6285

Closed wil70 closed 2 months ago

wil70 commented 9 months ago

Description

I'm trying Optuna and FLAML. I'm able to train models (lgb.train) with Optuna using CSV and bin files as input for the training and validation datasets. This is great, as the speed is good. The problem is with prediction (Booster.predict): I'm not able to get good speed because I need to go via a pandas DataFrame or a NumPy array. Is there a way to bypass those and use lgb.Dataset()?

Reproducible example

I have big datasets (CSV and bin). I would like to use those with lgb.Dataset('train.csv.bin') instead of a pandas DataFrame from pd.read_csv('train.csv'), 1) for speed reasons, and 2) for consistency with how LightGBM (the CLI version) handles "na" and "+-inf", which pandas handles differently.


    import lightgbm as lgb

    params = {
        "objective": "multiclass",
        #"metric": "multi_logloss,multi_error,auc_mu",
        "metric": "multi_error",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "num_threads": 10,
        "num_class": 2,
        "ignore_column": "1",
        "label_column": "10",
        "categorical_feature": "8,9",
        "data": "train.csv.bin",
        "valid_data": "validate.csv.bin",
    }

    dtrain = lgb.Dataset("train.csv.bin", params=params)
    dval = lgb.Dataset("validate.csv.bin", params=params)

    model = lgb.train(
        params,
        dtrain,
        valid_sets=[dval],
        callbacks=[lgb.early_stopping(1), lgb.log_evaluation(100)],
    )

    model.save_model("model.txt")

    # Load the model from file
    model = lgb.Booster(model_file="model.txt")

    # Get the true labels
    y_true = dval.get_label()

    # Get the predicted probabilities
    y_pred = model.predict(dval.get_data())
    # **Error: Exception: Cannot get data before construct Dataset**

    #y_pred = model.predict(dval.data)
    # **Error: lightgbm.basic.LightGBMError: Unknown format of training data.
    #   Only CSV, TSV, and LibSVM (zero-based) formatted text files are supported.**

How can I achieve this? How do I specify that all columns are features except column 10, and that column 1 should be ignored? I tried feeding the params to lgb.Dataset, but that didn't work.

Environment info

Win10 Pro + Python 3.12.0 + latest Optuna

LightGBM version or commit hash: Latest as of today

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

wil70 commented 8 months ago

If there is no reply to the question, then maybe this is a feature enhancement request?

This would be a great feature enhancement for large datasets. LightGBM is good at handling big datasets for training and validation with its C++ engine; keeping the same performance for the testing phase as well would be a big plus.

In my code, all is good until after the line model = lgb.Booster(model_file='model.txt'). If we could directly use a LightGBM Dataset to predict from the model (model.predict(...)), that would solve the issue, as all the data would stay within the C++ engine and not be manipulated in Python.

jameslamb commented 8 months ago

Thanks as always for your interest in LightGBM and for pushing the limits of what it can do with larger datasets and larger models.

As you've discovered, directly calling predict() on a LightGBM Dataset isn't supported today. We already have a feature request tracking it in #2302.

The best way to get that functionality into LightGBM is to contribute it yourself. If that interests you, consider putting up a draft pull request and @-ing us for help on specific questions.

jameslamb commented 8 months ago

Panda df pd.read_csv('train.csv')

If you have large enough data that it's a significant runtime + memory problem to load it, and you're using Python, consider storing it in a different format than a CSV file. CSV is a text format and pandas is going to be doing a ton of type-guessing and type-conversion while reading that.

For example, consider storing it as a dense numpy array in the .npy file format (numpy docs) and then reading it up into a numpy matrix.
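A minimal sketch of that conversion (file names, shapes, and data here are placeholders, not from the original thread): pay the CSV parsing cost once, then reload the dense binary file on every subsequent run.

```python
import numpy as np
import pandas as pd

# Illustrative data standing in for a large training set.
pd.DataFrame(np.random.rand(1000, 10)).to_csv("train.csv", index=False)

# One-time conversion: parse the CSV once, store it as a dense binary .npy file.
np.save("train.npy", pd.read_csv("train.csv").to_numpy(dtype=np.float64))

# Later runs: fast binary load with no per-value type-guessing.
X = np.load("train.npy")
```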

Or in Parquet format and reading that into pandas (to at least eliminate most of the type-conversion overhead of CSV).

jameslamb commented 8 months ago

stay within the C++ engine and not be manipulated in Python

LightGBM also supports predicting directly on a CSV file

https://github.com/microsoft/LightGBM/blob/252828fd86627d7405021c3377534d6a8239dd69/src/c_api.cpp#L695

Have you tried that?

You could do that with the lightgbm CLI or using Booster.predict() in the Python package. Booster.predict() accepts a path to a CSV/TSV/LibSVM formatted file.

https://github.com/microsoft/LightGBM/blob/252828fd86627d7405021c3377534d6a8239dd69/python-package/lightgbm/basic.py#L1073-1075

StrikerRUS commented 2 months ago

Closed in favor of https://github.com/microsoft/LightGBM/issues/2302. We decided to keep all feature requests in one place.

You're welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.