microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[python-package] Training data partially available when loading exported lgb.Dataset #6511

Open Plenitude-ai opened 6 days ago

Plenitude-ai commented 6 days ago

Summary

From a binary file exported using the lgb.Dataset.save_binary() method, it is possible to retrieve the feature names (a list of strings) and the labels (y, as a numpy array). It is not, however, possible to retrieve the data as a numpy array: both the lgb.Dataset.data attribute and the lgb.Dataset.get_data() method return a string containing the name of the binary file. In my opinion, they should return the numpy array, just as label does. Note that the group information is also not accessible.

The data is effectively contained in the binary file, because we are able to load it and train a booster with it. We should therefore be able to get it back from the newly created lgb.Dataset object.

Motivation

This would allow us to properly investigate exported datasets, for example by computing statistics (mean, standard deviation, etc.) to understand a dataset before deciding whether to use it again to train a booster.

Description

The .data attribute and/or .get_data() method should return the proper numpy array, and the .group attribute and/or .get_group() method should return the proper list of ints.

References

I executed the following script on MQ2008, an open Microsoft learning-to-rank dataset; you should be able to reproduce the same results. I also provide the output so that you can see which attributes and methods should have their behaviour modified.

import logging

import lightgbm as lgb
from sklearn.datasets import load_svmlight_file

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s : %(message)s",
)
logging.info(f"lgb.__version__ = {lgb.__version__}")

# Function to load group information
def load_group_info(file_path):
    groups = []
    with open(file_path, "r") as f:
        qid_prev = None
        count = 0
        for line in f:
            # svmlight lines look like "<label> qid:<id> <feat>:<val> ..."
            qid = line.split()[1].split(":")[1]
            if qid != qid_prev:
                if qid_prev is not None:
                    groups.append(count)
                count = 1
                qid_prev = qid
            else:
                count += 1
        groups.append(count)
    return groups

# Load the MQ2008 training fold with sklearn's svmlight loader
data = load_svmlight_file("./MQ2008/Fold5/train.txt")
group = load_group_info("./MQ2008/Fold5/train.txt")
X, y = data[0], data[1]
logging.info(f"X.shape {X.shape}")
logging.info(f"y.shape {y.shape}")
logging.info(f"sum(group) {sum(group)}")
logging.info(f"len(group) {len(group)}")

# Create LightGBM dataset
logging.info("Creating ORIGINAL LightGBM dataset :")
lgb_data = lgb.Dataset(X, label=y, group=group, free_raw_data=False)
logging.info(f"lgb_data.data.shape : {lgb_data.data.shape}")
logging.info(f"lgb_data.label.shape : {lgb_data.label.shape}")
logging.info(f"len(lgb_data.group) : {len(lgb_data.group)}")

logging.info("ORIGINAL dataset training ...")
PARAMS = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "eval_at": 5,
    "num_iterations": 5,
    "random_state": 42,
}
lgb.train(PARAMS, lgb_data, valid_sets=[lgb_data], callbacks=[lgb.log_evaluation()])

# Save lgb Dataset
lgb_data.save_binary("lgb_data.bin")

# Load the dataset from the binary file
lgb_data_loaded = lgb.Dataset("lgb_data.bin", free_raw_data=False)
lgb_data_loaded.construct()

# We show that only feature_name and label are accessible
logging.info("\n\n\t\t\tData is not accessible from the saved binary :\n")
logging.info(f"lgb_data_loaded.feature_name) : {lgb_data_loaded.feature_name}")
logging.info(f"lgb_data_loaded.label) : {lgb_data_loaded.label}")
logging.info(f"lgb_data_loaded.label.shape) : {lgb_data_loaded.label.shape}")
logging.info(f"lgb_data_loaded.group) : {lgb_data_loaded.group}")
logging.info(f"lgb_data_loaded.data) : {lgb_data_loaded.data}")
logging.info(f"type(lgb_data_loaded.data)) : {type(lgb_data_loaded.data)}")
logging.info(f"lgb_data_loaded.get_data()) : {lgb_data_loaded.get_data()}")
logging.info(f"type(lgb_data_loaded.get_data())) : {type(lgb_data_loaded.get_data())}")
logging.info(f"lgb_data_loaded.num_data()) : {lgb_data_loaded.num_data()}")
logging.info(f"lgb_data_loaded.num_feature()) : {lgb_data_loaded.num_feature()}")

# Data is Here because it is able to train... Why can't we access it then ?
logging.info(
    "\n\n\t\t\tData is Here because it is able to train... Why can't we access it then ?\n"
)
lgb.train(
    PARAMS,
    lgb_data_loaded,
    valid_sets=[lgb_data_loaded],
    callbacks=[lgb.log_evaluation()],
)

Which outputs:

2024-07-01 13:41:43,996 - INFO : lgb.__version__ = 4.4.0
2024-07-01 13:41:44,145 - INFO : X.shape (9442, 46)
2024-07-01 13:41:44,146 - INFO : y.shape (9442,)
2024-07-01 13:41:44,146 - INFO : sum(group) 9442
2024-07-01 13:41:44,146 - INFO : len(group) 470
2024-07-01 13:41:44,146 - INFO : Creating ORIGINAL LightGBM dataset :
2024-07-01 13:41:44,146 - INFO : lgb_data.data.shape : (9442, 46)
2024-07-01 13:41:44,146 - INFO : lgb_data.label.shape : (9442,)
2024-07-01 13:41:44,146 - INFO : len(lgb_data.group) : 470
2024-07-01 13:41:44,146 - INFO : ORIGINAL dataset training ...
[LightGBM] [Info] Total Bins 9221
[LightGBM] [Info] Number of data points in the train set: 9442, number of used features: 40
[1]     training's ndcg@5: 0.766816
[2]     training's ndcg@5: 0.8209
[3]     training's ndcg@5: 0.838295
[4]     training's ndcg@5: 0.856369
[5]     training's ndcg@5: 0.865225
[LightGBM] [Info] Saving data to binary file lgb_data.bin
[LightGBM] [Info] Load from binary file lgb_data.bin
2024-07-01 13:41:44,725 - INFO : 

                        Data is not accessible from the saved binary :

2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.feature_name) : ['Column_0', 'Column_1', 'Column_2', 'Column_3', 'Column_4', 'Column_5', 'Column_6', 'Column_7', 'Column_8', 'Column_9', 'Column_10', 'Column_11', 'Column_12', 'Column_13', 'Column_14', 'Column_15', 'Column_16', 'Column_17', 'Column_18', 'Column_19', 'Column_20', 'Column_21', 'Column_22', 'Column_23', 'Column_24', 'Column_25', 'Column_26', 'Column_27', 'Column_28', 'Column_29', 'Column_30', 'Column_31', 'Column_32', 'Column_33', 'Column_34', 'Column_35', 'Column_36', 'Column_37', 'Column_38', 'Column_39', 'Column_40', 'Column_41', 'Column_42', 'Column_43', 'Column_44', 'Column_45']
2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.label) : [0. 0. 0. ... 0. 2. 0.]
2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.label.shape) : (9442,)
2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.group) : None
2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.data) : lgb_data.bin
2024-07-01 13:41:44,726 - INFO : type(lgb_data_loaded.data)) : <class 'str'>
2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.get_data()) : lgb_data.bin
2024-07-01 13:41:44,726 - INFO : type(lgb_data_loaded.get_data())) : <class 'str'>
2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.num_data()) : 9442
2024-07-01 13:41:44,726 - INFO : lgb_data_loaded.num_feature()) : 46
2024-07-01 13:41:44,726 - INFO : 

                        Data is Here because it is able to train... Why can't we access it then ?

[LightGBM] [Info] Total Bins 9221
[LightGBM] [Info] Number of data points in the train set: 9442, number of used features: 40
[1]     training's ndcg@5: 0.766816
[2]     training's ndcg@5: 0.8209
[3]     training's ndcg@5: 0.838295
[4]     training's ndcg@5: 0.856369
[5]     training's ndcg@5: 0.865225
jameslamb commented 5 days ago

Thanks for using LightGBM. Someone will help shortly.

I noticed you double-posted this here and to Stack Overflow at the same time (Stack Overflow link). Please do not do that. Maintainers here also monitor the [lightgbm] tag on Stack Overflow. I could have been spending time preparing an answer here while another maintainer was spending time answering your Stack Overflow post, which would have been a waste of maintainers' limited attention that could otherwise have been spent improving this project. Double-posting also makes it less likely that others with a similar question will find the relevant discussion and answer.

Plenitude-ai commented 5 days ago

Hello James, thank you for your reply. Of course, I didn't have that in mind; I've just deleted my Stack Overflow post. Thank you for your dedication to this amazing library!

jmoralez commented 4 days ago

Hey @Plenitude-ai, thanks for the thorough description. LightGBM's Dataset doesn't save the data as it appears in the original array: it puts features into bins, so what is saved is the bin in which each feature value was placed. Is that what you would like to get from lightgbm.Dataset.get_data()? Once the dataset has been saved, there's no way to get the original data back.
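The point about binning can be sketched with plain numpy. This is not LightGBM's actual binning algorithm, just an illustration of the general idea: once values are replaced by bin indices, nearby values collapse into the same bin and the exact originals cannot be recovered.

```python
import numpy as np

# Illustration only (NOT LightGBM's real binning): continuous feature
# values are replaced by the index of the histogram bin they fall into.
values = np.array([0.05, 0.31, 0.32, 0.90])

# Build 4 equal-width bins over the observed range.
edges = np.linspace(values.min(), values.max(), num=5)

# np.digitize maps each value to its bin index using the inner edges.
bins = np.digitize(values, edges[1:-1])
print(bins)  # → [0 1 1 3]

# 0.31 and 0.32 land in the same bin, so the exact values are lost.
assert bins[1] == bins[2]
```

Reversing this mapping can at best give you a representative value per bin, never the original array.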

jameslamb commented 4 days ago

100% agree with everything @jmoralez said.

I'll add that there is an open feature request (#5191) for being able to dump out LightGBM's binned representation as an array, which would allow you to at least partially inspect the training data.

You could subscribe to notifications there to be notified if that feature is formally added to the library. And could try some of the workarounds like https://github.com/microsoft/LightGBM/issues/5191#issuecomment-1742263175 mentioned there.

But only do that if there are genuine constraints that lead your application to only having access to a LightGBM Dataset and not the underlying data. If you can store the raw training data alongside the LightGBM Dataset (e.g. in Parquet, pickle, or npy format), you'll find that much easier and more useful than any of the workarounds described in #5191.
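A minimal sketch of the workaround suggested above, using numpy's .npz format (the filenames here are made up for illustration): store the raw arrays next to the binary Dataset at save time, and reload them whenever you need to inspect the data.

```python
import numpy as np

# Toy stand-ins for the real X, y, group from the report above.
X = np.random.default_rng(42).random((6, 3))
y = np.array([0.0, 1.0, 0.0, 2.0, 0.0, 1.0])
group = np.array([2, 4])  # query group sizes; sum(group) == len(y)

# Save the raw data alongside the LightGBM binary, e.g. next to the
# "lgb_data.bin" produced by lgb_data.save_binary(...).
np.savez("lgb_data_raw.npz", X=X, y=y, group=group)

# Later, reload the raw arrays for inspection (means, stds, ...).
raw = np.load("lgb_data_raw.npz")
print(raw["X"].mean(axis=0))
assert raw["X"].shape == (6, 3)
assert raw["group"].sum() == len(raw["y"])
```

This keeps the exact feature values available without relying on any reconstruction from the binned representation.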

Plenitude-ai commented 4 days ago

I understand. It seems my comprehension of both the fundamental implementation and the purpose of this class was incomplete (it's more low-level than I thought); thanks for pointing me in the right direction. I'll stick with pickle representations then, though I found it useful to have everything (array, label, group) in one object. The issue you referred me to looks interesting and I'll definitely subscribe! May I ask why we can still access the feature names and labels? How are they saved in the binary representation? Thanks again for your time :)
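The "everything in one object" convenience is easy to recreate with a small picklable container. This is a hypothetical sketch (the class name and fields are made up, not anything from LightGBM):

```python
import pickle
from dataclasses import dataclass

import numpy as np

# Hypothetical container bundling features, labels, and group sizes
# into one picklable object, alongside (not instead of) lgb.Dataset.
@dataclass
class RankingData:
    X: np.ndarray
    y: np.ndarray
    group: list

data = RankingData(
    X=np.ones((4, 2)),
    y=np.array([0.0, 1.0, 0.0, 2.0]),
    group=[2, 2],
)

with open("ranking_data.pkl", "wb") as f:
    pickle.dump(data, f)

with open("ranking_data.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored.X.shape == (4, 2)
assert sum(restored.group) == len(restored.y)
```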

jameslamb commented 4 days ago

May I ask why we can still access the feature names and labels? How are they saved in the binary representation?

LightGBM needs the exact values of the label (after light preprocessing like handling infinite values and NaNs) to calculate the loss, so it's always recoverable as a dense array from the Dataset object.

No preprocessing is done on feature names, so those also are always recoverable in their original form.

These data structures are almost always smaller than the raw features (and often MUCH smaller).
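The kind of "light preprocessing" of labels described above can be sketched in numpy. This is NOT LightGBM's actual code, just an illustration of the idea that infinities and NaNs are mapped to finite values while the label remains a dense float array:

```python
import numpy as np

# Hypothetical cleanup sketch (not LightGBM's implementation):
# replace NaN and infinite labels with finite sentinel values.
labels = np.array([0.0, 2.0, np.inf, np.nan, 1.0])
clean = np.nan_to_num(
    labels,
    nan=0.0,
    posinf=np.finfo(np.float32).max,
    neginf=np.finfo(np.float32).min,
)

# The label stays a dense array of the same shape, so it can always
# be handed back to the user as-is.
assert np.all(np.isfinite(clean))
assert clean.shape == labels.shape
```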

If you're interested in the lower-level details, I encourage you to look at the source code for the Dataset:

Plenitude-ai commented 3 days ago

That is interesting. I went to look at the source code, but I have to say it's a bit above my coding experience as I don't know C/C++, it is a bit hard to understand how/where the numpy array is converted into bins. I went back to reading further more the documentation and now realize that I messed things up between "bin" and ".bin"/binary. I didn't know about the bin representation of the data, for memory optimization I also found that this phrase in the FAQ was very englightening : "LightGBM constructs bin mappers to build trees, and train and valid Datasets within one Booster share the same bin mappers, categorical features and feature names etc., the Dataset objects are constructed when constructing a Booster. If you set free_raw_data=True (default), the raw data (with Python data struct) will be freed." I'm thinking it might be interesting to add this short explanation in the header of the lgb.Dataset documentation, which was my first refering point. Do you think it could be helpful ?