microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31) #4808

Open sofiavlachou28 opened 2 years ago

sofiavlachou28 commented 2 years ago

Hello to everyone!!

I am new to Python and I am getting this error when running LightGBM on a ranking problem: lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)

I tried to search for this error, but could not find many useful resources.

I can't tell where the error occurs. My dataset consists of 4 columns, ["Frequency", "Comments", "Likes", "Nwords"], as seen below.

# 1) Load Dependencies
import pandas as pd
from numpy import where
import matplotlib.pyplot as plt
import numpy as np
from numpy import unique
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
import lightgbm as lgb
gbm = lgb.LGBMRanker()

# 2) Load the Data
# Define Columns
names = ["Frequency","Comments", "Likes", "Nwords"]]

data = pd.read_csv("Posts.csv", encoding="utf-8", sep=";", delimiter=None,
                 names=names, delim_whitespace=False,
                 nrows=181,header=0, engine="python")
X = data.values[:,0:2]
y = data.values[:,3]

# 3) Define the Training Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

query_train = [X_train.shape[0]]
query_val = [X_val.shape[0]]
query_test = [X_test.shape[0]]
# label_values = [label_gain(1,5)]

gbm.fit(X_train, y_train, group=query_train,
        eval_set=[(X_val, y_val)], eval_group=[query_val], #values=[label_gain(1,5)],
        eval_metric=["ndcg"],eval_at=[1, 2, 3,4,5], early_stopping_rounds=10)

# make predictions
test_pred = gbm.predict(X_test)
X_test["predicted_ranking"] = test_pred
X_test.sort_values("predicted_ranking", ascending=False)

Can anyone help me??

Thank you in advance !!

Sofia

jameslamb commented 2 years ago

Hi @sofiavlachou28 , thanks very much for using LightGBM!

I'd be happy to help you, but we need a little more information.

  1. What version of lightgbm are you using and how did you install it?
  2. Are you able to provide access to the raw data ("Posts.csv"), or to replicate this problem using randomly-created data?
guolinke commented 2 years ago

@sofiavlachou28 please refer to parameter label_gain

sofiavlachou28 commented 2 years ago


@jameslamb I use version 3.3.1. Also, I uploaded my csv dataset for better understanding: Posts.csv

Thanks for your time!! :)

sofiavlachou28 commented 2 years ago

@jameslamb Also, I saw that lightgbm requires numpy, scipy, scikit-learn, and wheel. I use the latest versions of numpy, scipy, and scikit-learn. But when I try to upgrade "wheel", a new error occurs:

(venv) C:\Users\USER\pythonProject>pip install wheel==0.37.0
Collecting wheel==0.37.0
  Downloading wheel-0.37.0-py2.py3-none-any.whl (35 kB)
Installing collected packages: wheel
  Attempting uninstall: wheel
    Found existing installation: wheel 0.36.2
    Not uninstalling wheel at c:\users\user\appdata\local\programs\python\python37\lib\site-packages, outside environment c:\users\user\pythonproject\venv
    Can't uninstall 'wheel'. No files were found to uninstall.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-nightly-gpu 2.5.0.dev20210223 requires gast==0.4.0, but you have gast 0.3.3 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires grpcio~=1.34.0, but you have grpcio 1.32.0 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires h5py~=3.1.0, but you have h5py 2.10.0 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.0 which is incompatible.
tensorflow 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow 2.4.1 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.0 which is incompatible.
tensorflow-gpu 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow-gpu 2.4.1 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.0 which is incompatible.
Successfully installed wheel-0.37.0

(venv) C:\Users\USER\pythonProject>

I don't understand what is happening this time...
Any idea? 

Thanks a lot..!
shiyu1994 commented 2 years ago

@sofiavlachou28 Thanks for using LightGBM! Ranking objectives in LightGBM use label_gain_ to store the gain of each label value. By default, label_gain_[i] = (1 << i) - 1. So the default label gain only works with a maximum label value 31. It seems that your dataset contains label values greater than 31. So please specify your customized label_gain, as @guolinke mentioned.
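For example (a minimal sketch, not from the original comment; the maximum label value of 72 is taken from the error message), passing a longer label_gain to LGBMRanker might look like this:

```python
import lightgbm as lgb

# assumption: integer relevance labels go up to 72, so label_gain needs
# at least 73 entries (one gain value per possible label 0..72)
max_label = 72
gbm = lgb.LGBMRanker(
    objective="lambdarank",
    label_gain=list(range(max_label + 1)),  # simple linear gain instead of the default 2**i - 1
)
```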

sofiavlachou28 commented 2 years ago


Okay. I will try it with this parameter. I hope it works! Thanks for your help!

jameslamb commented 2 years ago

Thanks for providing the dataset @sofiavlachou28 .

Looking at the data, I have some observations and a suggestion.

The label for a learning-to-rank problem is expected to be a "relevance score", explaining how relevant one document is compared to another. (see this Stack Overflow answer for a concise explanation).

If you set label_gain based on the maximum value in y_train as suggested in https://github.com/microsoft/LightGBM/issues/4808#issuecomment-973991963, model training might run without throwing any errors, but (making some assumptions about your data based on column names) I don't think the resulting model will be what you intended.

It seems you're using column nwords for the label, which I assume is "number of words in the post". If you want to use LightGBM to predict the number of words in a document based on how popular it was (likes, comments), I recommend treating that as a regression problem and using LGBMRegressor, not LGBMRanker.
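For reference, a minimal sketch of that regression setup (reusing the column names from your script; everything else is illustrative and untested against your file):

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# assumption: Posts.csv is the file shared above, read as in your script
df = pd.read_csv("Posts.csv", sep=";", encoding="utf-8",
                 names=["Frequency", "Comments", "Likes", "Nwords"], header=0)

X = df[["Frequency", "Comments", "Likes"]]
y = df["Nwords"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

reg = lgb.LGBMRegressor()
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)
```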

One other suggestion: I noticed the full dataset has only 91 rows, even before holding out some data for validation.

sample code (click me)

```python
import lightgbm as lgb
import pandas as pd

data_url = "https://github.com/microsoft/LightGBM/files/7569237/Posts.csv"
feature_names = ["Frequency", "Comments", "Likes", "Nwords"]
df = pd.read_csv(
    filepath_or_buffer=data_url,
    delimiter=";",
    encoding="utf-8",
    names=feature_names,
    delim_whitespace=False,
    header=0
)
df.shape
```

LightGBM has a few parameters to limit model complexity, whose defaults are set to work well with medium-sized datasets (1000s of observations). If you want LightGBM to learn from 92 observations, consider setting a very small value (like 2) for parameter min_data_in_leaf (link).
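A hedged sketch of what that might look like with the ranker setup from your script (the parameter value is chosen only for illustration):

```python
import lightgbm as lgb

# assumption: X_train, y_train and query_train are defined as in the script above
gbm = lgb.LGBMRanker(
    objective="lambdarank",
    min_child_samples=2,  # scikit-learn alias for min_data_in_leaf; lets a ~90-row dataset produce splits
)
gbm.fit(X_train, y_train, group=query_train)
```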

sofiavlachou28 commented 2 years ago


Thank you so much for your reply! It is very helpful! I will look at your observations more carefully and test my data again, as you suggest. I hope it works. If not, I will open the topic again!

Have a good day! Sofia

no-response[bot] commented 2 years ago

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

sofiavlachou28 commented 2 years ago

Hello to everyone!

After some research and a few hours of hacking, my code is still not working. I cannot find information anywhere on how to set label_gain as you suggested above!

My task is to find the most popular product(s) each time based on likes, comments, frequency, and so on, and I want to do this with ranking.

Here is my code! I am new to Python ! Can anyone help me??

# 1) Load Dependencies
import pandas as pd
import numpy as np
from numpy import unique
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

import lightgbm as lgb

# Core Model
gbm = lgb.LGBMRanker(objective="lambdarank", )

# 2) Load the Data
# Define Columns
names = ["memespostsfrequency","comments","likes","nwords"]

data = pd.read_csv("InstaPosts.csv", encoding="utf-8", sep=";", delimiter=None,
                 names=names, delim_whitespace=False,
                 nrows=181,header=0, engine="python")
X = data.values[:,0:2]
y = data.values[:,3]

# 3) Define the Training Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

query_train = [X_train.shape[0]]
query_val = [X_val.shape[0]]
query_test = [X_test.shape[0]]

# 4) Model Fit
gbm.fit(X_train,
        y_train,
        group=query_train,
        eval_set=[(X_val, y_val)],
        eval_group=[query_val], #values=[label_gain(1,5)],
        eval_metric=["ndcg"],
        eval_at=[3],
        early_stopping_rounds=10)

# 5) Predictions
test_pred = gbm.predict(X_test)
X_test["predicted_ranking"] = test_pred
X_test.sort_values("predicted_ranking", ascending=False)

And here is the error:

C:\Users\USER\pythonProject\venv\Scripts\python.exe "C:/Users/USER/pythonProject/Memes LightGBM Algorithm.py"
[LightGBM] [Fatal] Label 72 is not less than the number of label mappings (31)
Traceback (most recent call last):
  File "C:/Users/USER/pythonProject/Memes LightGBM Algorithm.py", line 87, in <module>
    early_stopping_rounds=10)
  File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\sklearn.py", line 1071, in fit
    categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
  File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\sklearn.py", line 758, in fit
    callbacks=callbacks
  File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\basic.py", line 2613, in __init__
    ctypes.byref(self.handle)))
  File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)

Process finished with exit code 1

Thank you in advance!! Sofia V.

thaisalmeida commented 2 years ago

I had the same problem with label mappings when I tried to use LGBMRanker with Optuna. I got them to work well together by following this example. As suggested, I set the label_gain parameter as:

[i for i in range(max(y_train.max(), y_valid.max()) + 1)]
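For example, plugging that into the constructor might look roughly like this (a sketch; the label arrays here are hypothetical stand-ins for your y_train and y_valid):

```python
import numpy as np
import lightgbm as lgb

# hypothetical integer relevance labels (stand-ins for y_train / y_valid)
y_train = np.array([0, 3, 47, 12])
y_valid = np.array([5, 30, 2])

# label_gain must have more entries than the largest label value
label_gain = [i for i in range(max(y_train.max(), y_valid.max()) + 1)]
ranker = lgb.LGBMRanker(objective="lambdarank", label_gain=label_gain)
```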

I hope it can also help you!

Best,

sofiavlachou28 commented 2 years ago


@thaisalmeida Thanks for your reply!

I don't understand how to set label_gain in my code... it still raises an error: lightgbm.basic.LightGBMError: Label 47 is not less than the number of label mappings (31)

Can you share your code? Or, if you'd rather not, can you help me set this parameter in my code below?

Thank you in advance :) !!

Here is my New code:

# Dependencies
import pandas as pd
from pandas import set_option
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from pandas import read_csv
import numpy as np
from numpy import unique
from sklearn import metrics
# LGBMRanker
import lightgbm as lgb
from lightgbm import LGBMRanker
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Load data
names = ["label","id","comments","likes","product 1 frequency","product 2 frequency"]
dataset = pd.read_csv("Ranking problem.csv", names=names, encoding="utf-8", error_bad_lines=True,
                       skip_blank_lines=True, sep=",", delimiter=None, doublequote=True, keep_default_na=True,
                       nrows=1223, header=6, engine="python")

# Shape
print(dataset.shape)

# Max labels
max_label = dataset.label.nunique()
print(max_label)

# Core Model
gbm = lgb.LGBMRanker(objective="lambdarank", )

# label_gain=np.arange(1, max_label+1)

# Split the data in train and test
array = dataset.values
X = array [:,0:4]
y = array [:,5]
X = X.astype('int64')
y = LabelEncoder().fit_transform(y.astype('str'))

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    train_size=0.75, test_size=0.25,random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# define search
model = lgb.LGBMRanker()

# perform the search
model.fit(X, y,
          group=[400,400,423])

# 5) Predictions
test_pred = gbm.predict(X_test)
X_test["predicted_ranking"] = test_pred
X_test.sort_values("predicted_ranking", ascending=False)
jameslamb commented 2 years ago

@sofiavlachou28 Thanks for your interest in LightGBM!

I wrote up a learning-to-rank example tonight to hopefully answer this and other issues you've opened regarding LGBMRanker in the Python package (#5297, #5283).


label_gain

As described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective:

label_gain can be used to set the gain (weight) of int label and all values in label must be smaller than number of elements in label_gain

And as described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#label_gain

...only used in lambdarank application


group parameter

As described in https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#lightgbm.LGBMRanker.fit

Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.

This parameter is necessary to tell LightGBM which collections of rows in the training data represent documents from the same "query". If you aren't literally working with search engine data (where you have a list of results returned by a single search), you might define "query" as, for example, "all movie ratings created by one user".
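For example, a minimal sketch of building group from a query-id column (the toy DataFrame and the column name query_id are illustrative assumptions, not from this issue):

```python
import pandas as pd

# a toy frame with one row per document and a "query_id" column
df = pd.DataFrame({
    "query_id": ["q1", "q1", "q2", "q2", "q2", "q3"],
    "feature":  [0.1, 0.4, 0.2, 0.9, 0.3, 0.5],
})

# sort so that all rows belonging to a query are contiguous, as LightGBM expects
df = df.sort_values("query_id", ignore_index=True)

# one count per query, in order of appearance -> [2, 3, 1]
group = df["query_id"].value_counts(sort=False).values
```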


Sample Code

I created the following example using Python 3.8.12 and lightgbm installed from source from the latest commit on master (https://github.com/microsoft/LightGBM/commit/9489f878b3568e70b441e5df602483e116f24cc6).

The example below uses LightGBM to build a learning-to-rank model to learn how users in the MovieLens-100K dataset rated different movies.

import io
import os
import zipfile
import lightgbm as lgb
import pandas as pd
import requests
from scipy.stats import spearmanr

def load_movielens(local_dir) -> pd.DataFrame:
    data_url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
    if not os.path.isdir(local_dir):
        print(f"creating directory '{local_dir}' to store movielens dataset")
        os.mkdir(local_dir)

    zip_file = zipfile.ZipFile(io.BytesIO(requests.get(data_url).content), "r")

    zip_file.extract(
        member="ml-100k/u.data",
        path="data/"
    )
    rating_df = pd.read_csv(
        "data/ml-100k/u.data",
        sep="\t",
        header=None,
        names=["user_id", "item_id", "rating", "timestamp"]
    )
    zip_file.extract(
        member="ml-100k/u.user",
        path="data/"
    )
    user_df = pd.read_csv(
        "data/ml-100k/u.user",
        sep="|",
        encoding="latin-1",
        header=None,
        names=["user_id", "age", "gender", "occupation", "zip_code"]
    )
    zip_file.extract(
        member="ml-100k/u.item",
        path="data/"
    )
    item_df = pd.read_csv(
        "data/ml-100k/u.item",
        sep="|",
        encoding="latin-1",
        header=None,
        names=[
            "movie_id",
            "movie_title",
            "release_date",
            "video_release_date",
            "imdb_url",
            "genre=unknown",
            "genre=Action",
            "genre=Adventure",
            "genre=Animation",
            "genre=Childrens",
            "genre=Comedy",
            "genre=Crime",
            "genre=Documentary",
            "genre=Drama",
            "genre=Fantasy",
            "genre=Film_Noir",
            "genre=Horror",
            "genre=Musical",
            "genre=Mystery",
            "genre=Romance",
            "genre=Sci_Fi",
            "genre=Thriller",
            "genre=War",
            "genre=Western"
        ]
    )
    out_df = rating_df.merge(
        right=user_df,
        how="left",
        on=["user_id"],
        suffixes=("_rating", "_user")
    )
    out_df = out_df.merge(
        right=item_df,
        how="left",
        left_on=["item_id"],
        right_on=["movie_id"],
        suffixes=(None, "_movie")
    )
    # drop join keys and other unnecessary columns
    out_df.drop(["imdb_url", "item_id", "movie_id", "movie_title", "video_release_date", "zip_code"], axis=1, inplace=True)
    out_df = out_df.sort_values(["user_id"], ignore_index=True)

    # LightGBM assumes rankings begin at 0, but these ratings go from 1 to 5
    rating = out_df.pop("rating").values - 1

    # use "user_id" to group queries
    user_id = out_df.pop("user_id")
    group = user_id.value_counts(sort=False).values

    return out_df, rating, group

# get movielens data
X, y, g = load_movielens("data")

# collapse 1-hot-encoded genre into 1 feature
genre_columns = [c for c in X.columns if c.startswith("genre")]
X["movie_genre"] = X[genre_columns].head().idxmax(1)
X.drop(genre_columns, axis=1, inplace=True)

# create a "movie age" feature
X["movie_age_when_rated"] = (
    pd.to_datetime(X["timestamp"], unit="s") -
    pd.to_datetime(X["release_date"])
) / pd.Timedelta(days=1)
X.drop(["timestamp", "release_date"], axis=1, inplace=True)

# convert "object" columns to unordered categories
for col in X.columns:
    if pd.api.types.is_object_dtype(X[col]):
        X[col] = pd.Categorical(X[col])

Looking at the shape of these objects may be informative.

The features include some characteristics of the reviewer and some characteristics of the movies.

print(X.head().to_markdown())
|    |   age | gender   | occupation   | movie_genre   |   movie_age_when_rated |
|---:|------:|:---------|:-------------|:--------------|-----------------------:|
|  0 |    24 | M        | technician   | genre=Crime   |               1362.16  |
|  1 |    24 | M        | technician   | genre=Western |               2133.31  |
|  2 |    24 | M        | technician   | genre=Action  |               6841.15  |
|  3 |    24 | M        | technician   | genre=Comedy  |                362.215 |
|  4 |    24 | M        | technician   | genre=Action  |               1362.16  |

The target is integer ratings from 0 to 4 (where 0 is very bad and 4 is very good).

y[:10]
# array([4, 3, 4, 4, 3, 2, 3, 3, 3, 3])

And g (passed as group) groups all ratings from one user together as one "query".

g[:10]
# array([272,  62,  54,  24, 175, 211, 403,  59,  22, 184])

This says "the first 272 rows in X are one query, then next 62 rows are another query, etc.".

Given data in this format, LGBMRanker can be used to fit a learning-to-rank model.

rnk = lgb.LGBMRanker(
    n_estimators=100,
)
rnk.fit(X=X, y=y, group=g)

To check the in-sample fit, you can use something like Spearman correlation, which checks how well the ordering of the predicted scores matches the actual ratings.

round(spearmanr(y, rnk.predict(X)).correlation, 5)
# 0.21626

In the Lambdarank application, LightGBM doesn't give equal weight to all positions in the ranking. For example, it will give higher preference to splits that help it choose correctly between the 1st and 2nd most relevant items than to splits that help it choose correctly between the 4th and 5th most relevant items.

This is where the label_gain parameter comes in. That parameter describes how much more importance LightGBM places on the ordering of different items.

For example, in this dataset with 5 possible ratings, something like the following...

label_gain = [1, 2, 4, 8, 16]

says "correctly ordering the first and second most relevant items is twice as important as correctly ordering the second and third most relevant items".

I encourage you to try with different values of this parameter, like this:

rnk = lgb.LGBMRanker(
    n_estimators=100,
    label_gain=[1, 2, 4, 8, 16]
)
rnk.fit(X=X, y=y, group=g)

I hope these examples help! I am going to close and lock #5297 and #5283. If you have other questions about this topic, please ask here.

If you have questions about other LightGBM topics, please open new issues and provide all the information asked for in the issue template.


cc @shiyu1994 @StrikerRUS @ffineis please correct me if anything I've said above is imprecise or incorrect

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

sathyarr commented 1 year ago

@jameslamb What should we do if there is no constraint on the importance, or if all labels have equal importance? Should we use label_gain=[1, 1, 1, 1, 1] or label_gain=[1, 2, 3, 4, 5]?

Also, let's say there is a variable number of items in each group. What should the length of label_gain be? Equal to the maximum length of any group? For example, in my case the maximum group length could be 10000.

any help is appreciated, thanks

lukav27 commented 1 year ago

@sathyarr I think you are correct that label_gain=[1, 1, 1, 1, 1] represents equal importance of all the items we are ranking.

About your second question: I think label_gain refers to the number of label values (label mappings), not the number of items being rated, based on the error raised in this issue: "Label x is not less than the number of label mappings (y)". So if you have n possible ratings in your dataset, you should have at least the same number of values in the label_gain list, irrespective of the number of items being rated. I can confirm it works when the number of items is higher than the number of label_gain values.

I think the misunderstanding comes from this:

says "correctly ordering the first and second most relevant items is twice as important as correctly ordering the second and third most relevant items".

and it is more correct to say: correctly labeling items with the first or second highest score is twice as important as correctly labeling items with the second or third highest score. This makes sense, since the model returns a score rather than a direct ordering, and multiple items can have the same value (at least in the training set).
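To illustrate that point, a small synthetic sketch (hypothetical data, not from this thread): the required length of label_gain tracks the largest label value, not the number of items per group.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# one query containing 10000 items, but labels only take the values 0..4
X = rng.normal(size=(10000, 5))
y = rng.integers(0, 5, size=10000)
group = [10000]

# label_gain only needs to cover label values 0..4, regardless of group size
rnk = lgb.LGBMRanker(label_gain=[0, 1, 3, 7, 15])
rnk.fit(X, y, group=group)
```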

I would be very grateful if @jameslamb or someone with a greater understanding of the model than me could confirm or deny this, thanks.

sathyarr commented 1 year ago

Thanks for the comment @lukav27, that makes sense. Let's wait for any contributor comments! 🙂

jameslamb commented 10 months ago

Re-opening this since there are unanswered questions, but I personally would need to do some research before providing an answer.