Open sofiavlachou28 opened 3 years ago
Hi @sofiavlachou28, thanks very much for using LightGBM!
I'd be happy to help you, but we need a little more information.
- What version of lightgbm are you using and how did you install it?
- Are you able to provide access to the raw data ("Posts.csv"), or to replicate this problem using randomly-created data?
@sofiavlachou28 please refer to parameter label_gain
@jameslamb I use version 3.3.1. I also uploaded my CSV dataset to make things clearer: Posts.csv
Thanks for your time!! :)
@jameslamb Also, I saw that lightgbm requires numpy, scipy, scikit-learn, and wheel. I use the latest versions of numpy, scipy, and sklearn. But when I try to upgrade "wheel", a new error occurs:
(venv) C:\Users\USER\pythonProject>pip install wheel==0.37.0
Collecting wheel==0.37.0
Downloading wheel-0.37.0-py2.py3-none-any.whl (35 kB)
Installing collected packages: wheel
Attempting uninstall: wheel
Found existing installation: wheel 0.36.2
Not uninstalling wheel at c:\users\user\appdata\local\programs\python\python37\lib\site-packages, outside environment c:\users\user\pythonproject\venv
Can't uninstall 'wheel'. No files were found to uninstall.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-nightly-gpu 2.5.0.dev20210223 requires gast==0.4.0, but you have gast 0.3.3 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires grpcio~=1.34.0, but you have grpcio 1.32.0 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires h5py~=3.1.0, but you have h5py 2.10.0 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tf-nightly-gpu 2.5.0.dev20210223 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.0 which is incompatible.
tensorflow 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow 2.4.1 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.0 which is incompatible.
tensorflow-gpu 2.4.1 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
tensorflow-gpu 2.4.1 requires typing-extensions~=3.7.4, but you have typing-extensions 3.10.0.0 which is incompatible.
Successfully installed wheel-0.37.0
(venv) C:\Users\USER\pythonProject>
I don't understand what is happening here...
Any idea?
Thanks a lot..!
@sofiavlachou28 Thanks for using LightGBM! Ranking objectives in LightGBM use label_gain_ to store the gain of each label value. By default, label_gain_[i] = (1 << i) - 1. So the default label gain only works for label values less than 31. It seems that your dataset contains label values of 31 or more. So please specify your customized label_gain, as @guolinke mentioned.
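To make the limit concrete, here is a minimal sketch in plain Python (not LightGBM's own code) that rebuilds the default gain table from the formula quoted above and checks a set of labels against it. The table size of 31 is taken from the error message reported in this thread, and the helper name `labels_fit_default_gain` is made up for illustration.

```python
# Sketch only: mirrors the formula label_gain_[i] = (1 << i) - 1 quoted above.
# The size 31 is an assumption taken from the error text "label mappings (31)".
NUM_DEFAULT_MAPPINGS = 31

default_label_gain = [(1 << i) - 1 for i in range(NUM_DEFAULT_MAPPINGS)]


def labels_fit_default_gain(labels):
    """Return True if every label can be used as an index into the default table."""
    return all(0 <= int(label) < NUM_DEFAULT_MAPPINGS for label in labels)


print(default_label_gain[:6])                # [0, 1, 3, 7, 15, 31]
print(labels_fit_default_gain([0, 3, 30]))   # True
print(labels_fit_default_gain([5, 72]))      # False
```

A label such as 72 has no entry in this table, which is why a longer, user-supplied label_gain is needed for such data.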
Okay. I will try it with this parameter. I hope it works! Thanks for your help!
Thanks for providing the dataset @sofiavlachou28.
Looking at the data, I have some observations and a suggestion.
The label for a learning-to-rank problem is expected to be a "relevance score", explaining how relevant one document is compared to another (see this Stack Overflow answer for a concise explanation).
If you set label_gain to the maximum value in y_train as suggested in https://github.com/microsoft/LightGBM/issues/4808#issuecomment-973991963, model training might run without throwing any errors, but (making some assumptions about your data based on column names) I don't think the model generated will be what you intended.
It seems you're using column nwords for the label, which I assume is "number of words in the post". If you want to use LightGBM to predict the number of words in a document based on how popular it was (likes, comments), I recommend treating that as a regression problem and using LGBMRegressor, not LGBMRanker.
One other suggestion... I noticed the full dataset has only 91 rows, even before holding out some data for validation.
LightGBM has a few parameters to limit model complexity, whose defaults are set to work well with medium-sized datasets (1000s of observations). If you want LightGBM to learn from so few observations, consider setting a very small value (like 2) for parameter min_data_in_leaf.
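A rough back-of-the-envelope check shows why the default matters on a dataset this small. The sketch below assumes the documented default of 20 for min_data_in_leaf and uses a hypothetical helper, `max_possible_leaves`. It gives only an upper bound on the leaves in one tree (every leaf must hold at least min_data_in_leaf rows); real trees are further limited by other parameters.

```python
def max_possible_leaves(n_rows, min_data_in_leaf):
    """Upper bound on leaves in one tree: each leaf needs at least min_data_in_leaf rows."""
    if min_data_in_leaf < 1:
        raise ValueError("min_data_in_leaf must be at least 1")
    return max(1, n_rows // min_data_in_leaf)


n_rows = 91  # row count mentioned above, before any train/validation split

print(max_possible_leaves(n_rows, 20))  # 4  (assumed default value)
print(max_possible_leaves(n_rows, 2))   # 45 (the small value suggested above)
```

After holding out validation and test data the training set is smaller still, so the bound under the default gets even tighter.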
Thank you so much for your reply! It is very helpful! I will look at your observations more carefully and test my data again, as you suggest. I hope it works. If not, I will reopen the topic!
Have a good day! Sofia
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
Hello to everyone!
After some research and a few hours of hacking, my code is still not working. I cannot find information anywhere on how to set label_gain as you suggested above!
My task is to find the most popular product(s) each time based on likes, comments, frequency, and so on, and I want to do this with ranking.
Here is my code! I am new to Python! Can anyone help me?
# 1) Load Dependencies
import pandas as pd
import numpy as np
from numpy import unique
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
import lightgbm as lgb
# Core Model
gbm = lgb.LGBMRanker(objective="lambdarank", )
# 2) Load the Data
# Define Columns
names = ["memespostsfrequency","comments","likes","nwords"]
data = pd.read_csv("InstaPosts.csv", encoding="utf-8", sep=";", delimiter=None,
                   names=names, delim_whitespace=False,
                   nrows=181, header=0, engine="python")
X = data.values[:,0:2]
y = data.values[:,3]
# 3) Define the Training Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
query_train = [X_train.shape[0]]
query_val = [X_val.shape[0]]
query_test = [X_test.shape[0]]
# 4) Model Fit
gbm.fit(X_train,
        y_train,
        group=query_train,
        eval_set=[(X_val, y_val)],
        eval_group=[query_val],  # values=[label_gain(1,5)],
        eval_metric=["ndcg"],
        eval_at=[3],
        early_stopping_rounds=10)
# 5) Predictions
test_pred = gbm.predict(X_test)
X_test["predicted_ranking"] = test_pred
X_test.sort_values("predicted_ranking", ascending=False)
And here is the bug:
C:\Users\USER\pythonProject\venv\Scripts\python.exe "C:/Users/USER/pythonProject/Memes LightGBM Algorithm.py"
[LightGBM] [Fatal] Label 72 is not less than the number of label mappings (31)
Traceback (most recent call last):
File "C:/Users/USER/pythonProject/Memes LightGBM Algorithm.py", line 87, in <module>
early_stopping_rounds=10)
File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\sklearn.py", line 1071, in fit
categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\sklearn.py", line 758, in fit
callbacks=callbacks
File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\engine.py", line 271, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\basic.py", line 2613, in __init__
ctypes.byref(self.handle)))
File "C:\Users\USER\pythonProject\venv\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)
Process finished with exit code 1
Thank you in advance!! Sofia V.
I had the same problem with label mappings when I tried to use LGBMRanker with Optuna. I got them to work well together by following this example. As suggested, I set the label_gain parameter as:
[i for i in range(max(y_train.max(), y_valid.max()) + 1)]
I hope it can also help you!
Best,
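For readers unsure where that list comes from, here is a minimal sketch in plain Python. The helper name `linear_label_gain` and the toy labels are made up for illustration. The resulting list is what would be passed as label_gain (for example, LGBMRanker(label_gain=...)), and a linear gain is only one possible choice.

```python
def linear_label_gain(*label_collections):
    """Build [0, 1, ..., max_label] so that every label value has a gain entry."""
    max_label = max(int(label) for labels in label_collections for label in labels)
    return list(range(max_label + 1))


# Toy integer labels standing in for y_train and y_valid.
y_train = [0, 12, 47, 3]
y_valid = [5, 72, 1]

label_gain = linear_label_gain(y_train, y_valid)

print(len(label_gain))   # 73
print(label_gain[-1])    # 72
```

The list must cover the labels of every dataset the model sees (training and validation), which is why both are passed in.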
@thaisalmeida Thanks for your reply!
I don't understand how to set label_gain in my code. It still raises an error: lightgbm.basic.LightGBMError: Label 47 is not less than the number of label mappings (31)
Can you share your code? Or, if you'd rather not, can you help me set this parameter in my code below?
Thank you in advance :) !!
Here is my new code:
# Dependencies
import pandas as pd
from pandas import set_option
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from pandas import read_csv
import numpy as np
from numpy import unique
from sklearn import metrics
# LGBMRanker
import lightgbm as lgb
from lightgbm import LGBMRanker
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
# Load data
names = ["label","id","comments","likes","product 1 frequency","product 2 frequency"]
dataset = pd.read_csv("Ranking problem.csv", names=names, encoding="utf-8", error_bad_lines=True,
                      skip_blank_lines=True, sep=",", delimiter=None, doublequote=True, keep_default_na=True,
                      nrows=1223, header=6, engine="python")
# Shape
print(dataset.shape)
# Max labels
max_label = dataset.label.nunique()
print(max_label)
# Core Model
gbm = lgb.LGBMRanker(objective="lambdarank", )
# label_gain=np.arange(1, max_label+1)
# Split the data in train and test
array = dataset.values
X = array [:,0:4]
y = array [:,5]
X = X.astype('int64')
y = LabelEncoder().fit_transform(y.astype('str'))
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75, test_size=0.25, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
model = lgb.LGBMRanker()
# perform the search
model.fit(X, y,
          group=[400, 400, 423])
# 5) Predictions
test_pred = gbm.predict(X_test)
X_test["predicted_ranking"] = test_pred
X_test.sort_values("predicted_ranking", ascending=False)
@sofiavlachou28 Thanks for your interest in LightGBM!
I wrote up a learning-to-rank example tonight to hopefully answer this and other issues you've opened regarding LGBMRanker in the Python package (#5297, #5283).
label_gain
As described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective:
label_gain can be used to set the gain (weight) of int label and all values in label must be smaller than number of elements in label_gain
And as described in https://lightgbm.readthedocs.io/en/latest/Parameters.html#label_gain
...only used in lambdarank application
group parameter
As described in https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#lightgbm.LGBMRanker.fit
Group/query data. Only used in the learning-to-rank task. sum(group) = n_samples. For example, if you have a 100-document dataset with group = [10, 20, 40, 10, 10, 10], that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
This parameter is necessary to tell LightGBM which collections of rows in the training data represent documents from the same "query". If you aren't literally working with search engine data (where you have a list of results returned by a single search), you might define "query" as, for example, "all movie ratings created by one user".
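The group description quoted above can be checked with a few lines of plain Python. This is only an illustrative sketch, and the helper name `group_row_ranges` is made up. It converts group sizes into 0-indexed (start, stop) row ranges, so the "records 11-30" in the quoted documentation correspond to rows 10 through 29 here.

```python
def group_row_ranges(group):
    """Turn group sizes into (start, stop) row ranges; 0-indexed, stop exclusive."""
    ranges = []
    start = 0
    for size in group:
        ranges.append((start, start + size))
        start += size
    return ranges


group = [10, 20, 40, 10, 10, 10]
n_samples = 100

# The sizes must add up to the number of rows in the dataset.
assert sum(group) == n_samples

print(group_row_ranges(group)[:3])  # [(0, 10), (10, 30), (30, 70)]
```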
I created the following example using Python 3.8.12 and lightgbm installed from source from the latest commit on master (https://github.com/microsoft/LightGBM/commit/9489f878b3568e70b441e5df602483e116f24cc6).
The example below uses LightGBM to build a learning-to-rank model to learn how users in the MovieLens-100K dataset rated different movies.
import io
import os
import zipfile
import lightgbm as lgb
import pandas as pd
import requests
from scipy.stats import spearmanr
def load_movielens(local_dir) -> pd.DataFrame:
    data_url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
    if not os.path.isdir(local_dir):
        print(f"creating directory '{local_dir}' to store movielens dataset")
        os.mkdir(local_dir)
    zip_file = zipfile.ZipFile(io.BytesIO(requests.get(data_url).content), "r")
    zip_file.extract(
        member="ml-100k/u.data",
        path="data/"
    )
    rating_df = pd.read_csv(
        "data/ml-100k/u.data",
        sep="\t",
        header=None,
        names=["user_id", "item_id", "rating", "timestamp"]
    )
    zip_file.extract(
        member="ml-100k/u.user",
        path="data/"
    )
    user_df = pd.read_csv(
        "data/ml-100k/u.user",
        sep="|",
        encoding="latin-1",
        header=None,
        names=["user_id", "age", "gender", "occupation", "zip_code"]
    )
    zip_file.extract(
        member="ml-100k/u.item",
        path="data/"
    )
    item_df = pd.read_csv(
        "data/ml-100k/u.item",
        sep="|",
        encoding="latin-1",
        header=None,
        names=[
            "movie_id",
            "movie_title",
            "release_date",
            "video_release_date",
            "imdb_url",
            "genre=unknown",
            "genre=Action",
            "genre=Adventure",
            "genre=Animation",
            "genre=Childrens",
            "genre=Comedy",
            "genre=Crime",
            "genre=Documentary",
            "genre=Drama",
            "genre=Fantasy",
            "genre=Film_Noir",
            "genre=Horror",
            "genre=Musical",
            "genre=Mystery",
            "genre=Romance",
            "genre=Sci_Fi",
            "genre=Thriller",
            "genre=War",
            "genre=Western"
        ]
    )
    out_df = rating_df.merge(
        right=user_df,
        how="left",
        on=["user_id"],
        suffixes=("_rating", "_user")
    )
    out_df = out_df.merge(
        right=item_df,
        how="left",
        left_on=["item_id"],
        right_on=["movie_id"],
        suffixes=(None, "_movie")
    )
    # drop join keys and other unnecessary columns
    out_df.drop(["imdb_url", "item_id", "movie_id", "movie_title", "video_release_date", "zip_code"], axis=1, inplace=True)
    out_df = out_df.sort_values(["user_id"], ignore_index=True)
    # LightGBM assumes rankings begin at 0, but these ratings go from 1 to 5
    rating = out_df.pop("rating").values - 1
    # use "user_id" to group queries
    user_id = out_df.pop("user_id")
    group = user_id.value_counts(sort=False).values
    return out_df, rating, group
# get movielens data
X, y, g = load_movielens("data")
# collapse 1-hot-encoded genre into 1 feature
genre_columns = [c for c in X.columns if c.startswith("genre")]
X["movie_genre"] = X[genre_columns].head().idxmax(1)
X.drop(genre_columns, axis=1, inplace=True)
# create a "movie age" feature
X["movie_age_when_rated"] = (
    pd.to_datetime(X["timestamp"], unit="s") -
    pd.to_datetime(X["release_date"])
) / pd.Timedelta(days=1)
X.drop(["timestamp", "release_date"], axis=1, inplace=True)
# convert "object" columns to unordered categories
for col in X.columns:
    if pd.api.types.is_object_dtype(X[col]):
        X[col] = pd.Categorical(X[col])
Looking at the shape of these objects may be informative.
The features include some characteristics of the reviewer and some characteristics of the movies.
print(X.head().to_markdown())
| | age | gender | occupation | movie_genre | movie_age_when_rated |
|---:|------:|:---------|:-------------|:--------------|-----------------------:|
| 0 | 24 | M | technician | genre=Crime | 1362.16 |
| 1 | 24 | M | technician | genre=Western | 2133.31 |
| 2 | 24 | M | technician | genre=Action | 6841.15 |
| 3 | 24 | M | technician | genre=Comedy | 362.215 |
| 4 | 24 | M | technician | genre=Action | 1362.16 |
The target is integer ratings from 0 to 4 (where 0 is very bad and 4 is very good).
y[:10]
# array([4, 3, 4, 4, 3, 2, 3, 3, 3, 3])
And group groups all ratings from one user together as one "query".
g[:10]
# array([272, 62, 54, 24, 175, 211, 403, 59, 22, 184])
This says "the first 272 rows in X are one query, the next 62 rows are another query, etc.".
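As a small illustration of where such an array comes from, the sketch below counts consecutive runs of query IDs in plain Python. It is not the code used in the example above: the helper name `group_sizes` and the toy IDs are hypothetical, and it assumes the rows have already been sorted so that each query's rows are adjacent.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Count consecutive runs of equal query IDs; a query's rows must be adjacent."""
    sizes = []
    seen = set()
    for query_id, run in groupby(query_ids):
        if query_id in seen:
            raise ValueError(f"rows for query {query_id!r} are not contiguous; sort first")
        seen.add(query_id)
        sizes.append(sum(1 for _ in run))
    return sizes


user_ids = [1, 1, 1, 2, 2, 3]  # toy data, already sorted by user
print(group_sizes(user_ids))   # [3, 2, 1]
```

The contiguity check matters because group sizes describe positions in the data, not IDs: if one user's rows are scattered, the sizes no longer line up with the rows.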
Given data in this format, LGBMRanker can be used to fit a learning-to-rank model.
rnk = lgb.LGBMRanker(
    n_estimators=100,
)
rnk.fit(X=X, y=y, group=g)
To check the in-sample fit, you can use something like the Spearman rank correlation, which checks how well the ordering of predicted scores matches the ordering of the actual ratings.
round(spearmanr(y, rnk.predict(X)).correlation, 5)
# 0.21626
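For intuition about what that number measures, here is a minimal pure-Python sketch of Spearman correlation: rank both inputs (ties share the average rank), then take the Pearson correlation of the ranks. It is an illustration rather than a replacement for scipy's spearmanr; the function names and toy data are made up, and it assumes neither input is constant.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


ratings = [4, 3, 4, 0, 1]               # toy true ratings
scores = [2.1, 0.3, 1.7, -1.2, -0.4]    # toy predicted scores
print(round(spearman(ratings, scores), 5))
```

Only the ordering of the scores matters here, which suits a ranker whose raw scores have no natural scale.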
In the Lambdarank application, LightGBM doesn't give equal weight to all positions in the ranking. For example, it will give higher preference to splits that help it choose correctly between the 1st and 2nd most relevant items than to splits that help it choose correctly between the 4th and 5th most relevant items.
This is where the label_gain parameter comes in. That parameter describes how much more importance LightGBM places on the ordering of different items.
For example, in this dataset with 5 possible ratings, something like the following...
label_gain = [1, 2, 4, 8, 16]
says "correctly ordering the first and second most relevant items is twice as important as correctly ordering the second and third most relevant items".
I encourage you to try with different values of this parameter, like this:
rnk = lgb.LGBMRanker(
    n_estimators=100,
    label_gain=[1, 2, 4, 8, 16]
)
rnk.fit(X=X, y=y, group=g)
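To see how a gain list and position interact, the sketch below computes the textbook NDCG metric in plain Python. It is an assumption-laden illustration, not LightGBM's internal code: the log2 position discount is the standard textbook choice, the function names are made up, and LightGBM's implementation may differ in details.

```python
from math import log2


def dcg(labels_in_ranked_order, label_gain, k=None):
    """Textbook DCG: sum of label_gain[label] / log2(position + 1), positions from 1."""
    top = labels_in_ranked_order[:k]
    return sum(label_gain[label] / log2(pos + 1) for pos, label in enumerate(top, start=1))


def ndcg(labels_in_ranked_order, label_gain, k=None):
    """DCG divided by the DCG of the ideal (highest-gain-first) ordering."""
    ideal = sorted(labels_in_ranked_order, key=lambda label: label_gain[label], reverse=True)
    best = dcg(ideal, label_gain, k)
    return dcg(labels_in_ranked_order, label_gain, k) / best if best > 0 else 0.0


label_gain = [1, 2, 4, 8, 16]    # gain for ratings 0..4, as in the example above
perfect = [4, 3, 2, 1, 0]        # true ratings listed in predicted order
swap_top = [3, 4, 2, 1, 0]       # top two items swapped
swap_bottom = [4, 3, 2, 0, 1]    # bottom two items swapped

print(ndcg(perfect, label_gain))                                    # 1.0
print(ndcg(swap_top, label_gain) < ndcg(swap_bottom, label_gain))   # True
```

Under this definition a mistake near the top of the list costs more than the same mistake near the bottom, both because the gains involved are larger and because the position discount is steeper there.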
I hope these examples help! I am going to close and lock #5297 and #5283. If you have other questions about this topic, please ask here.
If you have questions about other LightGBM topics, please open new issues and provide all the information asked for in the issue template.
cc @shiyu1994 @StrikerRUS @ffineis please correct me if anything I've said above is imprecise or incorrect
@jameslamb What should we do if there is no constraint on the importance, or if all items are equally important?
Should we use label_gain=[1, 1, 1, 1, 1] or label_gain=[1, 2, 3, 4, 5]?
Also, let's say there is a variable number of items in each group. What should the length of label_gain be? Equal to the max length of any group? E.g., in my case, the max length could be 10000.
Any help is appreciated, thanks.
@sathyarr I think you are correct that label_gain=[1, 1, 1, 1, 1] represents equal importance of all the items we are ranking.
On your second question, I think label_gain refers to the number of ratings (label mappings), not the number of items being rated, based on the error raised in this issue: Label x is not less than the number of label mappings (y). So if you have n ratings in your dataset, you should have at least the same number of values in the label_gain list, irrespective of the number of items being rated. I can confirm it works when the number of items is higher than the number of label_gain values.
I think the misunderstanding comes from this:
says "correctly ordering the first and second most relevant items is twice as important as correctly ordering the second and third most relevant items".
It is more correct to say: correctly labeling items as first or second highest score is twice as important as correctly labeling items as second or third highest score. This makes sense, since the model returns a score rather than a direct ordering, and multiple items can have the same value (at least in the training set).
I would be very grateful if @jameslamb or someone with a greater understanding of the model than me could confirm or deny this, thanks.
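The two points under discussion can be explored with the textbook DCG formula in plain Python. This is a sketch under stated assumptions, not a statement about LightGBM's internals: the log2 discount is the textbook choice, and the `dcg` helper and toy labels are made up for illustration.

```python
from math import log2


def dcg(labels_in_ranked_order, label_gain):
    """Textbook DCG: label_gain is indexed by label value, not by item position."""
    return sum(label_gain[label] / log2(pos + 1)
               for pos, label in enumerate(labels_in_ranked_order, start=1))


# One large group: 10,000 items, but only five distinct label values (0..4).
labels = [i % 5 for i in range(10_000)]

graded_gain = [1, 2, 3, 4, 5]
print(len(graded_gain) == max(labels) + 1)   # True: 5 entries cover 10,000 items

best_first = sorted(labels, reverse=True)
worst_first = sorted(labels)

# With a constant gain, every ordering gets the same textbook DCG ...
constant_gain = [1, 1, 1, 1, 1]
print(dcg(best_first, constant_gain) == dcg(worst_first, constant_gain))   # True

# ... while a graded gain rewards putting high labels first.
print(dcg(best_first, graded_gain) > dcg(worst_first, graded_gain))        # True
```

If the textbook formula is a fair guide, the length of label_gain follows the largest label value rather than the group size, and a constant gain leaves the metric unable to tell a good ordering from a bad one. Whether LightGBM behaves identically would still need confirmation from a maintainer.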
Thanks for the comment @lukav27 makes sense, let's wait for any contributor comments! 🙂
Re-opening this since there are unanswered questions, but I personally would need to do some research before providing an answer.
Hello to everyone!!
I am new to Python and I am getting this error when running LightGBM on a ranking problem:
lightgbm.basic.LightGBMError: Label 72 is not less than the number of label mappings (31)
I tried to search for this error but could not find many useful resources.
I can't work out where the error occurs. My dataset consists of 4 columns: ["Frequency","Comments", "Likes", "Nwords"] as seen below.
Can anyone help me??
Thank you in advance !!
Sofia