microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Lift restrictions on feature names ("LightGBMError: Do not support special JSON characters in feature name") #6202

Closed fingoldo closed 2 months ago

fingoldo commented 11 months ago

Summary

Currently, it can be hard to plug LightGBM into an existing ML system because it is strict about feature names. Underscores, commas, dots, square brackets or even non-English language symbols trigger "LightGBMError: Do not support special JSON characters in feature name." Often it's hard to even tell which name LightGBM is complaining about; you have to scroll through thousands of features to figure out which one is named "wrongly".

I know for sure that the naming of the features should have no influence on the model training process. It would be great if this limitation could be lifted.

Motivation

This limitation is very cumbersome. I am not aware of any other machine learning library that imposes such restrictions. Often features come in groups, and it's convenient to use underscores and dots/brackets for separation, for example "[bioteam].[physio].prevweak_velocity_mean". Without the ability to group, in practice, feature names quickly become lengthy and totally unreadable.

Similarly, commas are often used to separate a unit from the name: "distance, km".

Or the dataset comes in some national language, be it Chinese, French, or Russian, and stakeholders would love to see features in their native language. We have UTF, let's use it and work on allowing arbitrary feature names. Let's not limit the creativity of data scientists! )

Description

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

nsamples = 50

# The comma in "distance, km" is what triggers the error below.
X_train = pd.DataFrame(
    data=np.random.random(size=(nsamples, 4)),
    columns=["a", "b", "c", "distance, km"],
)
est = LGBMClassifier(verbose=0)
est.fit(X_train, np.random.randint(0, 2, size=nsamples))

C:\ProgramData\Anaconda3\lib\site-packages\lightgbm\basic.py in _safe_call(ret)
    240     """
    241     if ret != 0:
--> 242         raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
    243
    244

LightGBMError: Do not support special JSON characters in feature name.

I don't know the technical reasons for this, but I can't find any logical reason to have this limitation.

Environment:

  • Python==3.8
  • lightgbm==4.1.0
  • OS==Windows
  • Locale=Russian

--

jameslamb commented 11 months ago

Thanks for using LightGBM.

Please provide a reproducible example showing exactly how you hit this error and describing what you expected to happen. The suggestion in your submission that non-ASCII characters or feature names with _ cause this error is not correct, at least for the Python and R packages as of LightGBM v4.0.0.

For example, consider the following code:

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_features=2)

feature_names = [
    "name_with_underscores",
    # what Google translate provides in Chinese for "feature"
    "特征"
]

dtrain = lgb.Dataset(X, label=y, feature_name=feature_names)
model = lgb.train(
    train_set=dtrain,
    params={"objective": "regression"},
    num_boost_round=5
)
model.feature_name()
# ['name_with_underscores', '特征']

Using lightgbm==4.1.0 and Python 3.11.5 on macOS, I see training succeed and LightGBM preserve feature names with _ and non-ASCII characters successfully. My machine's locale is set to en_US.UTF-8.

If you're unfamiliar with how to create reproducible examples when asking for software help, this guide is useful: https://stackoverflow.com/help/minimal-reproducible-example. You could try modifying the example I've given here, and providing the following (which were asked for in the issue template when you clicked New Issue):

  • how you are using LightGBM (R package? Python package? CLI)
  • LightGBM version?
  • how did you install LightGBM?
  • operating system?
  • and since this question is related to encoding of strings... locale?

fingoldo commented 11 months ago

Sorry, I wasn't sure what the trigger was in my case; it turns out it was a comma. I have updated my main post with a reproducible example. Stack Overflow is full of suggestions to remove anything non-ASCII, though.

jameslamb commented 11 months ago

Ah ok, I see you've edited this since it was initially posted to include some more examples.

dots/brackets for separation... for example [bioteam].[physio].prevweak_velocity_mean

commas ... distance, km

Yes, characters like that are not allowed in feature names. You can search the repo for that error message and find the corresponding code here:

https://github.com/microsoft/LightGBM/blob/18dbd65e57995618ee2a8b1f7e4cb0df1f9c6333/include/LightGBM/dataset.h#L889-L892

which calls this:

https://github.com/microsoft/LightGBM/blob/18dbd65e57995618ee2a8b1f7e4cb0df1f9c6333/include/LightGBM/utils/common.h#L886-L902

You can see that it is specifically a very small subset of characters that are forbidden in feature names.
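For anyone trying to work out which of thousands of column names is being rejected, a small pre-check along these lines may help. It is only a sketch: the character set below is an assumption based on a reading of the linked common.h at that commit (double quote, comma, colon, square brackets, curly braces), and `find_invalid_feature_names` is a hypothetical helper, not part of LightGBM. Check the set against the version you have installed.

```python
# Assumption: mirrors the characters rejected by the check in the linked
# common.h at that commit; verify against your own LightGBM version.
FORBIDDEN_CHARS = frozenset('",:[]{}')


def find_invalid_feature_names(names):
    """Return {name: sorted list of forbidden characters found in it}."""
    problems = {}
    for name in names:
        bad = sorted(set(str(name)) & FORBIDDEN_CHARS)
        if bad:
            problems[name] = bad
    return problems


columns = [
    "a",
    "name_with_underscores",
    "特征",
    "distance, km",
    "[bioteam].[physio].prevweak_velocity_mean",
]

# Print only the names that would be rejected, with the offending characters.
for name, bad in find_invalid_feature_names(columns).items():
    print(f"{name!r}: {bad}")
```

Under that assumption, underscores, dots and non-ASCII characters pass; only the comma and the square brackets in the last two names are flagged.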

I can't find any logical reason to have this limitation

LightGBM supports reading training data from TSV (tab-separated), CSV (comma-separated), and LibSVM formats.

It also writes out model data (including feature names) to JSON and to a LightGBM-specific text format.

Characters that are used in encoding/decoding such data, like , to separate columns in CSV or ] to indicate the end of an array in JSON, can break parsers unless they're escaped.

To prevent having to worry about such problems in LightGBM, the library prohibits those characters. We feel that's a small inconvenience in exchange for the reduction in maintenance burden and other sources of user pain (like anything parsing LightGBM model files needing to also account for such escaping).
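The parsing problem described above can be reproduced with the Python standard library alone. This is an illustrative sketch, not LightGBM's own reader or writer: it compares a naive comma-joined header with one written by an escaping-aware CSV writer, and shows hand-built JSON failing on an unescaped quote.

```python
import csv
import io
import json

names = ["a", "b", "distance, km"]

# Naive writer: join with commas, no quoting or escaping.
naive_header = ",".join(names)
parsed = next(csv.reader(io.StringIO(naive_header)))
print(len(names), len(parsed))  # the comma inside the name splits one column in two

# Escaping-aware writer: the csv module quotes the field containing a comma,
# so the names survive a round trip.
buf = io.StringIO()
csv.writer(buf).writerow(names)
round_tripped = next(csv.reader(io.StringIO(buf.getvalue())))
print(round_tripped == names)

# Hand-built JSON with an unescaped double quote in a name is not valid JSON.
quoted_name = 'say "hi"'
hand_built = '["' + quoted_name + '"]'
try:
    json.loads(hand_built)
    hand_built_ok = True
except json.JSONDecodeError:
    hand_built_ok = False
print(hand_built_ok)

# A real JSON encoder escapes the quote, so the round trip works.
print(json.loads(json.dumps([quoted_name])) == [quoted_name])
```

The trade-off in the thread is exactly this: either every reader and writer of model files handles escaping, or the characters are rejected up front.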


When you say "lift restrictions", which of these behaviors would you prefer LightGBM took on?

  • replacing those characters with something else like _
  • preserving those characters and dealing with escaping
  • ignoring all feature names if any have such characters, but continuing training
  • something else

I'd welcome a PR to improve this error message ("special JSON characters" is not very informative), but before we commit to any other change I'd like to hear your thoughts on how you'd prefer LightGBM handle this situation.

jameslamb commented 11 months ago

Stack overflow is full of suggestions to remove anything non-ascii though

I'm sorry that you found that answer that implied that non-ASCII feature names were an issue. Non-ASCII feature names have been supported in LightGBM since April 2020. For example, here's a post from another LightGBM maintainer back in 2021 about a similar question: https://github.com/microsoft/LightGBM/issues/2478#issuecomment-797145049

fingoldo commented 11 months ago

Thank you so much for such a fast and informative answer! It's actually not a "small inconvenience" at all when you try to add LightGBM to existing models (that all work with established feature names without any questions), and it breaks :-)

IMHO the best way to deal with such characters would be to escape them inside LightGBM (transparently to the user). I guess other libraries do that, since no one else restricts feature names.

The ideal scenario would be compatibility with other ML libs, i.e., no restrictions and no renaming. But you are right: breaking training, especially without giving the exact reason, is the worst of all evils. So the second-best option for me would be to allow training with any feature names and to warn that particular features were renamed. That doesn't sound good either, though, as it means incompatibility with the existing pipeline after saving/loading. From the user's perspective, renaming is also not great because it makes things like feature importance charts inconsistent when presenting models to stakeholders.
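Until something like this exists in the library, the renaming option can be approximated on the user side while keeping the original names for reporting. The following is a rough sketch under stated assumptions: `make_safe_names` is a hypothetical helper (not a LightGBM API), and the forbidden-character set is assumed from the code linked earlier in the thread.

```python
import re

# Assumption: the characters LightGBM rejects, per the code linked earlier
# in the thread; verify against your installed version.
_FORBIDDEN = re.compile(r'[",:\[\]{}]')


def make_safe_names(names, replacement="_"):
    """Return (safe_names, safe_to_original), keeping safe names unique."""
    safe_names = []
    safe_to_original = {}
    for name in names:
        base = _FORBIDDEN.sub(replacement, str(name))
        safe, suffix = base, 1
        # Two different originals may sanitize to the same string;
        # append a counter until the safe name is unique.
        while safe in safe_to_original:
            safe = f"{base}{replacement}{suffix}"
            suffix += 1
        safe_to_original[safe] = name
        safe_names.append(safe)
    return safe_names, safe_to_original


original = ["a", "distance, km", "distance_ km"]
safe, back = make_safe_names(original)
print(safe)
print([back[s] for s in safe] == original)
```

The model would be trained on a frame whose columns are set to `safe`, and `back` could then be used to translate names returned by `model.feature_name()` before plotting, so stakeholders still see the original labels. This does not address the save/load incompatibility raised above; the mapping has to be stored alongside the model.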

StrikerRUS commented 2 months ago

Closed in favor of tracking this in https://github.com/microsoft/LightGBM/issues/2302. We decided to keep all feature requests in one place.

Contributions for this feature are welcome! Please re-open this issue (or post a comment if you are not the original poster) if you are actively working on implementing this feature.