microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[CLI] Categorical: Read string and convert to int on the fly #789

Closed AbdealiLoKo closed 5 years ago

AbdealiLoKo commented 7 years ago

This is a feature request

It would be very useful to be able to read in Categorical values as strings ("abc", "def", etc) and convert that to integers internally.

This adds a bit of overhead, but it would be much easier for users. To keep the overhead optional, the conversion could be enabled with a flag, or we could automatically check whether the column contains any non-numeric characters.

limexp commented 7 years ago

@AbdealiJK, @guolinke,

There are two points to consider.

  1. Reading data. Should the categorical_column parameter be required to be filled, or must a feature become categorical as soon as any non-numeric value is found? I'm quite sure the decision should be based on the categorical_column parameter. Also, string representations of missing values (like 'NA' or 'nan') must be defined.

  2. Saving and exporting. We can save an additional block with the string-to-integer mapping, or we can try to use the initial string values. The latter approach could lead to many errors (for example, with legacy code).

AbdealiLoKo commented 7 years ago

I think:

  1. categorical_column would be the better and simpler approach.
  2. A string-to-integer mapping is the best way to do it. Using the initial string values throughout would require a lot of refactoring, and the integer mapping is also more memory efficient.

limexp commented 7 years ago

@AbdealiJK

  1. I didn't suggest refactoring everything and adding strings to the kernel. The question was about the representation of split values in the saved model: either an integer index (with the mapping saved separately) or the string value itself.

Laurae2 commented 7 years ago

@limexp string values would increase the model size and are not efficient; they can also cause major issues (not to mention potential name clashes, and the question of what to do with characters that are not compliant with the text format used by LightGBM).

For instance, one space character can differ from another while being visually identical.

limexp commented 7 years ago

@Laurae2, I totally agree, and each decision has its pros and cons. This feature would be great for the CLI, so data could be used without preprocessing. Is it really important for the Python or R interfaces? In any case, we lose control over NaN values.

Laurae2 commented 7 years ago

@limexp R and Python already have the conversion from categoricals to integers.

But if model deployment in production is done using the CLI, one must create a script to convert categoricals to their appropriate integers (usually done with SQL or other data warehousing software).

dah33 commented 7 years ago

I was using categorical features in a Kaggle Kernel. pandas encodes NaN as -1 when you convert categorical values to integer codes. This causes LightGBM to bomb out with a fatal error, which is silent on Kaggle and kills the kernel.

The solution is:

for col in categorical_vars:
    df[col] = pd.Categorical(df[col].cat.codes+1)

This assumes you have already created Categorical columns in pandas, e.g.

df[col] = df[col].astype('category')
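
For illustration, a minimal self-contained sketch of this failure mode and the shift above (column name hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red']})
df['color'] = df['color'].astype('category')

# cat.codes encodes NaN as -1, which LightGBM rejects for categoricals
print(df['color'].cat.codes.tolist())  # [1, 0, -1, 1]

# shift every code up by one so NaN becomes 0 instead of -1
df['color'] = pd.Categorical(df['color'].cat.codes + 1)
print(df['color'].tolist())  # [2, 1, 0, 2]
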
guolinke commented 7 years ago

@dah33 you can use NA to represent missing values. @wxchan can we add a conversion from -1 to NaN in the Python package?

Laurae2 commented 7 years ago

@guolinke @wxchan In the Python package we could enforce that anything negative is NaN for categorical variables (treating strictly -1 only as NaN would be strange).

guolinke commented 7 years ago

@Laurae2 okay, maybe I can support this on the C++ side and give a warning for this conversion.

henry0312 commented 7 years ago

There is a very difficult problem: we cannot pass categories (a list of values) to the auto conversion. This becomes a problem when one knows the true categories and not all of them appear in the training data.

for col in categorical_vars:
    # declare the full category set first, then shift codes as above
    df[col] = pd.Categorical(df[col], categories=['A', 'B', 'C', ... ])
    df[col] = pd.Categorical(df[col].cat.codes + 1)

henry0312 commented 7 years ago

There is another problem: we cannot know whether a value is outside the categories or is a missing value, because pandas.Categorical encodes both of them as -1.

henry0312 commented 7 years ago

Additionally, pandas.Categorical encodes labels to integers according to their order of appearance, I guess, so we may not reproduce the same encoding when predicting.

This will be solved by passing categories (like https://github.com/Microsoft/LightGBM/issues/789#issuecomment-322778919).
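
For illustration, a sketch of that fix, assuming the full category set is known up front (values hypothetical):

import pandas as pd

categories = ['A', 'B', 'C']  # the full known set, fixed before training

train = pd.Categorical(pd.Series(['B', 'A', 'B']), categories=categories)
test = pd.Categorical(pd.Series(['C', 'A']), categories=categories)

# codes are now stable across train and predict, independent of appearance order
print(list(train.codes))  # [1, 0, 1]
print(list(test.codes))   # [2, 0]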

dah33 commented 7 years ago

Is there a distinction to be made between nominal and ordinal features when worrying about known categories that are not present in the training data?

Can ordinal values be encoded as floats simply using Series.cat.codes.astype(float)? They would then be ordered correctly.

For nominal values not present in the training set, is there any advantage in encoding them as anything other than NaN? When would a node ever use them in a decision?
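
For illustration, a sketch of that float encoding using an ordered pandas categorical (category values hypothetical):

import pandas as pd

sizes = pd.Series(['small', 'large', 'medium']).astype(
    pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True))

# codes follow the declared order, so the float encoding preserves it
print(sizes.cat.codes.astype(float).tolist())  # [0.0, 2.0, 1.0]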

Laurae2 commented 7 years ago

@dah33 ordinal features should be numerical, while nominal features should be categorical.

dah33 commented 7 years ago

@henry0312 pandas encodes nominal categories to integers in order of appearance. Ordinal categories are encoded in a way that preserves order (0 for A, 1 for B, 2 for C, etc.), even if a category did not appear in the data frame.

dah33 commented 7 years ago

@Laurae2 thanks for the tip. The terminology is a bit misleading, as pandas calls ordinals just a "Categorical" with an order, whereas LightGBM has categorical_feature='auto', which detects Categoricals but really should only be handed nominals, as you say.

geoHeil commented 7 years ago

@Laurae2 I have a question regarding the encoding: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532 uses eval_set[i] = (valid_x, self._le.transform(valid_y)), but _le is a scikit-learn LabelEncoder, which definitely fails on unseen labels during transform. How does it then magically work, as outlined in Microsoft/LightGBM#804, to properly handle unseen values?
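
For reference, a minimal reproduction of the LabelEncoder limitation in question (labels hypothetical):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['cat', 'dog'])
print(le.transform(['dog']))  # [1]
le.transform(['bird'])  # raises ValueError: previously unseen labels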

Laurae2 commented 7 years ago

@geoHeil I don't use scikit-learn transformers, as they are known to have issues in supervised machine learning (like https://github.com/scikit-learn/scikit-learn/issues/3956 about your LabelEncoder).

It is better to prepare the categorical features oneself before feeding the data to LightGBM, or to use a custom converter like the one I made in R for LightGBM; this way you know you are doing the right preprocessing out of the box.

Laurae2 commented 7 years ago

@dah33 Whether an ordinal scale is continuous or discrete is still debatable (in theory). But for LightGBM, it is better to feed ordinals as numeric features.

geoHeil commented 7 years ago

@Laurae2 a couple of days ago you mentioned that

"The Python wrapper abstracts the categorical conversion (String -> Int) and converts it for you." That is https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/compat.py#L75, so I wonder if I should use LightGBM's Python wrapper to automate this conversion, as it still only uses a LabelEncoder, which as far as I know can't handle unseen data.

Laurae2 commented 7 years ago

@geoHeil I recommend using a separate converter, because there is no way a saved model can remember something that is not native (a LightGBM saved model does not know what Python is).

As with any preprocessing step, it must be separate from the LightGBM interface (explicit), not inside it (abstracted). The conversion is done for the user's convenience (like what lgb.cv does), but it has inherent drawbacks the user should be aware of, since it is a preprocessing step the LightGBM model itself cannot store.

I don't know exactly how it is done in the Python package, but @wxchan probably knows more about how categorical features are handled when predicting from a model (whether it is a freshly loaded model or a newly trained model).

In R, you must pass the rule converter as a preprocessing step, which does the heavy lifting for the features. If the rule converter is not saved and reused, then you cannot predict properly on new data.
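
In Python terms, such a rule converter might look like the following sketch (hypothetical data; the mapping must be persisted alongside the model):

import pandas as pd

train = pd.DataFrame({'city': ['Paris', 'Oslo', 'Paris']})
test = pd.DataFrame({'city': ['Oslo', 'Lima']})

# fit the rule once on training data, then save it with the model
mapping = {cat: code for code, cat in enumerate(train['city'].unique())}

# apply the same rule at train and predict time; unseen categories become -1
train['city'] = train['city'].map(mapping).fillna(-1).astype(int)
test['city'] = test['city'].map(mapping).fillna(-1).astype(int)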

geoHeil commented 7 years ago

@Laurae2 thanks for the clarification. @wxchan can you clarify this for python?

wxchan commented 7 years ago

The cat.codes of pandas categorical features are saved to the model after training and read from the model during prediction.

geoHeil commented 7 years ago

@wxchan thanks. So https://github.com/Microsoft/LightGBM/blob/cc771df49941f1045bcca52ea97c00288d319dca/python-package/lightgbm/basic.py#L240 is storing it, but where is this information used in the transform part, i.e. where are possibly unseen categories handled?

wxchan commented 7 years ago

@geoHeil it is stored at L231 and read at L237. Unseen categories will be -1, following the pandas cat.codes rule, I think; you can make up a small dataset to check.
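
For example, such a check might look like this (hedged sketch):

import pandas as pd

train = pd.Series(['a', 'b', 'a']).astype('category')
test = pd.Series(['a', 'c']).astype(
    pd.CategoricalDtype(categories=train.cat.categories))

print(test.cat.codes.tolist())  # [0, -1]: unseen 'c' is encoded as -1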

geoHeil commented 7 years ago

@wxchan thanks. Regarding the number of levels: for a string address field, the number of distinct categorical levels is pretty big. What would you suggest in this case?

wxchan commented 7 years ago

@geoHeil I'm not sure I understand your question. Do you mean a street address? I think you can either merge several rare categories into one big category, or extract some common information from the feature (like the city of the address).
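
For illustration, a sketch of the first suggestion, merging rare levels by frequency (threshold hypothetical):

import pandas as pd

s = pd.Series(['x', 'y', 'x', 'z', 'w', 'x'])
counts = s.value_counts()

# merge every level seen fewer than 2 times into one 'OTHER' bucket
rare = counts[counts < 2].index
s = s.where(~s.isin(rare), 'OTHER').astype('category')
print(s.tolist())  # ['x', 'OTHER', 'x', 'OTHER', 'OTHER', 'x']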

geoHeil commented 7 years ago

@wxchan exactly. I thought I had seen some min_cat and max_cat parameters in the LightGBM documentation, but I can't seem to find them now.

Is that first type of handling (merging by frequency) actually implemented?

wxchan commented 7 years ago

@geoHeil no, you need to implement it on your own. I am actually not sure what the status of categorical feature handling is right now; it seems guolinke reverted it this afternoon. I will check later.

guolinke commented 6 years ago

I feel like having this in the CLI version is not needed, and it is also too heavy. A tradeoff solution is to provide a Python script that converts the strings to integers, which is much easier.

AbdealiLoKo commented 6 years ago

Having a Python script would not be ideal for some projects, because installations on clusters can be tedious and adding more dependencies is not a good idea. Especially if LightGBM is later modified to work with S3, Redshift, or other filesystems, that would get messy, as the file I/O for a new filesystem would have to be handled twice: in C++ and in Python.

guolinke commented 6 years ago

@AbdealiJK thanks for your thoughts. However, implementing this is not trivial. It would break much of the I/O code in the current implementation and also have a large impact on I/O speed.

As for the dependency problem, you can use a binary program instead of a script. Also, I don't think it needs many dependencies; for example, you could implement this in pure Python.
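
A pure-Python converter along those lines might look like this sketch (file layout and column handling hypothetical):

import csv
import sys

def encode_csv(src, dst, cat_cols):
    # map each string level to an integer code, per column
    mappings = {c: {} for c in cat_cols}
    with open(src) as fin, open(dst, 'w', newline='') as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for c in cat_cols:
                codes = mappings[c]
                row[c] = codes.setdefault(row[c], len(codes))
            writer.writerow(row)
    return mappings  # persist this to reuse the same codes at predict time

if __name__ == '__main__':
    encode_csv(sys.argv[1], sys.argv[2], sys.argv[3].split(','))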

AbdealiLoKo commented 6 years ago

I was actually looking to contribute this and realized that it is indeed not trivial.

I agree that reading and writing can be handled by wrappers or preprocessing scripts, as implementing this would not be worth the effort.

jsh9 commented 6 years ago

I use LightGBM in Python, and I would also love for LightGBM to encode categorical features internally (i.e., "on the fly").

Having read the discussion above, I acknowledge that this is not trivial, but I think it can be implemented in the data preparation stage (i.e., when the user creates the lightgbm.Dataset object), so the training stage does not need any changes.

Here are some details:

In this way, the lightgbm.Dataset object passed to lightgbm.train is still a matrix with only numerical values, so the training subroutines do not need to be altered at all.
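
For context, the Python package already supports something close to this when the input is a pandas DataFrame; a minimal sketch with toy data, not a definitive recipe:

import lightgbm as lgb
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'red', 'green', 'blue'],
    'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
df['color'] = df['color'].astype('category')  # encoded internally via cat.codes
y = [0, 1, 0, 1, 1, 0, 1, 0]

# categorical columns can be declared explicitly (or auto-detected from dtype)
ds = lgb.Dataset(df, label=y, categorical_feature=['color'])
booster = lgb.train({'objective': 'binary', 'min_data_in_leaf': 1}, ds)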

StrikerRUS commented 5 years ago

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.