jpmml / jpmml-lightgbm

Java library and command-line application for converting LightGBM models to PMML
GNU Affero General Public License v3.0

Mapped Value in DataField #17

Closed sarthakchhillar12 closed 5 years ago

sarthakchhillar12 commented 5 years ago

I am getting features in the PMML file as:

```xml
<DataField name="screen_size" optype="categorical" dataType="integer">
  <Value value="1"/>
  <Value value="3"/>
  <Value value="4"/>
  <Value value="5"/>
</DataField>
```

whereas the original screen_size feature values look like "1440x900", "412x732", "360x640", "414x736".

A similar thing is happening for all features.

vruusmann commented 5 years ago

`<DataField name="screen_size" optype="categorical" dataType="integer">`

LightGBM represents categorical features using the categorical integer type, so everything is 100% correct.

The conversion from categorical strings ("1440x900") to categorical integers (1) happens somewhere else. Looks like this information is simply not made available to the JPMML-LightGBM library.
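
For illustration, a minimal pandas sketch of where that conversion typically happens (assuming the `category` dtype is used, as is common with LightGBM):

```python
import pandas as pd

# Strings become integer codes as soon as the column is cast to "category";
# LightGBM only ever sees the codes, never the original strings.
s = pd.Series(["1440x900", "412x732", "360x640", "414x736"], dtype="category")
print(list(s.cat.categories))  # ['1440x900', '360x640', '412x732', '414x736']
print(s.cat.codes.tolist())    # [0, 2, 1, 3]
```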

How is the LightGBM model trained? Please share a reproducible example.

sarthakchhillar12 commented 5 years ago

@vruusmann A reproducible example:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

categorical_features = ['c1', 'c2', 'c3', 'c4']

matrix = np.random.choice(['a', 'b', 'c', 'd'], 400 * 4)
matrix.resize(400, 4)
df = pd.DataFrame(matrix, columns=categorical_features)
df['target'] = np.random.binomial(1, 0.5, df.shape[0])

def as_cat(df):
    for feature_name in categorical_features:
        df[feature_name] = df[feature_name].astype('category')
    return df

def model_sample(X_train):
    X_train = as_cat(X_train)
    X_train["weight"]=1      
    valid_features=categorical_features
    parameters={'num_leaves': 50, 
                'max_cat_threshold': 200, 
                'n_jobs': -1, 
                'colsample_bytree':  0.9995, 
                'verbose': 1, 
                'lambda_l1': 2, 
                'learning_rate': 0.5, 
                'lambda_l2': 4, 
                'min_child_weight': 0.722, 
                'min_split_gain': 0.0585, 
                'cat_l2': 136.125, 
                'min_data_in_leaf': 25, 
                'objective': 'binary', 
                'cat_smooth': 10,
                'max_bin':200,
                'bagging_fraction': 1, 
                'max_depth': 10, 
                'metric': ['auc', 'binary_logloss'],
                'boosting_type': 'gbdt'}

    lgb_train_data = lgb.Dataset(X_train[valid_features], label=X_train.target,
                                 categorical_feature=categorical_features,
                                 weight=X_train["weight"].values,
                                 free_raw_data=False)
    print("running")

    model = lgb.train(parameters,
                      lgb_train_data,
                      num_boost_round=10,
                      valid_sets=[lgb_train_data],
                      valid_names=['train'],
                      verbose_eval=0)

    return model

df = as_cat(df)
model = model_sample(df)
model.save_model('/data/rpm/vl2a/models/random.txt')
```
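
For reference, the code-to-string mapping that pandas assigns can be captured from the DataFrame before the model is saved, so the integer values in the generated PMML stay interpretable. A minimal sketch against the `df` above:

```python
# Record which integer code corresponds to which original string value.
mappings = {name: dict(enumerate(df[name].cat.categories))
            for name in categorical_features}
print(mappings['c1'])  # e.g. {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
```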

Adding a screenshot of how the actual data looks and the PMML generated. The PMML does not contain any mapping from the actual data values to the categorical integers.

Attachments: randompmml.txt, screenshot 2018-12-21 at 5.45.06 pm, screenshot 2018-12-21 at 5.48.46 pm

sarthakchhillar12 commented 5 years ago

@vruusmann Please fix: the problem exists when using LightGBM version 2.2 and not when using 2.1. The LightGBM 2.2 model output contains the following extra lines:

```
end of trees

parameters:
[boosting: gbdt]
[objective: binary]
[metric: binary]....
....[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]

end of parameters

```

Look at the diffs: model diff https://www.diffchecker.com/HgNQ83Ik, PMML diff https://www.diffchecker.com/Y94QwzxT

The problem is caused by these extra lines.
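
Until the converter handles the new section, a hypothetical workaround would be to strip everything between `parameters:` and `end of parameters` from the 2.2 model file before conversion (file names below are assumptions):

```python
# Drop the "parameters:" ... "end of parameters" block that LightGBM 2.2
# appends after "end of trees", keeping the rest of the file intact.
with open("model_lightgbmversion2_2.txt") as f:
    lines = f.readlines()

in_params = False
kept = []
for line in lines:
    stripped = line.strip()
    if stripped == "parameters:":
        in_params = True
        continue
    if stripped == "end of parameters":
        in_params = False
        continue
    if not in_params:
        kept.append(line)

with open("model_stripped.txt", "w") as f:
    f.writelines(kept)
```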

Attaching the various files involved:
model_lightgbmversion2_1.txt, model_lightgbmversion2_2.txt, pmml_lightgbmversion2_1.txt, pmml_lightgbmversion2_2.txt

yairdata commented 4 years ago

I am using LightGBM 2.3.0 with JPMML-LightGBM version 1.3.0 and got the exact same issue. What is the resolution for this replacement of the real string values of the categorical features with integers when using continued learning?
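
One thing worth checking (assuming the model was trained from a pandas DataFrame): whether the saved model file still ends with a `pandas_categorical:` line, since that is where the original string values would have to come from. A minimal sketch:

```python
# If this line is absent, the converter has no string values to map to,
# and the PMML can only contain the integer codes.
with open("random.txt") as f:  # path to the saved LightGBM model file
    for line in f:
        if line.startswith("pandas_categorical:"):
            print(line.strip())
```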