Closed sarthakchhillar12 closed 5 years ago
DataField name="screen_size" optype="categorical" dataType="integer"
LightGBM represents categorical features using the categorical integer type, so everything is 100% correct.
The conversion from categorical strings (`"1440x900"`) to categorical integers (`1`) happens somewhere else. Looks like this information is simply not made available to the JPMML-LightGBM library.
How is the LightGBM model trained? Please share a reproducible example.
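For readers following along, the string-to-integer conversion mentioned above can be illustrated with pandas alone. This is a minimal sketch, assuming the training data uses pandas `category` columns (as the reproducible example in this thread does); the sample values are hypothetical and only modeled on the screen sizes quoted here.

```python
import pandas as pd

# Hypothetical sample values, modeled on the screen sizes quoted in this thread.
screen_size = pd.Series(
    ["1440x900", "412x732", "360x640", "1440x900"]
).astype("category")

# pandas sorts the distinct labels and assigns each one an integer code.
categories = screen_size.cat.categories.tolist()
codes = screen_size.cat.codes.tolist()

# Integer code -> original string label.
code_to_label = dict(enumerate(categories))

print(code_to_label)
print(codes)
```

The model only ever sees the integer codes, so the label list in `categories` has to travel with the model separately if the original strings are to be recovered later.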
@vruusmann A more reproducible example:
```python
import numpy as np
import pandas as pd
import lightgbm as lgb

categorical_features = ['c1', 'c2', 'c3', 'c4']

# Build a 400x4 frame of random string categories plus a random binary target.
matrix = np.random.choice(['a', 'b', 'c', 'd'], 400 * 4)
matrix.resize(400, 4)
df = pd.DataFrame(matrix, columns=categorical_features)
df['target'] = np.random.binomial(1, 0.5, df.shape[0])


def as_cat(df):
    # Convert the string columns to the pandas 'category' dtype.
    for feature_name in categorical_features:
        df[feature_name] = df[feature_name].astype('category')
    return df


def model_sample(X_train):
    X_train = as_cat(X_train)
    X_train["weight"] = 1  # uniform sample weights
    valid_features = categorical_features
    parameters = {'num_leaves': 50,
                  'max_cat_threshold': 200,
                  'n_jobs': -1,
                  'colsample_bytree': 0.9995,
                  'verbose': 1,
                  'lambda_l1': 2,
                  'learning_rate': 0.5,
                  'lambda_l2': 4,
                  'min_child_weight': 0.722,
                  'min_split_gain': 0.0585,
                  'cat_l2': 136.125,
                  'min_data_in_leaf': 25,
                  'objective': 'binary',
                  'cat_smooth': 10,
                  'max_bin': 200,
                  'bagging_fraction': 1,
                  'max_depth': 10,
                  'metric': ['auc', 'binary_logloss'],
                  'boosting_type': 'gbdt'}
    lgb_train_data = lgb.Dataset(X_train[valid_features], label=X_train.target,
                                 categorical_feature=categorical_features,
                                 weight=X_train["weight"].values,
                                 free_raw_data=False)
    print("running")
    model = lgb.train(parameters,
                      lgb_train_data, verbose_eval=0,
                      num_boost_round=10, valid_sets=[lgb_train_data],
                      valid_names=['train'])
    return model


df = as_cat(df)
model = model_sample(df)
model.save_model('/data/rpm/vl2a/models/random.txt')
```
Adding a screenshot of how the actual data looks and the PMML that was generated. The PMML does not contain any mapping from the actual data to the categorical integers.
@vruusmann Please fix. The problem exists when using LightGBM version 2.2 and not when using 2.1. The LightGBM version 2.2 model output contains the following extra lines:
```
end of trees
parameters:
[boosting: gbdt]
[objective: binary]
[metric: binary]....
....[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
end of parameters
```
Look at the diffs:
model diff: https://www.diffchecker.com/HgNQ83Ik
pmml diff: https://www.diffchecker.com/Y94QwzxT
The problem is caused by these extra lines.
Attaching the various files involved
model_lightgbmversion2_1.txt
model_lightgbmversion2_2.txt
pmml_lightgbmversion2_1.txt
pmml_lightgbmversion2_2.txt
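A quick way to compare two saved model files along these lines is sketched below. It assumes the text model carries a `pandas_categorical:` line holding a JSON list of category labels, which is what the LightGBM Python package writes for pandas `category` columns; whether the position of that line relative to the `parameters:` block is what trips up the converter is not confirmed in this thread. The helper names and the shortened sample text are hypothetical, not taken from the attached files.

```python
import json


def read_pandas_categorical(model_text):
    """Return the category lists from a `pandas_categorical:` line, or None."""
    prefix = "pandas_categorical:"
    # Search from the end, since the line is expected near the bottom of the file.
    for line in reversed(model_text.splitlines()):
        if line.startswith(prefix):
            return json.loads(line[len(prefix):])
    return None


def has_parameters_block(model_text):
    """Report whether the text contains a parameters section."""
    lines = [line.strip() for line in model_text.splitlines()]
    return "parameters:" in lines and "end of parameters" in lines


# Hypothetical, heavily shortened model text for illustration only.
sample = "\n".join([
    "tree",
    "end of trees",
    "",
    "parameters:",
    "[boosting: gbdt]",
    "[objective: binary]",
    "end of parameters",
    "",
    'pandas_categorical:[["a", "b", "c", "d"]]',
])

print(has_parameters_block(sample))
print(read_pandas_categorical(sample))
```

Running both helpers on the 2.1 and 2.2 model files would show whether the category labels are present in each and whether the parameters section sits in front of them.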
I am using LightGBM 2.3.0 with JPMML-LightGBM version 1.3.0 and got the exact same issue. What is the resolution for this replacement of the real string values of the categorical features with integers when using continued learning?
I am getting features in the PMML file as: `DataField name="screen_size" optype="categorical" dataType="integer"` with `Value value="1"`, `Value value="3"`, `Value value="4"`, `Value value="5"`,
whereas the original screen_size feature values look like "1440x900", "412x732", "360x640", "414x736".
A similar thing is happening for all features.
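Until the mapping is available in the PMML itself, one possible workaround is to encode the raw strings with the training-time category list before scoring. This is only a sketch of that idea, not a fix confirmed by the thread: the function name and the category list are hypothetical, and the real list has to come from the training data (for example from `df[col].cat.categories`).

```python
import pandas as pd


def encode_with_training_categories(values, training_categories):
    """Map raw string values to the integer codes used at training time.

    Values that were not seen during training are encoded as -1.
    """
    cat = pd.Categorical(values, categories=training_categories)
    return cat.codes.tolist()


# Hypothetical category list; the order must match the training-time order.
training_categories = ["1440x900", "360x640", "412x732", "414x736"]

encoded = encode_with_training_categories(
    ["360x640", "1440x900", "800x600"], training_categories
)
print(encoded)
```

The encoded integers can then be passed to the PMML scorer in place of the strings, as long as the category order is exactly the one the model was trained with.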