Questions about dataset Churn

Blank-z0 commented 1 year ago

Hi there, after reading your paper and reproducing your code on the Churn Modelling dataset, I have some questions. Since Kaggle lists all the features and their meanings, I think the split between numerical and categorical features in the paper is not entirely reasonable. I think these features should be categorical rather than numerical:

Gender ({0, 1} for {male, female})
NumOfProducts (The products they own. Integer values {1,2,3,4})
HasCrCard (Do they have a credit card or not {0,1})
IsActiveMember (Whether they are an active member {0,1})
Tenure (The time of bond with company. Integer values {0,1,2,3,4,5,6,7,8,9,10})
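To make the proposal concrete, here is a minimal sketch of the split with pandas. It assumes the raw Kaggle CSV column names; the helper `split_feature_types` and the toy rows are made up for illustration and are not part of the t2g-former code.

```python
import pandas as pd

# The five columns proposed above as categorical. This reflects the
# suggestion in this comment, not the setting used in the paper.
PROPOSED_CATEGORICAL = (
    "Gender", "NumOfProducts", "HasCrCard", "IsActiveMember", "Tenure",
)

def split_feature_types(df, categorical=PROPOSED_CATEGORICAL):
    """Hypothetical helper: return (numerical_df, categorical_df)."""
    present = [c for c in categorical if c in df.columns]
    cat_df = df[present].astype("category")  # mark as categorical dtype
    num_df = df.drop(columns=present)        # everything else stays numerical
    return num_df, cat_df

# Made-up rows, only to show the call.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Gender": [1, 1, 0],
    "Tenure": [2, 1, 8],
    "NumOfProducts": [1, 1, 3],
    "HasCrCard": [1, 0, 1],
    "IsActiveMember": [1, 1, 0],
    "Balance": [0.0, 83807.86, 159660.8],
})
num_df, cat_df = split_feature_types(toy)
print(list(num_df.columns))
print(cat_df.dtypes)
```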

By the way, I downloaded the datasets from the link you provided. In the Churn dataset, info.json mistakenly lists "n_num_features" as 10 (the correct value should be 9).
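A quick way to check this kind of mismatch is to compare the declared count with the number of columns in the numerical feature array. The sketch below assumes the numerical features are stored as a 2-D array; the function name `check_n_num_features` is hypothetical, and the placeholder array with 9 columns only stands in for the real data, whose actual shape is not stated in this thread.

```python
import numpy as np

def check_n_num_features(info, x_num):
    """Compare the count declared in info with the actual column count."""
    declared = info["n_num_features"]
    actual = x_num.shape[1]
    return declared == actual, declared, actual

info = {"n_num_features": 10}  # value reported above for Churn
x_num = np.zeros((4, 9))       # placeholder array (assumption), 9 columns

ok, declared, actual = check_n_num_features(info, x_num)
if not ok:
    print(f"info.json declares {declared} numerical features, array has {actual}")
```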

jyansir commented 1 year ago

Thank you for your careful suggestions on the preprocessing of the Churn dataset! Actually, we also found some data mistakes during our experiments, including repeated columns (which may lead to a wrong n_num_features) and the unreasonably processed features you mention (like "gender" {0,1} as a numerical feature). The same data files can be obtained from the data sources of Yandex's FT-Transformer and Numerical Embeddings. We found that the Churn data files used there treat "gender" as a numerical feature (along with the other data mistakes that appear in their provided files), so for a fair comparison we followed their settings in our experiments.
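For anyone who wants to look for the repeated columns mentioned here, a minimal sketch with NumPy follows. The helper `find_duplicate_columns` and the toy array are illustrative assumptions, not code from this repository, and the check only catches exact repeats.

```python
import numpy as np

def find_duplicate_columns(x):
    """Return (i, j) pairs, i < j, where column j exactly repeats column i."""
    pairs = []
    for j in range(x.shape[1]):
        for i in range(j):
            if np.array_equal(x[:, i], x[:, j]):
                pairs.append((i, j))
                break  # report only the first earlier match for column j
    return pairs

# Toy array: the third column repeats the first.
x = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 2.0],
              [3.0, 0.0, 3.0]])
dups = find_duplicate_columns(x)
print(dups)
```

Dropping the second index of each reported pair would remove the repeats, which would also change the numerical feature count.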

Personally, I do agree that the "gender" feature is a categorical one.

Blank-z0 commented 1 year ago

Thank you for your reply, I got it. I'll try running some experiments that treat the possibly categorical features as categorical during preprocessing.

jyansir / t2g-former

Questions about dataset Churn #4