Open mtanco opened 3 years ago
The Credit Usage Kaggle data uses numeric encodings that can be hard to read. Using the lookup table from the Kaggle dataset summary, we can update the data as follows:

    # Original Dataset: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset
    import pandas as pd
    import numpy as np

    pd.set_option('display.max_columns', None)

    df = pd.read_csv('CreditCard-train.csv')

    # Gender (1=male, 2=female)
    df['SEX'] = np.select(
        [df['SEX'] == 1, df['SEX'] == 2, ~df['SEX'].isin([1, 2])],
        ['Male', 'Female', None],
    )

    # (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
    df['EDUCATION'] = np.select(
        [df['EDUCATION'] == 1, df['EDUCATION'] == 2, df['EDUCATION'] == 3,
         df['EDUCATION'] == 4, ~df['EDUCATION'].isin([1, 2, 3, 4])],
        ['Graduate School', 'University', 'High School', 'Other', None],
    )

    # MARRIAGE: Marital status (1=married, 2=single, 3=others)
    df['MARRIAGE'] = np.select(
        [df['MARRIAGE'] == 1, df['MARRIAGE'] == 2, df['MARRIAGE'] == 3,
         ~df['MARRIAGE'].isin([1, 2, 3])],
        ['Married', 'Single', 'Other', None],
    )

    # TODO: Rename for consistency
    df = df.rename(columns={'PAY_0': 'PAY_1'})

    # TODO: change PAY_X to categoricals

    # TODO: change target column for consistency
    df = df.rename(columns={'default.payment.next.month': 'DEFAULT_PAYMENT'})

    # Change target column to boolean for h2o-3 data typing
    df['DEFAULT_PAYMENT'] = np.where(df['DEFAULT_PAYMENT'] == 1, True, False)

    print(df.head())
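
As a possible alternative to the repeated `np.select` calls, the same lookup could be sketched with plain dictionaries and `Series.map`. This is only an illustration: the names `SEX_LABELS`, `EDUCATION_LABELS`, `MARRIAGE_LABELS`, and `decode_columns` are hypothetical, and the small `sample` frame stands in for `CreditCard-train.csv`. One difference to be aware of: codes missing from a dictionary come out as `NaN` here rather than `None`.

```python
import pandas as pd

# Hypothetical lookup tables, taken from the code comments in the snippet above.
SEX_LABELS = {1: "Male", 2: "Female"}
EDUCATION_LABELS = {1: "Graduate School", 2: "University", 3: "High School", 4: "Other"}
MARRIAGE_LABELS = {1: "Married", 2: "Single", 3: "Other"}


def decode_columns(df):
    """Return a copy of df with the coded columns replaced by readable labels.

    Codes that are not in a lookup table (e.g. EDUCATION 5 or 6) become NaN.
    """
    out = df.copy()
    out["SEX"] = out["SEX"].map(SEX_LABELS)
    out["EDUCATION"] = out["EDUCATION"].map(EDUCATION_LABELS)
    out["MARRIAGE"] = out["MARRIAGE"].map(MARRIAGE_LABELS)
    return out


# Tiny made-up frame used only to exercise the function; not real data.
sample = pd.DataFrame(
    {"SEX": [1, 2, 3], "EDUCATION": [1, 5, 4], "MARRIAGE": [0, 2, 3]}
)
decoded = decode_columns(sample)
print(decoded)
```

Keeping the mappings in dictionaries puts each lookup table in one place, which may make it easier to extend the same approach to the `PAY_X` columns mentioned in the TODO.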