Clean Credit Usage Data

The Credit Usage Kaggle data uses numeric encodings that can be hard to read. Using the Kaggle summary lookup table, we can update the data as follows:


# Original Dataset: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

df = pd.read_csv('CreditCard-train.csv')

# SEX: Gender (1=male, 2=female); any other code becomes None
df['SEX'] = np.select([df['SEX'] == 1, df['SEX'] == 2, ~df['SEX'].isin([1, 2])],
                      ['Male', 'Female', None])

# EDUCATION: Education level (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
# Codes outside 1-4, including the two "unknown" codes, become None
df['EDUCATION'] = np.select([df['EDUCATION'] == 1, df['EDUCATION'] == 2, df['EDUCATION'] == 3, df['EDUCATION'] == 4,
                             ~df['EDUCATION'].isin([1, 2, 3, 4])],
                            ['Graduate School', 'University', 'High School', 'Other', None])

# MARRIAGE: Marital status (1=married, 2=single, 3=others)
df['MARRIAGE'] = np.select([df['MARRIAGE'] == 1, df['MARRIAGE'] == 2, df['MARRIAGE'] == 3,
                            ~df['MARRIAGE'].isin([1, 2, 3])],
                           ['Married', 'Single', 'Other', None])

# Rename PAY_0 to PAY_1 so the repayment-status columns are numbered consistently
df = df.rename(columns={'PAY_0': 'PAY_1'})

# TODO: change PAY_X to categoricals

# Rename the target column for consistency with the other column names
df = df.rename(columns={'default.payment.next.month': 'DEFAULT_PAYMENT'})

# Change target column to boolean for h2o-3 data typing
df['DEFAULT_PAYMENT'] = df['DEFAULT_PAYMENT'] == 1

print(df.head())
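
The `PAY_X` TODO above is left open in this change. One possible way to handle it is sketched below on a small made-up frame rather than the real CSV. The assumptions here are not confirmed by the PR: that every repayment-status column is named `PAY_` followed by digits, that the integer codes should be kept as-is (no text labels), and that an ordered categorical is wanted. The names `toy`, `pay_cols`, and `pay_dtype` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned frame; the values are illustrative only.
toy = pd.DataFrame({
    'PAY_1': [-1, 0, 2, 1],
    'PAY_2': [0, 0, -2, 3],
    'PAY_AMT1': [100, 0, 250, 40],  # payment amount, must stay numeric
})

# Assumption: status columns are exactly PAY_<digits>, which skips PAY_AMT*.
pay_cols = [c for c in toy.columns if c.startswith('PAY_') and c[4:].isdigit()]

# Build one shared, ordered dtype from every code seen across the status columns.
codes = sorted(set(toy[pay_cols].to_numpy().ravel().tolist()))
pay_dtype = pd.CategoricalDtype(categories=codes, ordered=True)

# Convert column by column so each one ends up with the same category set.
for col in pay_cols:
    toy[col] = toy[col].astype(pay_dtype)

print(toy.dtypes)
```

Sharing one dtype across the columns keeps the category sets identical, which may matter if the columns are later compared or stacked. Whether h2o-3 should see these as ordered or unordered factors is a separate decision that this sketch does not settle.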
