Potential Multicollinearity Issue with Credit Simulator

socialfoundations / whynot

A Python sandbox for decision making in dynamics

MIT License


Potential Multicollinearity Issue with Credit Simulator #20

Closed shlomihod closed 4 years ago

shlomihod commented 4 years ago

In the GiveMeSomeCredit dataset, three features are highly correlated with each other, with Pearson's r ~ 0.98: NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, and NumberOfTime60-89DaysPastDueNotWorse. This multicollinearity might cause instability when fitting the Logistic Regression model. A possible mitigation is to drop these three features and include only their sum. A model fit on the modified dataset achieves the same accuracy.
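A minimal sketch of the proposed mitigation, using NumPy on synthetic data rather than the real GiveMeSomeCredit columns. The column positions, the synthetic values, and the helper name `merge_columns` are assumptions made for illustration; none of this is part of whynot.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the three past-due count columns (not the real data).
# A handful of rows carry the same large value in every column, which makes
# the columns almost perfectly correlated.
shared = np.zeros(n)
shared[:10] = 98.0
past_due = np.column_stack([rng.poisson(0.3, n) + shared for _ in range(3)])

other = rng.normal(size=(n, 2))          # two unrelated features
features = np.hstack([other, past_due])  # past-due columns sit at 2, 3, 4


def merge_columns(X, cols):
    """Drop the columns in `cols` and append their row-wise sum as the last column."""
    keep = [j for j in range(X.shape[1]) if j not in cols]
    total = X[:, cols].sum(axis=1, keepdims=True)
    return np.hstack([X[:, keep], total])


# 3x3 Pearson correlation matrix of the past-due columns
corr = np.corrcoef(past_due, rowvar=False)

# Replace the three correlated columns with a single summed column
merged = merge_columns(features, [2, 3, 4])

print(corr.round(3))
print(features.shape, "->", merged.shape)
```

On the real data, the same helper would be applied to whichever column indices hold the three past-due counts before fitting the model.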

millerjohnp commented 4 years ago

Is this issue taken care of by letting users choose subsets of the data for the simulator, as mentioned in #21?

import whynot as wn
import whynot.gym as gym

# Only use the first 5 features; labels have one entry per row, so they stay unchanged
features, labels = wn.credit.CreditData.features, wn.credit.CreditData.labels
features = features[:, :5]
env = gym.make('Credit-v0', initial_state=wn.credit.State(features, labels))