Closed shlomihod closed 4 years ago
Is this issue taken care of by letting users choose subsets of the data for the simulator, as mentioned in #21?
import whynot as wn
import whynot.gym as gym
# Only use the first 5 features; labels are per-row, so they are left unchanged
features, labels = wn.credit.CreditData.features, wn.credit.CreditData.labels
features = features[:, :5]
env = gym.make('Credit-v0', initial_state=wn.credit.State(features, labels))
In the GiveMeSomeCredit dataset, three features are very highly correlated with one another (Pearson's r ≈ 0.98): NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, and NumberOfTime60-89DaysPastDueNotWorse. This multicollinearity might cause instability when fitting a logistic regression model. A possible mitigation is to drop these three features and include only their sum. The model trained on the resulting dataset achieves the same accuracy.
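The mitigation could look roughly like the sketch below. It is not part of whynot: `merge_columns_into_sum` is a hypothetical helper, the data is synthetic (three noisy copies of one count variable stand in for the past-due features), and the column positions 2, 3, 4 are an assumption made only for this illustration.

```python
import numpy as np


def merge_columns_into_sum(features, cols):
    """Drop the columns in `cols` and append their row-wise sum as the last column.

    Hypothetical helper for illustration; not a whynot function.
    """
    features = np.asarray(features, dtype=float)
    total = features[:, cols].sum(axis=1, keepdims=True)
    kept = np.delete(features, cols, axis=1)
    return np.hstack([kept, total])


# Synthetic stand-in for the dataset: two unrelated features plus three
# nearly identical "past due" counts (assumed to sit in columns 2, 3, 4).
rng = np.random.default_rng(0)
base = rng.poisson(1.0, size=1000).astype(float)
past_due = np.column_stack([base + rng.normal(0, 0.1, 1000) for _ in range(3)])
other = rng.normal(size=(1000, 2))
X = np.hstack([other, past_due])

# Pairwise Pearson correlations among the three past-due columns.
corr = np.corrcoef(X[:, 2:5], rowvar=False)

# Replace the three correlated columns with a single summed column.
X_merged = merge_columns_into_sum(X, [2, 3, 4])
```

With the real data, the merged array could then be passed as the feature matrix in place of the original one, assuming the labels are left unchanged.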