[x] Make sure to split into 80% train / 20% test before imputation (see the split sketch after this list)
[x] Update the script to use data closer to the real data: clean/jobs_formod.csv (see Slack message)
[x] I'd avoid OneHotEncoder and instead use pd.get_dummies. Definitely avoid hand-coding the dummies; pd.get_dummies should work if you feed it a list of columns (see the get_dummies sketch after this list)
[ ] I'd separate into two scripts: (1) a preprocessing script that writes four objects: training feature matrix, test feature matrix, training labels, and test labels; and (2) a modeling/evaluation script, which should store the models in a list and iterate over them (see the two-script sketch after this list)
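
For the split-before-imputation item, a minimal sketch of the intended order, assuming the data comes from clean/jobs_formod.csv; the numeric column names ("age", "income") are placeholders, not the real columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

df = pd.read_csv("clean/jobs_formod.csv")

# 80/20 split first, so test rows never influence the imputation statistics
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()

# Fit the imputer on the training split only, then apply it to both splits
numeric_cols = ["age", "income"]  # placeholder column names
imputer = SimpleImputer(strategy="median")
train_df[numeric_cols] = imputer.fit_transform(train_df[numeric_cols])
test_df[numeric_cols] = imputer.transform(test_df[numeric_cols])
```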
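
For the pd.get_dummies item, a sketch of encoding an explicit list of categorical columns; the function name and column names are hypothetical, and the reindex step is one way to keep the train and test matrices aligned when a category appears in only one split:

```python
import pandas as pd

def encode_categoricals(train_df: pd.DataFrame, test_df: pd.DataFrame, cat_cols: list[str]):
    """One-hot encode cat_cols with pd.get_dummies and align test columns to train."""
    train_enc = pd.get_dummies(train_df, columns=cat_cols, dummy_na=True)
    test_enc = pd.get_dummies(test_df, columns=cat_cols, dummy_na=True)
    # Categories seen only in train become all-zero columns in test;
    # categories seen only in test are dropped so the matrices match.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc
```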
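
For the two-script split, an illustrative layout, assuming the four objects get written as CSVs under clean/ and that the particular models listed are placeholders:

```python
# preprocess.py would end by writing the four objects, e.g.:
#   X_train.to_csv("clean/X_train.csv", index=False)
#   X_test.to_csv("clean/X_test.csv", index=False)
#   y_train.to_csv("clean/y_train.csv", index=False)
#   y_test.to_csv("clean/y_test.csv", index=False)

# model.py: read the four objects, then iterate over a list of models
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train = pd.read_csv("clean/X_train.csv")
X_test = pd.read_csv("clean/X_test.csv")
y_train = pd.read_csv("clean/y_train.csv").squeeze("columns")
y_test = pd.read_csv("clean/y_test.csv").squeeze("columns")

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=42),
]

# Fit and evaluate each model in turn
for model in models:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(type(model).__name__, accuracy_score(y_test, preds))
```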
selected_cat_large (the threshold should make the final unique values in the col around 100)
outcome variables[1]