tanzir5 opened 2 months ago
The goal is to make all features in the dataset numerical before feeding them into a GBRF.
There are 5 different types of variables (according to the CodeBook), and this is how I handle them:
I use mean imputation for now; more advanced methods like regression imputation can be tried later if needed.
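Mean imputation can be sketched as follows; the column names here are illustrative toy data, not variables from the CodeBook:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
df = pd.DataFrame({"income": [1000.0, None, 3000.0],
                   "age": [25.0, 30.0, None]})

# Mean imputation: replace each NaN with the column mean
# computed over the observed values of that column.
df_imputed = df.fillna(df.mean(numeric_only=True))
```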
Next issue to fix:
train_background does not have unique rows per person; it is unique per person + survey type + year. Decide how to make it unique per person before merging with train.csv.
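The duplication can be confirmed with a quick check; this is a toy stand-in for train_background, and the column names (`person_id`, `survey`, `year`) are assumptions, not the real schema:

```python
import pandas as pd

# Toy stand-in: rows are unique per (person, survey type, year),
# but not per person alone.
bg = pd.DataFrame({
    "person_id": [1, 1, 2],
    "survey": ["cf", "cf", "cf"],
    "year": [2019, 2020, 2020],
})

# Count rows that repeat an already-seen person id.
dupes = bg["person_id"].duplicated().sum()
print(dupes)  # 1 extra row for person 1
```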
Run the code on AI-orion, as my MacBook is running out of memory.
For train_background, only the latest row for each person was kept. I trained the GBRF on 10% of the attributes, and the performance is not bad.
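Keeping the latest row per person can be done with a sort followed by a deduplication; again the column names below are illustrative assumptions:

```python
import pandas as pd

# Toy train_background with repeated persons across years.
bg = pd.DataFrame({
    "person_id": [1, 1, 2],
    "year": [2019, 2020, 2020],
    "income": [900, 1000, 2000],
})

# Sort by year, then keep only the most recent row per person;
# the result is unique per person and ready to merge with train.csv.
latest = (bg.sort_values("year")
            .drop_duplicates("person_id", keep="last")
            .reset_index(drop=True))
```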
Next, figure out how to train on all variables, or how to pick the variables with the highest predictive power.
The PreFer submission procedure does not let me use train_outcome and the entire train set for encoding the holdout_train set, which makes target encoding impossible. Sent an email to the organizers.
TODO: attend office hour on May 6 if needed.
The phase 2 submissions were done on May 13. I made the following changes:
i) I used the variable 'cf20m130' as a PROXY_TARGET for target encoding, since we do not have access to train_outcome during testing. cf20m130 is the question "Within how many years do you want to have a kid?"
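Proxy target encoding amounts to replacing each category with the mean of cf20m130 over that category in the training data. A minimal sketch, where 'region' is a hypothetical categorical column and the fallback-to-global-mean handling of unseen categories is my assumption:

```python
import pandas as pd

# Toy training data: a categorical feature and the proxy target.
train = pd.DataFrame({
    "region": ["a", "a", "b", "b"],
    "cf20m130": [1.0, 3.0, 5.0, 5.0],
})
holdout = pd.DataFrame({"region": ["a", "b", "c"]})

# Encode each category by the mean proxy-target value seen in train.
means = train.groupby("region")["cf20m130"].mean()
global_mean = train["cf20m130"].mean()

# Categories unseen in train fall back to the global mean.
holdout["region_te"] = holdout["region"].map(means).fillna(global_mean)
```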
ii) I uploaded a file containing the mean values of all required variables in the training set, since we cannot upload the train data itself but can upload summary statistics. I use it to fill in NaN values. Note that this file contains the variables created after preprocessing (one-hot encoding and target encoding), not the original variables.
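The summary-statistics workflow can be sketched as below; the filename and the column names (target-encoded `x_te`, one-hot `x_onehot_a`) are hypothetical:

```python
import pandas as pd

# Train side: persist per-column means of the *preprocessed* features.
train_processed = pd.DataFrame({"x_te": [0.5, 1.5, None],
                                "x_onehot_a": [1.0, 0.0, 1.0]})
train_processed.mean().to_csv("train_means.csv", header=False)

# Test side: load the means and fill NaNs in the holdout features.
means = pd.read_csv("train_means.csv",
                    header=None, index_col=0).squeeze("columns")
holdout = pd.DataFrame({"x_te": [None, 2.0], "x_onehot_a": [None, 1.0]})
holdout_filled = holdout.fillna(means)
```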
iii) I trained 10 different models, each on 10% of the preprocessed variables. To rank the original, unprocessed variables, I use a weighted scheme combining each variable's importance according to its GBRF with the f1 score of that GBRF. I then keep the top 200 strongest variables and drop all others from the raw dataset. I built the final GBRF on these and submitted it in phase 2.
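One way to read the weighted scheme: a variable's score is the sum, over the models it appears in, of importance times that model's f1. A sketch with two models instead of ten, where the variable names and the suffix-stripping rule for mapping preprocessed columns back to original variables are my assumptions:

```python
# importances[m] maps preprocessed column -> GBRF importance in model m.
importances = [{"v1_te": 0.7, "v2_a": 0.3},
               {"v2_b": 0.6, "v3": 0.4}]
f1_scores = [0.80, 0.60]  # holdout f1 of each model

def original(name):
    # Map a preprocessed column (e.g. one-hot 'v2_a') back to its
    # original variable ('v2'); here simply strip the suffix.
    return name.split("_")[0]

# Accumulate importance * f1 per original variable across models.
scores = {}
for imp, f1 in zip(importances, f1_scores):
    for var, weight in imp.items():
        scores[original(var)] = scores.get(original(var), 0.0) + weight * f1

# Rank original variables; in practice keep the top 200.
top = sorted(scores, key=scores.get, reverse=True)
```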
I am creating two baselines. Model 2 is in progress right now, with details in the comments.