odissei-lifecourse / life-sequencing-dutch


PreFer baseline prediction using LISS #10

Open tanzir5 opened 2 months ago

tanzir5 commented 2 months ago

I am creating two baselines:

  1. A baseline based on just age and sex: a logistic regression always predicts "0", achieving an accuracy of 0.78 and a terrible F1 score (0). A random forest predicts both 0 and 1, achieving an accuracy of 0.79 and a less terrible F1 score.
  2. A Gradient Boosted Random Forest (GBRF) that takes into account all data from train.csv and train_background.csv.

Model 2 is in progress right now; details are in the comments below.
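For reference, baseline 1 could look roughly like this. A minimal sketch, assuming a DataFrame `df` holding the merged training data with hypothetical column names "age", "gender", and the binary outcome "new_child":

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# df is assumed to exist; column names are placeholders.
X = df[["age", "gender"]]
y = df["new_child"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(type(model).__name__,
          f"acc={accuracy_score(y_te, pred):.2f}",
          f"f1={f1_score(y_te, pred):.2f}")
```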

tanzir5 commented 2 months ago

Preprocessing (convert all variables to numerical):

The goal is to make all features in the dataset numeric before feeding them into a GBRF.

There are 5 different types of variables (according to the CodeBook), and this is how I handle each:

  1. numeric: keep as is
  2. categorical: a mixture of one-hot encoding and target encoding. For variables with more than 15 categories, I use target encoding; for the rest, one-hot encoding with an additional "missing" category (see the sketch after this list)
  3. response to an open-ended question: convert the feature to a binary indicator of whether it had any value or was missing
  4. date or time: drop these features (the only variables of this type are the date/time when the survey was started or ended)
  5. character [almost exclusively empty strings]: drop these features (I couldn't work out what they are by exploring, but they were always empty for the ones I checked)
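A minimal sketch of the categorical handling in step 2, assuming `df` is the feature frame, `cat_cols` a list of its categorical columns, and `target` the outcome series (all names hypothetical):

```python
import pandas as pd

def encode_categoricals(df, cat_cols, target):
    """One-hot encode low-cardinality columns (with an explicit
    'missing' level); mean target-encode high-cardinality ones."""
    out = df.copy()
    for col in cat_cols:
        filled = out[col].astype(object).fillna("missing")
        if filled.nunique() > 15:
            # Target encoding: replace each category by the mean outcome.
            means = target.groupby(filled).mean()
            out[col] = filled.map(means)
        else:
            # One-hot encoding, with "missing" as its own dummy column.
            dummies = pd.get_dummies(filled, prefix=col)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
    return out
```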
tanzir5 commented 2 months ago

Preprocessing (imputation for nan, missing):

I use mean imputation for now. More advanced methods like regression imputation can be tried if needed in the future.
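With everything numeric, the imputation step is a one-liner with scikit-learn; `X_encoded` is a hypothetical name for the post-encoding matrix:

```python
from sklearn.impute import SimpleImputer

# Replace every remaining NaN with the column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_encoded)
```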

tanzir5 commented 2 months ago

Next issue to fix:

train_background does not have unique rows per person; rows are only unique per person + survey type + year. Decide how to collapse it to one row per person before merging with train.csv.

Run the code on AI-orion, as my MacBook is running out of memory.

tanzir5 commented 2 months ago

For train_background, I kept only the latest record for each person. I trained the GBRF on 10% of the attributes, and the performance is not too bad.
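The "keep the latest record" step could look roughly like this, assuming hypothetical column names "nomem_encr" for the person id and "wave" for survey type + year, with "wave" sorting chronologically:

```python
# One row per person: sort chronologically, keep the most recent record.
background = (
    train_background.sort_values("wave")
                    .drop_duplicates("nomem_encr", keep="last")
)
merged = train.merge(background, on="nomem_encr", how="left")
```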

Next, figure out how to train on all variables, or how to pick the variables with the highest predictive power.

tanzir5 commented 2 months ago

The PreFer submission procedure does not let me use train_outcome or the full train set to encode the holdout set, which makes target encoding impossible. I sent an email to the organizers.

TODO: attend office hour on May 6 if needed.

tanzir5 commented 1 month ago

The phase 2 submission was made on May 13. I made the following changes:

i) I used the variable 'cf20m130' as a PROXY_TARGET for target encoding, since we do not have access to train_outcome during testing. cf20m130 is the question "Within how many years do you want to have a kid?"
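A sketch of what the proxy encoding could look like, with `high_card_cols` as a hypothetical list of the >15-category columns:

```python
# Target-encode high-cardinality categoricals against the proxy target
# cf20m130 instead of the unavailable train_outcome.
proxy = df["cf20m130"]
for col in high_card_cols:
    filled = df[col].astype(object).fillna("missing")
    df[col] = filled.map(proxy.groupby(filled).mean())
```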

ii) I uploaded a file containing the mean values of all required variables in the training set, since we cannot upload the training data itself but can upload summary statistics. I use it to fill in NaN values. Note that this file contains the variables created after the one-hot and target encoding preprocessing, not the original variables.
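The mechanics could be as simple as this (variable names hypothetical):

```python
import pandas as pd

# Training side: persist column means of the preprocessed matrix,
# since summary statistics may be uploaded but raw data may not.
X_train_processed.mean().to_csv("train_means.csv")

# Submission side: load the means and fill NaNs in the processed holdout.
means = pd.read_csv("train_means.csv", index_col=0).squeeze("columns")
X_holdout_processed = X_holdout_processed.fillna(means)
```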

iii) I ran 10 different models each containing 10% of the preprocessed variables. I use a weighted scheme involving both the importance of a variable according to the GBRF and the f1 score of the associated GBRF to rank original un-processed variables. Then I take the top 200 strongest variables and ignore all other variables from the raw dataset. I created the GBRF and submitted in phase 2.