We learned that it is hard to run an evaluation that is fair to the baseline and whose results we can trust.
I think we need to do the following:
- refactor the code and add relevant unit tests
- (limited?) hyperparameter tuning for a given prediction model
- run different models (linear regression, regularized linear regression, support vector machine, possibly others) and report the best-performing one for each set of inputs; currently we have hard-coded hyperparameters for each prediction task (see the sketch after this list)
  - it should also be flexible for different prediction tasks
- once this is implemented, we can also think about which additional prediction models to try; tree/forest-based models could be interesting, but they take a while to give stable predictions with the amount of data (k) we have
- I think there is much scope for parallelization
  - across data and inputs (baseline variables, embeddings, embeddings + baseline; also different years of the same outcome); see the parallelization sketch below
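
A rough sketch of what the per-input-set model comparison could look like, assuming a scikit-learn setup; the candidate models, hyperparameter grids, and scoring choice here are illustrative placeholders, not our actual configuration:

    # Sketch: for one prediction task, compare several model families via
    # cross-validated grid search instead of hard-coded hyperparameters.
    # Candidate models/grids and the r2 scoring are illustrative choices only.
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV, KFold

    CANDIDATES = {
        "linear": (LinearRegression(), {}),
        "ridge":  (Ridge(), {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}),
        "svr":    (SVR(), {"svr__C": [0.1, 1.0, 10.0], "svr__epsilon": [0.01, 0.1]}),
    }

    def best_model_for(X, y, n_splits=5, seed=0):
        """Return (name, fitted GridSearchCV) of the best-scoring candidate family."""
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        searches = {}
        for name, (estimator, grid) in CANDIDATES.items():
            # Scale features and tune each family on the same CV splits.
            pipe = make_pipeline(StandardScaler(), estimator)
            search = GridSearchCV(pipe, grid, cv=cv, scoring="r2")
            search.fit(X, y)
            searches[name] = search
        best = max(searches, key=lambda name: searches[name].best_score_)
        return best, searches[best]

Tree/forest-based candidates (e.g. RandomForestRegressor) could later be added to the same CANDIDATES dict, and the function itself is agnostic to which prediction task X and y come from.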
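And a minimal sketch of the parallelization across input sets and outcome years, assuming joblib; the input-set labels, the year list, and the run_one helper are hypothetical placeholders:

    # Sketch: the (input set, outcome year) evaluations are independent,
    # so they can run in parallel. Labels, years, and run_one are placeholders.
    from itertools import product
    from joblib import Parallel, delayed

    INPUT_SETS = ["baseline", "embeddings", "embeddings+baseline"]  # placeholder labels
    OUTCOME_YEARS = [1, 2, 3]  # placeholder: different years of the same outcome

    def run_one(input_set, year):
        """Placeholder: load data for this combination, run the model comparison, return scores."""
        ...

    def run_all(n_jobs=-1):
        # Fan the independent combinations out over all available cores.
        tasks = list(product(INPUT_SETS, OUTCOME_YEARS))
        results = Parallel(n_jobs=n_jobs)(
            delayed(run_one)(input_set, year) for input_set, year in tasks
        )
        return dict(zip(tasks, results))
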
Most of this work can be done on the fake data.