Define modelling part of recipe

Workflow extracted from https://github.com/phenology/springtime/blob/91ba9abb5faf9fcc763fb37385fe96007d104426/docs/notebooks/mk_modelling_npn.ipynb and ported, very naively, to a recipe-like format:

experiment:
  target: first_leaves_doy
  folds:
    # choose one of https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
    default/simple:
      method: sklearn.model_selection.train_test_split
      test_size: 0.33
      random_state: 42

  transform:
    scale:
      method: sklearn.preprocessing.StandardScaler

  models:
    simple_linear:
      model: sklearn.linear_model.LinearRegression

    random_forest:
      model: sklearn.ensemble.RandomForestRegressor
      n_estimators: 300

    mixed_effects_random_forest:
      model: merf.MERF
      fixed_effects: null  # "the rest" after removing cluster, random effects, and target columns
      random_effects: ['tmax_365','tmin_365', 'prcp_365', 'srad_365', 'swe_365']
      clusters: "site_id"

    # explainable_boosting_machine:
    #   model: interpret.glassbox.ExplainableBoostingRegressor
    #   interactions: 0

    # mixed_effects_ebm:
    #   model: merf.MERF
    #   fixed_effects_model:
    #     class: interpret.glassbox.ExplainableBoostingRegressor
    #     options:
    #       interactions: 0

  scores:
    mean_absolute_error: sklearn.metrics.mean_absolute_error
    mean_squared_error: sklearn.metrics.mean_squared_error
    r2: sklearn.metrics.r2_score

  visualize:
    plot: standardplot
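
A minimal sketch of how the dotted paths in a recipe like the one above could be turned into Python objects. The helper names `resolve` and `build_model` are hypothetical, not part of springtime, and the sketch assumes that every key other than `model` is a constructor argument:

```python
import importlib
from typing import Any


def resolve(dotted_path: str) -> Any:
    """Import and return the object named by a path such as 'pkg.module.Name'."""
    module_name, _, attribute = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def build_model(spec: dict) -> Any:
    """Instantiate a recipe entry: 'model' names the class, all other keys
    are passed to its constructor as keyword arguments (an assumption)."""
    options = {key: value for key, value in spec.items() if key != "model"}
    return resolve(spec["model"])(**options)
```

With scikit-learn installed, `build_model({"model": "sklearn.ensemble.RandomForestRegressor", "n_estimators": 300})` would give a configured estimator. The `mixed_effects_random_forest` entry would need separate handling, since `fixed_effects`, `random_effects`, and `clusters` describe how to call `fit` rather than how to construct the model.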

Alternatively, we could consider using pycaret:

experiment:
  type: regression  # --> pycaret.regression.RegressionExperiment
  setup:
    # data: None  # All of the above
    target: "first_leaves_doy"
    train_size: 0.75
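    preprocess: false  # caution: pycaret documents that no built-in transformations are applied when this is false, so normalize below may have no effect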
    normalize: true
    normalize_method: zscore  # i.e. default
    fold_strategy: kfold  # i.e. default
    fold: 10
    fold_shuffle: true
    session_id: 123  # control randomness for reproducibility

compare_models:
  include:
    - 'lr'  # linear regression
    - 'rf'  # random forest regressor
    - 'xgboost'  # Extreme gradient boosting (bonus)
    - merf.MERF  # Must be instantiated before passing to pycaret; how to specify args?
    - interpret.glassbox.ExplainableBoostingRegressor  # Must be instantiated before passing to pycaret
  fit_kwargs:
    merf.MERF:   # This won't work out of the box
      fixed_effects: null  # "the rest" after removing cluster, random effects, and target columns
      random_effects: ['tmax_365','tmin_365', 'prcp_365', 'srad_365', 'swe_365']
      clusters: "site_id"
  cross_validation: false  # evaluate metrics on holdout set for now
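
One possible way to handle the "must be instantiated" entries in the `include` list above: keep pycaret's short model IDs as strings and instantiate anything that looks like a dotted class path. This is only a sketch. `prepare_include` and its `options` mapping are hypothetical, and treating "contains a dot" as "is a class path" is an assumption:

```python
from importlib import import_module
from typing import Any, Dict, List, Optional


def prepare_include(
    entries: List[str],
    options: Optional[Dict[str, Dict[str, Any]]] = None,
) -> List[Any]:
    """Return a list suitable for an `include` argument: short IDs stay
    strings, dotted class paths are imported and instantiated."""
    options = options or {}
    prepared: List[Any] = []
    for entry in entries:
        if "." not in entry:
            prepared.append(entry)  # short ID such as 'lr' or 'rf'
            continue
        module_name, _, class_name = entry.rpartition(".")
        cls = getattr(import_module(module_name), class_name)
        # constructor arguments are looked up by the entry's dotted path
        prepared.append(cls(**options.get(entry, {})))
    return prepared
```

The result could then be passed as `compare_models(include=prepare_include([...]))`. This only covers construction; MERF's different `fit` signature would still need a wrapper before pycaret could train it.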

A few notes:

MERF lets you specify the fixed_effects_model, so we might be able to plug in EBM that way.
MERF's fit signature differs slightly from sklearn's: it requires the fixed-effect, random-effect, and cluster columns to be passed separately.
MERF and EBM look (relatively) easy to integrate with pycaret, but we need to find a way to instantiate the models before passing them in.
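
To illustrate the point about MERF's fit arguments, here is a rough sketch of how the `fixed_effects`, `random_effects`, and `clusters` recipe options could be mapped onto column groups, including the "the rest" default for fixed effects. The function name is hypothetical, and `latitude` and `longitude` are made-up example columns:

```python
from typing import Dict, List, Optional, Sequence, Union


def split_merf_columns(
    columns: Sequence[str],
    target: str,
    clusters: str,
    random_effects: Sequence[str],
    fixed_effects: Optional[Sequence[str]] = None,
) -> Dict[str, Union[str, List[str]]]:
    """Map recipe options onto the column groups needed for a MERF fit."""
    reserved = {target, clusters, *random_effects}
    missing = sorted(reserved - set(columns))
    if missing:
        raise KeyError(f"columns not found in data: {missing}")
    if fixed_effects is None:
        # "the rest": everything that is not target, cluster, or random effect
        fixed_effects = [c for c in columns if c not in reserved]
    return {
        "X": list(fixed_effects),
        "Z": list(random_effects),
        "clusters": clusters,
        "y": target,
    }


random_effects = ["tmax_365", "tmin_365", "prcp_365", "srad_365", "swe_365"]
# latitude and longitude are hypothetical extra predictors
columns = ["site_id", "first_leaves_doy", *random_effects, "latitude", "longitude"]
groups = split_merf_columns(
    columns,
    target="first_leaves_doy",
    clusters="site_id",
    random_effects=random_effects,
)
print(groups["X"])
```

With a pandas DataFrame `df`, the call would then look roughly like `MERF().fit(df[groups["X"]], df[groups["Z"]], df[groups["clusters"]], df[groups["y"]])`, assuming merf's `fit(X, Z, clusters, y)` argument order.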

Define modelling part of recipe #135