phenology / springtime

Spatiotemporal phenology research with interpretable models
https://springtime.readthedocs.io
Apache License 2.0

Define modelling part of recipe #135

Closed Peter9192 closed 1 year ago

Peter9192 commented 1 year ago

Workflow extracted from https://github.com/phenology/springtime/blob/91ba9abb5faf9fcc763fb37385fe96007d104426/docs/notebooks/mk_modelling_npn.ipynb and very naively ported to a recipe-like format:

experiment:
  target: first_leaves_doy
  folds:
    # choose one of https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
    default/simple:
      method: sklearn.model_selection.train_test_split
      test_size: 0.33
      random_state: 42

  transform:
    scale:
      method: sklearn.preprocessing.StandardScaler

  models:
    simple_linear:
      model: sklearn.linear_model.LinearRegression

    random_forest:
      model: sklearn.ensemble.RandomForestRegressor
      n_estimators: 300

    mixed_effects_random_forest:
      model: merf.MERF
      fixed_effects: null  # "the rest" after removing cluster, random effects, and target columns
      random_effects: ['tmax_365','tmin_365', 'prcp_365', 'srad_365', 'swe_365']
      clusters: "site_id"

    # explainable_boosting_machine:
    #   model: interpret.glassbox.ExplainableBoostingRegressor
    #   interactions: 0

    # mixed_effects_ebm:
    #   model: merf.MERF
    #   fixed_effects_model:
    #     class: interpret.glassbox.ExplainableBoostingRegressor
    #     options:
    #       interactions: 0

  scores:
    mean_absolute_error: sklearn.metrics.mean_absolute_error
    mean_squared_error: sklearn.metrics.mean_squared_error
    r2: sklearn.metrics.r2_score

  visualize:
    plot: standardplot
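
As a rough illustration of how the dotted paths in the recipe above might be turned into objects, here is a minimal sketch. The helper names `resolve` and `build_model` are hypothetical and not part of springtime. It assumes every key other than `model` is a constructor argument, which holds for the `random_forest` entry but not for `mixed_effects_random_forest`, whose `fixed_effects`, `random_effects`, and `clusters` keys would need separate handling.

```python
from importlib import import_module


def resolve(dotted_path):
    """Import and return the object named by a dotted path.

    Example path from the recipe: "sklearn.linear_model.LinearRegression".
    """
    module_name, sep, attr = dotted_path.rpartition(".")
    if not sep:
        raise ValueError(f"expected 'module.attr', got {dotted_path!r}")
    return getattr(import_module(module_name), attr)


def build_model(spec):
    """Instantiate one entry of the `models` section (hypothetical helper).

    Assumption: every key except `model` is passed to the constructor.
    """
    options = dict(spec)  # copy so the recipe dict is left untouched
    cls = resolve(options.pop("model"))
    return cls(**options)
```

With scikit-learn installed, `build_model({"model": "sklearn.ensemble.RandomForestRegressor", "n_estimators": 300})` would return an unfitted regressor. The same `resolve` call could look up the functions listed under `scores`.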

Alternatively, we could consider using pycaret:

experiment:
  type: regression  # --> pycaret.regression.RegressionExperiment
  setup:
    # data: None  # All of the above
    target: "first_leaves_doy"
    train_size: 0.75
    preprocess: false
    normalize: true
    normalize_method: zscore  # i.e. default
    fold_strategy: kfold  # i.e. default
    fold: 10
    fold_shuffle: true
    session_id: 123  # control randomness for reproducibility

compare_models:
  include:
    - 'lr'  # linear regression
    - 'rf'  # random forest regressor
    - 'xgboost'  # Extreme gradient boosting (bonus)
    - merf.MERF  # Must be instantiated before passing to pycaret; how to specify args?
    - interpret.glassbox.ExplainableBoostingRegressor  # Must be instantiated before passing to pycaret
  fit_kwargs:
    merf.MERF:   # This won't work out of the box
      fixed_effects: null  # "the rest" after removing cluster, random effects, and target columns
      random_effects: ['tmax_365','tmin_365', 'prcp_365', 'srad_365', 'swe_365']
      clusters: "site_id"
  cross_validation: False  # evaluate metrics on holdout set for now
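
One possible answer to the "how to specify args?" question above, sketched under assumptions: pycaret's `compare_models` accepts an `include` list that mixes model ID strings with untrained estimator objects, so dotted paths could be instantiated first and short IDs passed through unchanged. The `build_include` helper and the `init_kwargs` mapping are hypothetical, not part of pycaret or springtime. This does not make MERF work out of the box, since its cluster and random-effects inputs still have to reach `fit` somehow.

```python
from importlib import import_module


def resolve(dotted_path):
    """Import and return the object named by a dotted path."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(import_module(module_name), attr)


def build_include(entries, init_kwargs=None):
    """Turn recipe `include` entries into a list suitable for pycaret.

    Entries without a dot (e.g. 'lr', 'rf') are treated as pycaret model IDs
    and kept as strings. Entries with a dot are treated as dotted class paths
    and instantiated, using constructor arguments from the hypothetical
    `init_kwargs` mapping (dotted path -> dict of arguments).
    """
    init_kwargs = init_kwargs or {}
    include = []
    for entry in entries:
        if "." in entry:
            cls = resolve(entry)
            include.append(cls(**init_kwargs.get(entry, {})))
        else:
            include.append(entry)
    return include
```

The result could then be passed as `compare_models(include=build_include(...))`, provided each instantiated object follows the scikit-learn estimator interface.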

A few notes:

Peter9192 commented 1 year ago

Mostly done; see https://github.com/phenology/springtime/blob/9b43c69a7420abf538b35b78008e6988bb46ea8b/src/springtime/recipes/model_comparison_usecase.yaml#L45-L88