msmbuilder / osprey

🦅 Hyperparameter optimization for machine learning pipelines 🦅
http://msmbuilder.org/osprey
Apache License 2.0

Bug with FeatureSelector #221

Closed RobertArbon closed 7 years ago

RobertArbon commented 7 years ago

I'm trying to get the FeatureSelector to work in an Osprey pipeline, but it's not playing ball. Config file:

# osprey configuration file.
#---------------------------
# usage:
#  osprey worker config.yaml

estimator:
    # The model/estimator to be fit.

    # pick one of these ways to specify
      # path to a file
    # pickle: my-model.pkl
      # importable python class/instances
    # entry_point: msmbuilder.decomposition.tICA
    eval: |
        Pipeline([('featurizer', FeatureSelector(features=[('backbone_dihed', DihedralFeaturizer(types=['phi', 'psi'])),
                                                        ('residues_dihed', DihedralFeaturizer(types=['chi1', 'chi2', 'chi3', 'chi4'])),
                                                        ])),
                   ('variance_cut', VarianceThreshold()),
                   ('scaling', RobustScaler()),
                   ('tica', tICA(kinetic_mapping=True)),
                   ('cluster', MiniBatchKMeans()),
                   ('msm', MarkovStateModel(lag_time=80, verbose=False))])

    # for eval, a python package containing the estimator definitions
    eval_scope: msmbuilder

strategy:
    # the search section specifies the space of hyperparameters to search over
    # and the strategy for doing so

    # hyperopt's tree of parzen estimators http://hyperopt.github.io/hyperopt/
    # and random search are currently supported.
    name: random  # or gp, hyperopt_tpe
    #params: {}

search_space:
  # the search space is specified by listing the variables you want to
  # optimize over and their bounds for float and int typed variables,
  # or the possible choices for enumeration-typed variables.
  featurizer__which_feat:
    choices:
      - ['backbone_dihed']
      - ['residues_dihed']
      - ['backbone_dihed', 'residues_dihed']
    type: enum

  cluster__n_clusters:
    min: 100
    max: 500
    type: int       # from 100 to 500 (with inclusive endpoints)

  tica__lag_time:
    min: 100
    max: 400
    type: int

cv:
    name: shufflesplit
    params:
      n_iter: 5
      test_size: 0.5
      random_state: 42

dataset_loader:
  # specification of the dataset on which to train the models.
  name: mdtraj
  params:
    trajectories: ./Data/trajs/trajectory-*.xtc
    topology: ./Data/trajs/fs-peptide.pdb
    stride: 10

trials:
  # path to a database in which the results of each hyperparameter fit
  # are stored. Any SQL database is supported, but we recommend using
  # SQLite, which is simple and stores the results in a file on disk.
  # the string format for connecting to other databases is described here:
  # http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#database-urls
  uri: sqlite:///osprey_test.db
  # if you want to save n > 1 projects in the same DB file, you can set
  # `project_name` to distinguish them:
  # project_name: name

Error message:

sqlalchemy.exc.StatementError: (builtins.TypeError) DihedralFeaturizer(sincos=True, types=['phi', 'psi']) is not JSON serializable [SQL: 'INSERT INTO trials_v3 (project_name, status, parameters, mean_test_score, mean_train_score, train_scores, test_scores, n_train_samples, n_test_samples, started, completed, elapsed, host, user, traceback, config_sha1) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: [{'mean_train_score': None, 'elapsed': None, 'traceback': None, 'project_name': 'default', 'status': 'PENDING', 'n_train_samples': None, 'train_scores': None, 'config_sha1': '6e2665991eb5f40b9bc90cb9667103b5525e7800', 'user': 'robert_arbon', 'host': 'pub-211-66.rn-users.bris.ac.uk', 'parameters': {'tica__n_components': None, 'featurizer__features': OrderedDict([('backbone_dihed', DihedralFeaturizer(sincos=True, types=['phi', 'psi'])), ('residue ... (764 characters truncated) ... e_range': (25.0, 75.0), 'msm__lag_time': 80, 'cluster__n_init': 3, 'msm__sliding_window': True, 'msm__prior_counts': 0, 'msm__reversible_type': 'mle'}, 'test_scores': None, 'started': datetime.datetime(2017, 6, 2, 11, 47, 42, 979297), 'completed': None, 'mean_test_score': None, 'n_test_samples': None}]]

Is there anything I'm doing wrong? Any ideas what's going on?

Many thanks

Rob

RobertArbon commented 7 years ago

The problem is in osprey/execute_worker.py with:

    def build_full_params(xparams):
        # make sure we get _all_ the parameters, including defaults on the
        # estimator class, to save in the database
        params = clone(estimator).set_params(**xparams).get_params()
        params = dict((k, v) for k, v in iteritems(params)
                      if not isinstance(v, BaseEstimator) and
                      (k != 'steps'))

        return params
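For context (a minimal illustration with plain sklearn components, not Osprey code): get_params(deep=True) on a Pipeline flattens every step's parameters into step__param keys, and the step estimators themselves also appear as values, which is what the isinstance(v, BaseEstimator) filter above is there to drop.

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import MiniBatchKMeans

    pipe = Pipeline([('scaling', StandardScaler()),
                     ('cluster', MiniBatchKMeans())])

    params = pipe.get_params()  # deep=True is the default
    # The step objects themselves appear as values ('scaling', 'cluster'),
    # alongside flattened entries such as 'cluster__n_clusters'.
    print('scaling' in params, 'cluster__n_clusters' in params)  # True True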

The problem is that the features parameter of the FeatureSelector is an OrderedDict of featurizer objects, which, it would seem, is not JSON serializable (I'm not an expert on this). After running the above function on my pipeline object I get:

scaling__copy True
cluster__batch_size 100
tica__lag_time 1
tica__n_components None
cluster__n_init 3
cluster__tol 0.0
tica__shrinkage None
cluster__reassignment_ratio 0.01
msm__n_timescales None
cluster__init k-means++
scaling__with_scaling True
scaling__with_centering True
features__features OrderedDict([('backbone_dihed', DihedralFeaturizer(sincos=True, types=['phi', 'psi'])), ('residues_dihed', DihedralFeaturizer(sincos=True, types=['chi1', 'chi2', 'chi3', 'chi4'])), ('contacts', ContactFeaturizer(contacts='all', ignore_nonprotein=True,
         scheme='closest-heavy'))])
cluster__verbose 0
cluster__max_iter 100
msm__sliding_window True
cluster__n_clusters 8
msm__reversible_type mle
tica__kinetic_mapping True
scaling__quantile_range (25.0, 75.0)
features__which_feat ['backbone_dihed', 'residues_dihed', 'contacts']
cluster__compute_labels True
cluster__init_size None
msm__lag_time 80
cluster__max_no_improvement 10
cluster__random_state None
msm__prior_counts 0
msm__verbose False
msm__ergodic_cutoff on
variance_cut__threshold 0.0

It seems that, for Osprey's purposes, the which_feat parameter is all that is needed.
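To illustrate the difference (a self-contained check using a stand-in class, since the real featurizers live in msmbuilder):

    import json
    from collections import OrderedDict

    class DummyFeaturizer(object):
        """Stand-in for DihedralFeaturizer; any plain object behaves the same."""
        pass

    # A list of strings, like which_feat, serializes fine.
    json.dumps(['backbone_dihed', 'residues_dihed'])

    # An OrderedDict of estimator objects, like features, does not.
    try:
        json.dumps(OrderedDict([('backbone_dihed', DummyFeaturizer())]))
    except TypeError as err:
        print(err)  # ... is not JSON serializable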

I'm happy to fix and submit a pull request. A rather hacky fix would be:

    # also exclude OrderedDict values (needs: from collections import OrderedDict)
    params = dict((k, v) for k, v in iteritems(params)
                  if not isinstance(v, (BaseEstimator, OrderedDict)) and
                  k != 'steps')

Something that tests whether the parameter is JSON serializable might be preferable, but aside from simply trying to dump it (try: json.dumps(v)) I'm not sure what would be best.
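A minimal sketch of what such a utility might look like (the function name here is hypothetical, not necessarily what the fix would use):

    import json

    def is_json_serializable(value):
        """Return True if value survives a round trip through json.dumps."""
        try:
            json.dumps(value)
            return True
        except (TypeError, ValueError):
            return False

The filter in build_full_params could then use it directly:

    params = dict((k, v) for k, v in iteritems(params)
                  if is_json_serializable(v) and k != 'steps')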

cxhernandez commented 7 years ago

Thanks for the report @RobertArbon! I like the idea of adding a test for whether the object is serializable or not. Apparently, the try/except method is the best way to go about it: https://stackoverflow.com/a/42033176

Would you be willing to submit a PR to add this function to utils.py?

RobertArbon commented 7 years ago

Yeah, sure, I'll get on this tomorrow.

cxhernandez commented 7 years ago

done in #223