msmbuilder / osprey

🦅Hyperparameter optimization for machine learning pipelines 🦅
http://msmbuilder.org/osprey
Apache License 2.0

osprey plot on the cli dumps entire database, ignores project_name variable #238

Open nhstanley opened 7 years ago

nhstanley commented 7 years ago

Hard to say whether this falls under "bug" or "would be nice to have", but running osprey plot config.yaml from the command line dumps the entire database and ignores any project_name you may have set in your config, such as:

trials:
  uri: sqlite:///osprey-trials.db
  project_name: experiment_group1

I'm feeding osprey a lot of complex featurizations that are not really amenable to pipelining, so that means each one has to have its own config.yaml (unless there's a better way?). I suppose I could send each featurization to its own database instead but then that defeats the purpose of having a project_name option.

Alternatively, the plot command could just take the database as the input file and then split things out on a per-project basis. Not sure which is better.
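
A rough sketch of the per-project split described above, reading the SQLite file directly with Python's standard library. The table name `trials` and the columns `project_name` and `mean_test_score` are assumptions and have not been checked against the actual osprey schema; the in-memory database only stands in for a real osprey-trials.db.

```python
import sqlite3
from collections import defaultdict


def scores_by_project(conn, table="trials"):
    """Group trial scores by project name.

    ASSUMPTION: the table is called `trials` and has `project_name`
    and `mean_test_score` columns; verify against the real database.
    """
    query = (
        f"SELECT project_name, mean_test_score FROM {table} ORDER BY rowid"
    )
    grouped = defaultdict(list)
    for project, score in conn.execute(query):
        grouped[project].append(score)
    return dict(grouped)


# Stand-in database so the sketch runs without a real trials file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (project_name TEXT, mean_test_score REAL)")
conn.executemany(
    "INSERT INTO trials VALUES (?, ?)",
    [
        ("experiment_group1", 0.71),
        ("experiment_group2", 0.64),
        ("experiment_group1", 0.75),
    ],
)
result = scores_by_project(conn)
```

Each key in `result` could then be plotted separately, which is roughly what a project-aware plot command would have to do internally.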

If I have time I can look into making this happen, though I'm swamped with my projects right now and I don't know the osprey code very well. Just wanted to put this out there.

jeiros commented 6 years ago

I've run into the same problem when trying out different clustering algorithms in Pipelines. Is there a better solution than having each one on its own config file? The dump command also does this, dumping the whole database and not just the pertinent project_name as specified in the config file.

I'm not sure how to separate the results from each run, since everything is mixed together in the JSON file.
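
One possible way to untangle a mixed dump after the fact is to group the records by project in a few lines of Python. This is only a sketch: it assumes the JSON dump is a list of objects and that each object carries a `project_name` field, neither of which is confirmed in this thread. The sample records are made up for illustration.

```python
import json
from collections import defaultdict


def split_by_project(records, key="project_name"):
    """Group dumped trial records by their project name.

    ASSUMPTION: each record is a dict with a `project_name` entry;
    records without one are collected under "<unknown>".
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get(key, "<unknown>")].append(record)
    return dict(grouped)


# Hypothetical dump contents, standing in for the real JSON file.
dumped = json.loads("""
[
  {"project_name": "kmeans", "mean_test_score": 0.62},
  {"project_name": "kcenters", "mean_test_score": 0.58},
  {"project_name": "kmeans", "mean_test_score": 0.70}
]
""")
by_project = split_by_project(dumped)
```

With a real dump, `json.load` on the opened file would replace the inline string.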

Edit: Thinking about this, my contribution from a while ago, where the hyperparameters went into separate columns instead of a single dedicated parameters column, might complicate things on this end.

brookehus commented 6 years ago

You can also use Osprey to dump the results to a csv file (osprey dump -o csv > filename.csv), which is jankier, but you should be able to look through hyperparams and stuff by concatenating the csv files (add a \n between them) and inserting a column for the run or clusterer at the beginning of each line. You could also contribute a ClusterSelector to MSMBuilder in the spirit of the FeatureSelector, which was designed for the same purpose. See the FeatureSelector code here and an example using a pipeline here.
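
The concatenate-and-label idea could look roughly like this with Python's csv module. It is a sketch under assumptions: the column names (`n_clusters`, `metric`, `mean_test_score`) and the run labels are hypothetical, and the inline strings stand in for the files produced by osprey dump. Because different clusterers can expose different hyperparameter columns, the sketch takes the union of all columns and leaves missing cells empty.

```python
import csv
import io


def concat_with_run_label(named_csvs, label_column="run"):
    """Merge CSV texts, prefixing every row with a run label.

    `named_csvs` is a list of (label, csv_text) pairs. Columns missing
    from a given file are written as empty cells.
    """
    rows, fields = [], [label_column]
    for label, text in named_csvs:
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({label_column: label, **row})
            for name in row:
                if name not in fields:
                    fields.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical dumps from two separate runs.
kmeans_csv = "n_clusters,mean_test_score\n50,0.62\n100,0.70\n"
kcenters_csv = "n_clusters,metric,mean_test_score\n50,euclidean,0.58\n"
merged = concat_with_run_label(
    [("kmeans", kmeans_csv), ("kcenters", kcenters_csv)]
)
```

For real files, reading each one with `open(path, newline="").read()` and passing the text in would work the same way.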

jeiros commented 6 years ago

Thanks for the help @brookehus, the feature selector looks very useful!