titu1994 / pyshac

A Python library for the Sequential Halving and Classification algorithm
http://titu1994.github.io/pyshac/
MIT License

existing dataset of hyperparameters + scores #2

Closed ahundt closed 6 years ago

ahundt commented 6 years ago

Hey, I recently ran a search with Bayesian optimization, so I have a few thousand hyperparameter sets with scores. I'd be interested in feeding that data in and then making a few new predictions to see how it does. Is that possible?

titu1994 commented 6 years ago

Hmm, that's an interesting question. Generally, I use pyshac to sequentially search smaller search spaces.

The dataset format is very simple, so one can manually copy-paste the files into Excel and export them as a CSV to get the job done.
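For illustration, assembling such a CSV programmatically is just as easy as doing it in Excel. The column names below are made up for the example and are not pyshac's actual schema:

```python
import csv
import io

# Hypothetical dump of prior trials: one row per evaluated hyperparameter set.
# Column names are illustrative only, not pyshac's real dataset format.
rows = [
    {"learning_rate": 0.1, "num_layers": 2, "score": 0.81},
    {"learning_rate": 0.01, "num_layers": 4, "score": 0.93},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["learning_rate", "num_layers", "score"])
writer.writeheader()
writer.writerows(rows)

# Reading it back is equally simple, which is what makes manual editing
# (Excel, a text editor) a viable workflow.
records = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(records[0]["score"])  # → 0.81 (as a string, since csv reads text)
```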

The thing is, right now the library completely ignores previously collected data and simply moves on to the next data point if you continue training.

The sequential nature of training comes from the fact that every subsequent sampling stage has a much harder task: finding samples that satisfy the checks of all previous classifiers. Historic data may not follow this assumption closely.
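The cascade described above can be sketched in plain Python. Everything here is an illustrative stand-in, not pyshac internals: the toy objective, the interval "classifiers" (pyshac trains real classifiers on each batch), and the batch sizes are all assumptions made for the example.

```python
import random

random.seed(0)

def objective(x):
    # Toy objective: lower is better, minimized near x = 0.25.
    return (x - 0.25) ** 2

def train_threshold_classifier(samples, scores):
    # Stand-in for a real classifier: mark the better half of the batch as
    # "passing" and accept anything inside the range that half spans.
    cutoff = sorted(scores)[len(scores) // 2]
    passing = [x for x, s in zip(samples, scores) if s <= cutoff]
    lo, hi = min(passing), max(passing)
    return lambda x: lo <= x <= hi

def sample_passing_all(classifiers, max_tries=100000):
    # Each stage must find samples that satisfy EVERY earlier classifier,
    # which is why later stages face a harder sampling problem.
    for _ in range(max_tries):
        x = random.random()
        if all(clf(x) for clf in classifiers):
            return x
    raise RuntimeError("could not satisfy all classifiers")

classifiers = []
for stage in range(4):
    batch = [sample_passing_all(classifiers) for _ in range(20)]
    scores = [objective(x) for x in batch]
    classifiers.append(train_threshold_classifier(batch, scores))

best = sample_passing_all(classifiers)
print(round(best, 2))  # a sample that survives every stage's check
```

Feeding arbitrary historic data into this loop is awkward precisely because those samples were never filtered through the earlier stages' classifiers.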

Still, I'll look into how to incorporate older data into the classifiers. If you have any idea, do tell! I can probably implement it in a week if it's simple enough.

ahundt commented 6 years ago

Ah, I see. I read a little more about the design, and since each classifier decides how to cut the search space in half, many evaluated samples are needed just to produce a single trained classifier. With that design, figuring out how to use existing data is definitely tricky... Hmm, the underlying paper describes a curious design.

titu1994 commented 6 years ago

SHAC is quite efficient in the sense that you need a batch of samples which can be evaluated in parallel, and this batch is used to train classifiers for the next stage.

I gave your requirement some thought these past few days. From my point of view, given your Bayesian optimization dataset, as long as it records the hyperparameters and the corresponding evaluation metric, I could sort the dataset by that criterion and then feed batched slices of the results to successive classifiers.

In theory, this should provide a similar result to training on samples which pass through each classifier.
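A rough sketch of that sort-and-slice idea, again with illustrative stand-ins (the synthetic "historic" dataset, the assumption that lower scores are better, and the interval "classifiers" are all made up for the example):

```python
import random

random.seed(1)

# Hypothetical historic dataset from a prior Bayesian-optimization run:
# (hyperparameter, score) pairs, assuming lower scores are better.
history = [(x, (x - 0.25) ** 2) for x in (random.random() for _ in range(40))]

# Sort by the evaluation criterion, worst first, so that later slices hold
# progressively better samples -- mimicking samples that passed every
# earlier classifier in a normal sequential run.
history.sort(key=lambda pair: pair[1], reverse=True)

num_stages = 4
batch_size = len(history) // num_stages
batches = [history[i * batch_size:(i + 1) * batch_size]
           for i in range(num_stages)]

# Stand-in "training": each stage's classifier would be fit on its slice;
# here we just record the interval each slice spans.
stage_rules = [(min(x for x, _ in b), max(x for x, _ in b)) for b in batches]

# Each successive batch is strictly better than the one before it.
best_per_batch = [min(s for _, s in b) for b in batches]
print(best_per_batch == sorted(best_per_batch, reverse=True))  # True
```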

If that makes sense, I can start implementing a prototype. What's your opinion?

ahundt commented 6 years ago

Well, I don't want to take up your time unreasonably, since the search can always be run from scratch. I was only curious whether it might be something simple, but it sounds like a more complex architectural change for what is most likely a super niche use case that can be overcome by simply running again. Thanks for giving it some thought, though. 👍

titu1994 commented 6 years ago

I wouldn't exactly call this super niche. A lot of researchers run hyperparameter optimization on their models, and in doing so build up datasets of valuable information on which SHAC could possibly find somewhat better settings, leading to better results.

All in all, I won't prioritize this since I have a few other projects to focus on as well, but it is definitely a useful extension: making use of external data that may be valuable for further parameter tuning.

titu1994 commented 6 years ago

@ahundt I was finally able to get external datasets working with PySHAC. Seems to produce fast, useful results without too much effort.

The changelog is in the most recent tag - https://github.com/titu1994/pyshac/releases/tag/v0.3.0

For the guide on how to set up external data training - http://titu1994.github.io/pyshac/external-dataset-training/

On a side note, I really need to delete that rubbish readthedocs page. It seems outdated and poor in quality.

ahundt commented 6 years ago

oh cool! sounds awesome