titu1994 / pyshac

A Python library for the Sequential Halving and Classification algorithm
http://titu1994.github.io/pyshac/
MIT License

Split dealing with CSV from training and dataset cleaning in fit_dataset #9

Open KOLANICH opened 5 years ago

KOLANICH commented 5 years ago

Hi again.

I'm implementing resumption and metaoptimization in UniOpt, so for the pyshac backend I need an API to inject points into the optimizer from memory: a) bulk injection (needed for resumption; may be less efficient, since it is done rarely); b) individual point injection (needed for metaoptimization and should be efficient, i.e. incremental learning).

I guess fit_dataset does this, but:

  1. it deals with CSV files, so I cannot use it as it is;
  2. it does too much work, so I don't want to recreate it.

It would be better if it worked not only with arrays, but also with iterators.
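To make the request concrete, a point-injection helper that accepts any iterable of (hyperparams, loss) pairs might look like the sketch below. All names here (`fit_points`) are illustrative, not part of the current pyshac API:

```python
from typing import Any, Dict, Iterable, Tuple


def fit_points(points: Iterable[Tuple[Dict[str, Any], float]]) -> int:
    """Hypothetical bulk-injection helper: consume (hyperparams, loss)
    pairs from any iterable (list, generator, CSV reader, ...).

    A real engine would update its classifier stack here; this sketch
    only validates each point's structure and counts them.
    """
    count = 0
    for params, loss in points:
        assert isinstance(params, dict)
        float(loss)  # must be coercible to a scalar loss
        count += 1
    return count


# Works with an in-memory list...
n = fit_points([({"x": 1.0}, 0.5), ({"x": 2.0}, 0.3)])
# ...or with a generator that never materializes the full dataset.
m = fit_points(({"x": float(i)}, i * 0.1) for i in range(100))
```

Accepting a generic iterable means the same entry point covers both the rare bulk case (resumption) and the streaming case (metaoptimization) without forcing a round-trip through CSV files.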

titu1994 commented 5 years ago

I'm a bit confused as to what "API to inject points into the optimizer from memory" means.

Do you wish to add samples to the engine without going through the fit dataset procedure?

KOLANICH commented 5 years ago

I want to add points (hyperparams, loss) into the engine while it is operating.

Most optimizers work in a "predict n points with the model -> evaluate the function at those points -> add the points and function values to the model, getting the model ready for the next iteration" loop.

I want to add points to the model directly, without the first two steps. This is meant to enable resumption and metaoptimization.
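The loop above, and the shortcut being requested, can be sketched with an ask/tell-style interface (a common pattern in optimization libraries; `ToyOptimizer`, `ask`, and `tell` are hypothetical names, not pyshac's API):

```python
class ToyOptimizer:
    """Minimal ask/tell sketch of the predict -> evaluate -> update loop."""

    def __init__(self):
        self.history = []  # list of (params, loss) pairs seen so far

    def ask(self):
        # Step 1: predict the next candidate point (trivially fixed here).
        return {"x": 0.0}

    def tell(self, params, loss):
        # Step 3: add a point to the model, regardless of where it came
        # from. This is the direct-injection entry point being requested.
        self.history.append((params, loss))


opt = ToyOptimizer()

# Normal loop: ask -> evaluate -> tell.
p = opt.ask()
opt.tell(p, p["x"] ** 2)

# Direct injection, skipping the first two steps, e.g. a point restored
# from a checkpoint or produced by an outer metaoptimizer.
opt.tell({"x": 3.0}, 9.0)
```

With `tell` exposed on its own, both resumption (replaying stored points) and metaoptimization (feeding points produced elsewhere) reduce to calling it without `ask`.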

titu1994 commented 5 years ago

So your request is to decouple the dataset loading, reading, and shuffling inside fit_dataset into two distinct steps: dataset/generator manipulation and engine training, so that you could in theory feed in either numpy arrays or a dataset path and both would be handled in the same manner.
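A minimal sketch of that decoupling, assuming a hypothetical split into `iter_points` (normalize any source into a stream of points) and `train_engine` (consume the stream) — neither function exists in pyshac; CSV columns and the `loss` column name are illustrative:

```python
import csv
import io
from typing import Iterable, Iterator, Tuple, Union

Point = Tuple[dict, float]


def iter_points(
    source: Union[io.TextIOBase, Iterable[Point]]
) -> Iterator[Point]:
    """Step 1: normalize a source into a stream of (params, loss) pairs.

    Accepts either an open CSV file-like object (with a 'loss' column)
    or an in-memory iterable of points.
    """
    if isinstance(source, io.TextIOBase):
        for row in csv.DictReader(source):
            loss = float(row.pop("loss"))
            yield ({k: float(v) for k, v in row.items()}, loss)
    else:
        yield from source


def train_engine(points: Iterable[Point]) -> int:
    """Step 2: consume the stream; stands in for the training pass."""
    return sum(1 for _ in points)


# Both source types go through the same two-step pipeline.
csv_data = io.StringIO("x,loss\n1.0,0.5\n2.0,0.3\n")
from_csv = train_engine(iter_points(csv_data))
from_memory = train_engine(iter_points([({"x": 3.0}, 0.1)]))
```

Because step 2 only ever sees an iterator, in-memory arrays, generators, and file-backed datasets all train through the same code path.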

Sounds feasible, but I won't be able to work on it for a few weeks.