Save hyperparameters and results, print best-so-far when tuning the model

tund / male

MAchine LEarning (MALE)

GNU General Public License v3.0


Save hyperparameters and results, print best-so-far when tuning the model #24

Open tund opened 7 years ago

tund commented 7 years ago

Tuning a model's hyperparameters takes a long time, and the final results are sometimes too long to wait for. We would like three things:

Save hyperparameter settings and their results.
Display the results as the hyperparameters vary; this would help us study the model's behavior.
Print, and save to file, the best result so far while tuning.

These should be considered urgent features.

khanhndk commented 7 years ago

Here is my plan. Please give me your feedback:

Check whether `GridSearchCV` has any callbacks. We could use these callbacks to track progress and get what we want (print the best so far, save results).
Edit the framework: the fit function should return a training_report object and the predict function should return a predict_report object. From these objects, we can save results for further investigation.
Build an app to read the report file and visualize it. This is very difficult to build from scratch, so for now I suggest we use Excel to visualize it. Pivot Table is a good feature for that.
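
If `GridSearchCV` turns out not to expose a suitable callback, one fallback is to run the grid loop ourselves. Below is a minimal sketch of that idea, using only the standard library. The names `grid_search_with_log`, `evaluate`, and `log_path` are made up for illustration and are not part of MALE or sklearn; `evaluate` stands in for whatever trains the model and returns a score (for instance a wrapper around sklearn's `cross_val_score`). Each setting is appended to a CSV file as soon as it is scored, and the best result so far is printed whenever it improves.

```python
import csv
import itertools
import os


def grid_search_with_log(param_grid, evaluate, log_path):
    """Evaluate every combination in param_grid, logging each result to CSV.

    param_grid: dict mapping parameter name -> list of candidate values.
    evaluate:   hypothetical user-supplied callable, params dict -> score
                (higher is better).
    log_path:   CSV file that receives one row per evaluated setting.
    """
    names = sorted(param_grid)
    best_score, best_params = None, None
    write_header = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(names + ["score"])
        # Cartesian product of all candidate values, one setting at a time.
        for values in itertools.product(*(param_grid[n] for n in names)):
            params = dict(zip(names, values))
            score = evaluate(params)
            writer.writerow(list(values) + [score])
            f.flush()  # keep the file usable if the run is interrupted
            if best_score is None or score > best_score:
                best_score, best_params = score, params
                print("best so far: score=%.4f params=%s" % (best_score, best_params))
    return best_params, best_score
```

The resulting CSV has one column per hyperparameter plus a score column, so it can be opened directly in Excel and summarized with a Pivot Table as suggested above.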

khanhndk commented 7 years ago

One more thing: grid search in sklearn is hard-coded. We code it, run it, and wait for it... We cannot change anything until it stops. For example, we usually set n_jobs=-1 (meaning we use all CPUs). If others complain while it is running, all we can do is stop it and run it again with n_jobs=XX; we cannot change n_jobs while it is running. I think what we need is a central app that deploys tasks (a task = a tuple of parameters we want to grid search). We could even feed it another dataset before it has finished the previous one. The difference from the hard-coded approach is that we can reconfigure it whenever we want.

SeaOtter commented 7 years ago

@Khanh: "We can even feed another dataset when it does not finish the previous one. It differs from hard code is that we can config whenever we want." What do you mean here? Do you mean that the new dataset will be put in a queue?

khanhndk commented 7 years ago

Yes, it will be in a queue. For example, we have 3 servers running datasets A, B, and C respectively. Then we have a new dataset D. When we started the scripts on the 3 servers, we had not yet finished preprocessing it, or had not even found it, so we could not feed in the fourth dataset at that time. Or, we believe C will finish first, but in fact, after running for a few hours, we see that A has finished 80% of its tasks. If we had decided at the beginning that D would run after C, we would have been wrong.
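
The queue idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions: threads in one process stand in for the 3 servers, and `TaskDispatcher`, `run_task`, `submit`, and `wait` are hypothetical names, not existing MALE code. The point it shows is that a task (dataset plus a tuple of parameters) can be submitted at any time, and whichever worker frees up first takes it, so nobody has to decide in advance that D runs after C.

```python
import queue
import threading


class TaskDispatcher:
    """Hypothetical central dispatcher: workers pull tasks from a shared queue."""

    def __init__(self, run_task, n_workers):
        self._run_task = run_task  # callable: (dataset, params) -> result
        self._tasks = queue.Queue()
        self._lock = threading.Lock()
        self.results = []  # (dataset, params, result) tuples
        self._workers = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(n_workers)
        ]
        for worker in self._workers:
            worker.start()

    def submit(self, dataset, params):
        """Queue one task; safe to call while earlier tasks are still running."""
        self._tasks.put((dataset, params))

    def _work(self):
        while True:
            dataset, params = self._tasks.get()
            try:
                result = self._run_task(dataset, params)
            except Exception as exc:  # record the failure, keep the worker alive
                result = exc
            with self._lock:
                self.results.append((dataset, params, result))
            self._tasks.task_done()

    def wait(self):
        """Block until every task submitted so far has finished."""
        self._tasks.join()
```

In use, we would call `submit` for A, B, and C at the start, call `submit` again for D whenever it is ready, and call `wait` before reading `results`. A real deployment across servers would need a shared queue service instead of an in-process `queue.Queue`.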