Open tund opened 7 years ago
Here is my plan. Please give me your feedback:
The `fit` function should return a `training_report` object, and the `predict` function should return a `predict_report` object. We can save these objects and use them to investigate the results further.
One more thing: `GridSearch` in sklearn is a hard-coded technique. We write the code, run it, and wait; nothing can be changed until it stops. For example, we usually set n_jobs=-1 (meaning all CPUs are used). If others complain while the search is running, all we can do is stop it and run it again with n_jobs=XX. We cannot change n_jobs while it is running.
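To make the report idea from the start of this comment concrete, here is a minimal sketch. The class names `TrainingReport` and `PredictReport`, their fields, the `save` method, and the toy `MeanModel` are all assumptions made up for illustration; they are not an existing API in this project or in sklearn.

```python
import json
import time
from dataclasses import dataclass, field, asdict


@dataclass
class TrainingReport:
    """Hypothetical container for what fit() could return."""
    params: dict
    train_time_sec: float
    history: list = field(default_factory=list)  # e.g. per-step metrics

    def save(self, path):
        # Persist the report as JSON so it can be inspected later.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)


@dataclass
class PredictReport:
    """Hypothetical container for what predict() could return."""
    predictions: list
    predict_time_sec: float


class MeanModel:
    """Toy model (illustration only): predicts the mean of the training targets."""

    def __init__(self, **params):
        self.params = params
        self.mean_ = None

    def fit(self, X, y):
        start = time.perf_counter()
        self.mean_ = sum(y) / len(y)
        elapsed = time.perf_counter() - start
        return TrainingReport(params=self.params,
                              train_time_sec=elapsed,
                              history=[self.mean_])

    def predict(self, X):
        start = time.perf_counter()
        preds = [self.mean_ for _ in X]
        elapsed = time.perf_counter() - start
        return PredictReport(predictions=preds, predict_time_sec=elapsed)
```

A caller would then keep `report = model.fit(X, y)` and call `report.save("run_001.json")` (a hypothetical file name) to store the result for later investigation.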
I think what we need is a central app that dispatches tasks (a task = one tuple of parameters from the grid we want to search). We could even feed it another dataset before it has finished the previous one. The difference from the hard-coded approach is that we can reconfigure it whenever we want.
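As a rough sketch of "a task = a tuple of parameters", a grid can be expanded into individual tasks and placed on a queue that the central app owns. The helper name `make_tasks`, the example grid, and the dataset label are assumptions for illustration only.

```python
import itertools
import queue


def make_tasks(dataset_name, grid):
    """Expand a parameter grid into one task per parameter combination.

    Each task is a tuple: (dataset name, dict of parameter values).
    """
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        yield (dataset_name, dict(zip(names, values)))


# The central app would own this queue; workers pull tasks from it.
task_queue = queue.Queue()

# Hypothetical grid: 3 values of C times 2 values of gamma = 6 tasks.
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
for task in make_tasks("A", grid):
    task_queue.put(task)
```

Because each combination is its own task, how many run at once becomes a property of the workers rather than of the search, so it could be changed without restarting everything.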
@Khanh: "We can even feed another dataset when it does not finish the previous one. It differs from hard code is that we can config whenever we want." What do you mean here? Do you mean that the new dataset will wait in a queue?
Yes, it will be in a queue. For example, suppose we have 3 servers running datasets A, B, and C respectively, and then a new dataset D comes along. When we started the scripts on the 3 servers, D was not ready: we had not finished preprocessing it, or had not even found it yet, so we could not feed it in at that point. Or suppose we believe C will finish first, but after a few hours we see that A has already finished 80% of its tasks. If we had decided at the beginning that D would run after C, that plan would be wrong.
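The scenario above can be sketched with a shared queue from which idle workers pull the next task, so nobody has to decide in advance which server takes D. This is only an in-process illustration: threads stand in for the 3 servers, `run_task` is a placeholder, and a real deployment across machines would need a networked queue instead.

```python
import queue
import threading


def run_task(dataset, params):
    """Placeholder for training and scoring one parameter combination."""
    return (dataset, params["C"])


def worker(tasks, results, lock):
    """Pull tasks until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            tasks.task_done()
            break
        dataset, params = item
        outcome = run_task(dataset, params)
        with lock:
            results.append(outcome)
        tasks.task_done()


def run_demo(n_workers=3):
    tasks = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Datasets A, B, C are known when the workers start.
    for dataset in ("A", "B", "C"):
        for c in (0.1, 1.0, 10.0):
            tasks.put((dataset, {"C": c}))

    # Dataset D becomes ready later; it is simply appended to the queue
    # and picked up by whichever worker is free first.
    for c in (0.1, 1.0, 10.0):
        tasks.put(("D", {"C": c}))

    # One sentinel per worker, queued after all real tasks.
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return results
```

The order in which results arrive depends on thread scheduling, but every task for A, B, C, and D is processed exactly once.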
Tuning the hyperparameters of models takes a lot of time, and sometimes it takes too long to wait for the final results. We would like to have 3 things:
These should be considered urgent features.