mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Continue training after initial training gracefully ends #568

Open dcapeluto opened 2 years ago

dcapeluto commented 2 years ago

The problem: It is hard to know an adequate total_time_limit for a specific training scenario in AutoML. The time limit depends on the data size, the training machine's capability, and the algorithms chosen. Even a seemingly outrageously high limit can turn out not to be enough for AutoML to train all default algorithms, cutting the training short: AutoML moves on to the Ensemble/Stack steps with only a limited set of trained default algorithms.
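
For context, here is a minimal sketch of the kind of run I mean (the dataset, the results directory name and the 4-hour budget are just placeholders):

```python
from sklearn.datasets import load_breast_cancer
from supervised import AutoML

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# The total time budget (in seconds) covers all algorithms and tuning steps.
# If it runs out, the remaining models are skipped and the Ensemble/Stack
# steps are built from whatever finished in time.
automl = AutoML(
    results_path="AutoML_results",  # directory with the models and JSON state files
    mode="Compete",
    total_time_limit=4 * 3600,      # 4 hours; hard to know in advance if this is enough
)
automl.fit(X, y)
```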

Inconvenient workaround: Obviously this can be worked around by retraining from scratch with a higher total_time_limit, but if several hours or days were already spent training models, that work cannot be reused to continue the training.

Solution: If there were a feature to "continue training" after training gracefully ends and pick up where it left off (only for cases where some training was skipped due to time constraints), it would make MLJAR more flexible.

Maybe a workaround?: Is there a manual way of achieving this (I suspect it involves tweaking params.json, progress.json and the existing folders)? If anyone knows how it's done, I could potentially write a method that makes the adjustments to these JSON files and submit the code for review and merge.
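
Something along these lines could be a starting point for such a method: just load the state files and see what they record. This is a sketch only; I have not verified the schema of params.json / progress.json, and the results directory name is a placeholder.

```python
import json
from pathlib import Path

results_path = Path("AutoML_results")  # placeholder for the AutoML results directory

# Inspect the state files AutoML writes while training.
# Their exact schema is not confirmed here, so this only prints what is there.
for name in ("params.json", "progress.json"):
    state_file = results_path / name
    if state_file.exists():
        with open(state_file) as f:
            state = json.load(f)
        keys = list(state.keys()) if isinstance(state, dict) else type(state).__name__
        print(f"{name}: {keys}")
    else:
        print(f"{name}: not found")
```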

PS: Overall, amazing AutoML library. Very easy to use and highly recommended for any "citizen data scientist" looking to get useful results without acquiring a data science degree first... ;)

AkshayNovacene commented 7 months ago

Hey @dcapeluto, were you able to find a solution for this? Is it possible to retrain() or "continue training" in AutoML?

dcapeluto commented 7 months ago

@AkshayNovacene No, unfortunately. I tried mimicking the internal files to simulate an interruption in training, so that it would continue instead of thinking it had finished, but I couldn't get it to work. I had to restart every time.

AkshayNovacene commented 7 months ago

> @AkshayNovacene No, unfortunately. I tried mimicking the internal files to simulate an interruption in training, so that it would continue instead of thinking it had finished, but I couldn't get it to work. I had to restart every time.

Ah, that's unfortunate. Do you think there's a workaround for this?

dcapeluto commented 7 months ago

Yes. Look at the files created while training; they contain the status and progress of the training. You can probably mimic a training interruption, as opposed to training completion, so that when you relaunch the training the program "sees" that the training was interrupted and picks up where it left off, assuming you also increase the training time. @pplonski probably knows off the top of his head how to do this.

pplonski commented 7 months ago

Hi @AkshayNovacene,

As @dcapeluto wrote, you can try to mimic a training interruption with a longer training time. Maybe a change in the tuner will also be required, to generate more unique hyper-parameter configurations.
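
For illustration, a sketch of what the "mimic the interruption and rerun" idea could look like, assuming the JSON state files were edited beforehand so the run looks interrupted rather than finished (untested; the results directory name, the data and the new time budget are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from supervised import AutoML

X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # same data as the original run

# Point AutoML at the existing results directory with a larger time budget.
# Assumption: params.json / progress.json were edited beforehand so the run
# looks interrupted; without that, AutoML may refuse the non-empty directory
# (the "Directory ... is not empty" error reported in the next comment).
automl = AutoML(
    results_path="AutoML_results",
    total_time_limit=8 * 3600,  # larger than the original budget
)
automl.fit(X, y)
```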

AkshayNovacene commented 7 months ago

@dcapeluto and @pplonski Thanks for the replies. But the AutoML library throws an error: "Cannot set directory for AutoML. Directory '{path}' is not empty." when we try to use a path with already trained models. I know it is too much to ask, but could you help me with an example of how to mimic an interrupted training run and make it continue?