microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

[Lightgbm] Support saving checkpoints during training #911

Open ce39906 opened 4 years ago

ce39906 commented 4 years ago

Is your feature request related to a problem? Please describe.
Hi @imatiach-msft, I'm using mmlspark LightGBM to train a ranking model that takes about 9 to 10 hours on a large dataset. Spark is deployed on YARN, and the training job has failed several times because of lost nodes in the YARN cluster. Could mmlspark LightGBM support saving checkpoints during training, so that training can be resumed next time instead of starting over?

Describe the solution you'd like
A clear and concise description of what you want to happen.

Additional context
Add any other context or screenshots about the feature request here.

imatiach-msft commented 4 years ago

@ce39906 I remember working on this in the branch https://github.com/imatiach-msft/mmlspark/commits/ilmat/save-interm, but I got stuck because the model could not be saved from a worker node; it could only be saved by Spark from the driver. For me, that's the key part of figuring out how to enable this feature. I could do something hacky, like writing the model to a file on one of the worker nodes and saving it from there, but I wasn't sure which API to use so that it would work on all possible Spark environments (different clusters such as Cloudera, Azure Databricks, HDI, and Spark standalone, and different clouds such as AWS and Azure, have very different APIs and ways of saving files).
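One possible direction along these lines, sketched below purely for illustration, is to write the booster's model string from a worker through the Hadoop FileSystem API, which resolves the backing store (HDFS, ABFS, S3A, local FS, ...) from the URI scheme of the target path. This assumes the executors can reach the same shared storage as the driver; the `saveCheckpoint` helper and `checkpointPath` parameter are hypothetical names, not part of mmlspark's API.

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper (not part of mmlspark): write an intermediate LightGBM
// model string from an executor to shared storage. The Hadoop FileSystem API
// picks the concrete implementation (HDFS, ABFS, S3A, local FS, ...) from the
// scheme of `checkpointPath`, so the same code can run on different clusters
// as long as the matching connector is on the classpath.
def saveCheckpoint(modelString: String,
                   checkpointPath: String,
                   hadoopConf: Configuration): Unit = {
  val path = new Path(checkpointPath)
  val fs = FileSystem.get(path.toUri, hadoopConf)
  val out = fs.create(path, /* overwrite = */ true)
  try {
    out.write(modelString.getBytes(StandardCharsets.UTF_8))
  } finally {
    out.close()
  }
}
```

One open detail in this sketch is that a Hadoop `Configuration` is not serializable as-is, so the driver's configuration would have to be shipped to executors through some serializable wrapper or broadcast; where to hook such a call into the training loop is exactly the question this issue leaves unresolved.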

ce39906 commented 4 years ago

Thanks for your reply, I'll take a look.