microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.02k stars 1.81k forks

How to re-activate gpus after sending CommandType.NoMoreTrialJobs in advisor #2233

Closed (xingwangsfu closed this issue 4 years ago)

xingwangsfu commented 4 years ago

Hi, I'm trying to implement my own advisor and have run into a problem. I use trialConcurrency = 3 in a multi-GPU environment (gpu0, gpu1, and gpu2). In some cases there are no more hyper configs, and I want to synchronize across the GPUs before generating more. So gpu0 sends CommandType.NoMoreTrialJobs and waits for gpu1 and gpu2 to finish their jobs. This trick works, but gpu0 then becomes inactive and is never used by NNI again.

My questions are: a) Is there a way to re-activate gpu0 after sending CommandType.NoMoreTrialJobs? b) Is there a better way to synchronize across multiple GPUs? I also tried making gpu0 wait for gpu1 and gpu2 with a while loop and time.sleep, but this gave unexpected results.

Thanks.

QuanluZhang commented 4 years ago

@xingwangsfu Here is how nnimanager interacts with the tuner: every time nnimanager finds that a GPU has become available again, it asks the tuner for a hyper config. Note that nnimanager asks only once per available GPU. If the tuner does not respond with a hyper config, that GPU is wasted. In your case, the effective concurrency drops from 3 to 2.
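As a toy illustration of that accounting (not NNI's actual scheduler; the function name `effective_concurrency` and its arguments are hypothetical), each free GPU triggers exactly one request, and a request answered with nothing leaves that slot idle:

```python
def effective_concurrency(total_slots, tuner_answers):
    """Count the slots that end up running a trial.

    tuner_answers holds one entry per request: a config dict, or None when
    the tuner replies with NoMoreTrialJobs. Toy model only, not NNI code.
    """
    busy = 0
    for answer in tuner_answers[:total_slots]:
        if answer is not None:
            busy += 1  # the slot receives a trial and stays in use
        # a None answer is never retried, so that slot stays idle
    return busy


# Three GPUs, but the third request gets no config back.
remaining = effective_concurrency(3, [{"lr": 0.1}, {"lr": 0.01}, None])
```

Under this model `remaining` is 2, which matches the drop from 3 to 2 described above.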

For your first question (how to re-activate gpu0, that is, how to bring concurrency back from 2 to 3): when your tuner returns NoMoreTrialJobs, record that in your code (our practice is to use a counter). When new hyper configs are generated, send one more hyper config with the following code:

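import json_tricks
from nni.protocol import CommandType, send  # NNI 1.x module path; may differ in other releases

# saved_parameter_id: the parameter id you got when you returned NoMoreTrialJobs
# new_config: a newly generated hyper config
ret = {
    'parameter_id': saved_parameter_id,
    'parameter_source': 'algorithm',
    'parameters': new_config
}
send(CommandType.NewTrialJob, json_tricks.dumps(ret))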

For your second question, you can use the approach above to control dependencies between trials rather than between GPUs, which I guess would satisfy your requirement.
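A minimal sketch of the bookkeeping described above, under some assumptions: `PendingRequests` and its method names are hypothetical, `send` is a local stub that only records calls (a real advisor would use NNI's `send` with `CommandType` values), the standard `json` module stands in for `json_tricks`, and the payload sent alongside NoMoreTrialJobs is a guess. It uses a queue of waiting parameter ids in place of a plain counter so that each late config can be paired with the id it answers.

```python
import json
from collections import deque

sent = []  # records what would be sent to nnimanager


def send(command, payload):
    """Hypothetical stub for the protocol send(); it only records the call."""
    sent.append((command, payload))


class PendingRequests:
    """Tracks requests answered with NoMoreTrialJobs so they can be filled later."""

    def __init__(self):
        self.waiting_ids = deque()  # parameter ids still owed a config

    def on_request(self, parameter_id, config=None):
        """Answer one request; config is None when nothing is ready yet."""
        if config is None:
            self.waiting_ids.append(parameter_id)  # remember the idle slot
            self._send("NoMoreTrialJobs", parameter_id, "")
        else:
            self._send("NewTrialJob", parameter_id, config)

    def on_new_configs(self, configs):
        """Give fresh configs to waiting ids; return the configs left over."""
        configs = list(configs)
        while self.waiting_ids and configs:
            self._send("NewTrialJob", self.waiting_ids.popleft(), configs.pop(0))
        return configs

    def _send(self, command, parameter_id, parameters):
        ret = {
            "parameter_id": parameter_id,
            "parameter_source": "algorithm",
            "parameters": parameters,
        }
        send(command, json.dumps(ret))


pending = PendingRequests()
pending.on_request(0, {"lr": 0.1})
pending.on_request(1, {"lr": 0.01})
pending.on_request(2)  # nothing ready, so this slot goes idle
# Later, once the other trials finish and new configs exist:
leftover = pending.on_new_configs([{"lr": 0.001}, {"lr": 0.0001}])
```

Here the third request is parked, then filled by the first new config, and the second new config is returned in `leftover` for the next regular request. Holding configs back until earlier trials report is also how this pattern expresses a dependency between trials without blocking inside the advisor.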

xingwangsfu commented 4 years ago

@QuanluZhang Thanks a lot! That solved my problem.