optuna / optuna-integration

Extended functionalities for Optuna in combination with third-party libraries.
https://optuna-integration.readthedocs.io/en/latest/index.html
MIT License

Extension of `lightning` DDP support #59

Open bbrier opened 8 months ago

bbrier commented 8 months ago

Motivation

I am attempting to use Optuna for hyperparameter optimization of a complex, Lightning-based deep learning framework. It is essential for this framework to run in a distributed setting. In the distributed Lightning integration example, `ddp_spawn` is used as the strategy, which Lightning strongly discourages because of speed and flexibility concerns (for example, the inability to use a large `num_workers` value without bottlenecking, which is essential for my use case). Attempting to use the regular DDP strategy, however, results in Optuna generating a different set of hyperparameters for each rank, since my Optuna main script is executed once per rank. I have considered launching my distributed main script as a subprocess started inside the objective function, but that would prevent me from using `PyTorchLightningPruningCallback`, since I cannot reliably pass that object to the subprocess.
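
For illustration, here is a minimal sketch of the failing pattern. `ToyModule` and the random dataset are just placeholders for my actual framework, and the `lightning.pytorch` import path is an assumption about the Lightning version; the point is only that `strategy="ddp"` re-executes the whole script on every rank, so each rank builds its own study and samples its own values.

```python
import optuna
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModule(pl.LightningModule):
    # Placeholder for the real LightningModule in my framework.
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)


def objective(trial):
    # Each DDP rank re-runs this script, so each rank has its own study object
    # and this call returns a different value per rank.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    data = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
    trainer = pl.Trainer(strategy="ddp", devices=2, accelerator="cpu", max_epochs=1,
                         enable_progress_bar=False, logger=False)
    trainer.fit(ToyModule(lr), data)
    return trainer.callback_metrics["train_loss"].item()


study = optuna.create_study(direction="minimize")  # a separate in-memory study per rank
study.optimize(objective, n_trials=5)
```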

Description

My suggestion is to add a way for Optuna to run with regular DDP, perhaps by tracking in the storage whether DDP is being used, so that when `study.optimize` is called the correct trial is produced and the trial's suggest methods return the same hyperparameters across ranks. I do not know enough about the internal workings of Optuna to judge whether this is feasible. Is this something that could be supported in the future?
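
To make the "same hyperparameters across ranks" part concrete, this is the kind of behaviour I have in mind. It is only a sketch: `synced_suggest_float` is a hypothetical helper, not an existing Optuna API, and it assumes `torch.distributed` is already initialised and that only rank 0 holds a real trial.

```python
import torch
import torch.distributed as dist


def synced_suggest_float(trial, name, low, high, log=False):
    # Rank 0 owns the real trial and does the sampling; the sampled value is then
    # broadcast so every rank trains with identical hyperparameters.
    value = torch.zeros(1)
    if dist.get_rank() == 0:
        value[0] = trial.suggest_float(name, low, high, log=log)
    dist.broadcast(value, src=0)
    return value.item()
```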

Alternatives (optional)

No response

Additional context (optional)

No response

kirilk-ua commented 1 day ago

You can use `optuna.integration.TorchDistributedTrial` for DDP mode. There is an example: https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py
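
For reference, the core pattern in that example looks roughly like this (condensed; `train_and_evaluate` is a placeholder for the actual DDP training loop): rank 0 drives the study, the other ranks call the objective with `None`, and `TorchDistributedTrial` broadcasts every suggest/report call so all ranks see the same values.

```python
import optuna
import torch.distributed as dist
from optuna.integration import TorchDistributedTrial

N_TRIALS = 20


def train_and_evaluate(lr):
    # Placeholder for the real distributed training/validation loop.
    return lr


def objective(single_trial):
    trial = TorchDistributedTrial(single_trial)  # single_trial is None on non-zero ranks
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)  # identical on every rank
    return train_and_evaluate(lr)


if __name__ == "__main__":
    dist.init_process_group("gloo")
    if dist.get_rank() == 0:
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=N_TRIALS)
    else:
        for _ in range(N_TRIALS):
            try:
                objective(None)  # participates in the broadcasts driven by rank 0
            except optuna.TrialPruned:
                pass
```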

For Lightning there is also an example, but it only covers ddp_spawn mode: https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py

When I try to run it in plain DDP mode with TorchDistributedTrial and Lightning, the hyperparameters are synchronized, but:

1. `PyTorchLightningPruningCallback` crashes, since it uses `trial._study` internally, which is not implemented in `TorchDistributedTrial`.
2. For some reason, running in distributed mode with `TorchDistributor().run()` gives an unpredictable number of trials, and I can't find a way to debug it. I tried `optuna.logging.set_verbosity(optuna.logging.DEBUG)`, but there are no additional logs (I expected to see logs of the pruning decisions); see the logging snippet below.
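
On point 2, one thing that might surface those records: Optuna installs its own log handler, so `set_verbosity` alone may not show messages coming from worker processes. This is just a sketch of how I would try to debug it, not a confirmed fix.

```python
import logging

import optuna

logging.basicConfig(level=logging.DEBUG)            # make the root logger print DEBUG records
optuna.logging.set_verbosity(optuna.logging.DEBUG)
optuna.logging.enable_propagation()                 # forward Optuna records to the root logger
optuna.logging.disable_default_handler()            # avoid duplicate lines from Optuna's handler
```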

If someone has a working example of using Lightning with Optuna in plain DDP mode (which is what Lightning recommends), that would be great.