vocalpy / vak

A neural network framework for researchers studying acoustic communication
https://vak.readthedocs.io
BSD 3-Clause "New" or "Revised" License

BUG: Running lightning with default strategy 'DDP' breaks learncurve function #742

Closed · NickleDave closed 6 months ago

NickleDave commented 7 months ago

This happens because the 'DDP' strategy spawns multiple processes.

https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel

From the docs: "This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment variables."

This ends up causing vak to create multiple results directories (one created by each spawned process); learncurve then looks in the wrong results directory for checkpoints and fails with:

ValueError: did not find a single checkpoint path, instead found:
[]
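(For context: under DDP each spawned process re-runs the same setup code, so any directory creation that isn't guarded by process rank happens once per GPU. A minimal sketch of the kind of guard that would avoid this; `make_results_dir` is a hypothetical stand-in, not vak's actual code:)

```python
import os
from datetime import datetime

from lightning.pytorch.utilities import rank_zero_only


@rank_zero_only
def make_results_dir(root):
    # Only the rank-0 process creates the directory. Without this guard,
    # every process spawned by the DDP strategy would create its own
    # results directory, each with a slightly different timestamp.
    results_dir = os.path.join(
        root, datetime.now().strftime("results_%y%m%d_%H%M%S")
    )
    os.makedirs(results_dir)
    return results_dir
```

Note that on non-rank-0 processes a `rank_zero_only`-decorated function returns None, so the path would still need to be broadcast or re-derived for the other processes.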

A workaround for now is to set an environment variable that forces vak / lightning to run on a single GPU:

$ export CUDA_VISIBLE_DEVICES=0
$ vak learncurve config.toml
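The same restriction should also work from Python, for anyone using vak as a library rather than through the CLI; the one gotcha is that the variable has to be set before anything initializes CUDA. A sketch, assuming a plain script:

```python
import os

# Must be set before torch / lightning / vak initialize CUDA,
# i.e. before any import that touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import vak  # noqa: E402
```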

An annoyingly dumb fix for this might be to just make learncurve one giant function instead of calling train and then eval? Not sure I can engineer something smarter (i.e., an alternative strategy) that would make the CLI work relatively painlessly.

NickleDave commented 7 months ago

Another quick fix might be to default to single-device training for now, since this is fine for most of our models.

If someone needs all the GPUs, we should document it as "this is a case where you'll need to move from the CLI to using vak in a script".

I can't actually figure out if it's easy to just tell lightning "use a single GPU", i.e., whether there's a string I can pass in to "strategy":

https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.SingleDeviceStrategy.html
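For reference, two ways I believe get single-device behavior in Lightning 2.x; I'm not sure there's a plain string alias for the strategy, but asking for exactly one device, or passing the strategy object itself, should work. A sketch:

```python
import lightning.pytorch as pl
from lightning.pytorch.strategies import SingleDeviceStrategy

# Option 1: requesting exactly one device makes Lightning fall back
# to single-device training on its own, with no 'strategy' argument.
trainer = pl.Trainer(accelerator="gpu", devices=1)

# Option 2: pass the strategy object explicitly.
trainer = pl.Trainer(strategy=SingleDeviceStrategy(device="cuda:0"))
```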

NickleDave commented 6 months ago

Fixed by #752