NickleDave closed this issue 6 months ago.
Another quick fix might be to default to single-device training for now, since this is fine for most of our models.
If someone needs all the GPUs, we should document that "this is a case where you'll need to move from the CLI to using vak in a script".
I can't actually figure out if it's easy to just tell Lightning "use a single GPU", like whether there's a string I can pass in to "strategy":
https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.SingleDeviceStrategy.html
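If we end up building the `Trainer` ourselves, pinning it to one device looks straightforward; here's a rough sketch assuming the Lightning 2.x API (argument names are taken from the docs above, not from anything already in vak):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import SingleDeviceStrategy

# Asking for exactly one device should make Lightning fall back to
# SingleDeviceStrategy on its own, instead of picking DDP
trainer = Trainer(accelerator="gpu", devices=1)

# ...or the strategy can be passed in explicitly as an object
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=SingleDeviceStrategy(device="cuda:0"),
)
```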
Fixed by #752
This happens because the 'DDP' strategy spawns multiple processes
https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel
and this ends up causing vak to create multiple results directories (one created by each process) and then not look in the correct results directory to find checkpoints.
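For context on why each process makes its own directory: the results directory name includes a timestamp, so presumably every spawned process ends up creating a different one. A rough sketch of one possible guard, using Lightning's `rank_zero_only` (`make_results_dir` here is a made-up helper, not vak's actual code):

```python
from datetime import datetime
from pathlib import Path

from lightning.pytorch.utilities import rank_zero_only


@rank_zero_only
def make_results_dir(root: str) -> Path:
    """Create a timestamped results directory on rank 0 only."""
    results_dir = Path(root) / f"results_{datetime.now():%y%m%d_%H%M%S}"
    results_dir.mkdir(parents=True, exist_ok=True)
    return results_dir


# Caveat: on non-zero ranks the decorated function returns None, so the
# path would still need to be broadcast to the other processes -- part of
# why there's no painless fix here.
```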
A workaround for now is to set an environment variable to force `vak`/`lightning` to run on a single GPU.
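The variable isn't named here, but presumably it's `CUDA_VISIBLE_DEVICES`; it has to be set before `torch`/`lightning` get imported, or exported in the shell before invoking the vak CLI. A minimal sketch:

```python
import os

# Hide every GPU except one *before* torch / lightning get imported,
# so Lightning only ever sees a single device and never selects DDP.
# (CUDA_VISIBLE_DEVICES is an assumption; the variable isn't named above.)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```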
An annoyingly dumb fix for this might be to just make `learncurve` one giant function instead of calling `train` then `eval`? Not sure I can engineer something smarter (i.e. an alternative strategy) that would make the CLI work relatively painlessly.