Thank you @avanikop for catching this and helping me track down the source of the issue.
As you pointed out by email:

> One (unrelated) thing I noticed is, it creates two "result_datestamp_timestamp" folders during the same run, and the folder with the later timestamp has no max val checkpoint file in the checkpoints folder.
I think you were talking about what happens when you run `vak train`, but you made me realize that the same issue I'm seeing in #742 might also be causing this error with predict: it's because lightning is defaulting to a "distributed" strategy.

I can verify that lightning running in distributed mode is indeed the source of the bug 😩 If I run on a machine with multiple GPUs, I reproduce your bug with predict.
A workaround for now is to do the following before you run vak:
export CUDA_VISIBLE_DEVICES=0
Basically you force lightning to not run in distributed mode by making it see there's only one GPU.
If I do this then `vak predict` runs without this error, and the same workaround applies for `vak learncurve` and, presumably, `vak train`.
Thank you for pointing out you were seeing an extra folder get generated for train -- I thought that was only happening with learncurve. You got me to the root of the problem.
My guess for what's going on is that something about how lightning runs in distributed mode causes us to end up with some keys missing from the dictionary returned by the `predict` method.
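To illustrate that guess (a hedged sketch, not vak's actual code): under a distributed strategy, Lightning shards the prediction dataloader across processes with a `DistributedSampler`, so any one process only predicts on its share of the files. If results are then collected into a dictionary keyed by file, the keys for files assigned to other ranks never show up.

```python
# Hedged illustration, not vak's code: how a distributed strategy shards a
# dataset across processes, so one process's predictions are missing keys.
from torch.utils.data.distributed import DistributedSampler

files = [f"audio_{i}.wav" for i in range(10)]  # hypothetical file names

# Simulate what rank 0 of 2 processes sees under DDP (no shuffling).
sampler = DistributedSampler(files, num_replicas=2, rank=0, shuffle=False)
seen_by_rank_0 = {files[i] for i in sampler}

missing = set(files) - seen_by_rank_0
print(sorted(missing))  # half the files never appear in this rank's results
```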
Just so it's clear what I did: you can do this in two lines:
$ export CUDA_VISIBLE_DEVICES=0
$ vak predict your_config.toml
or in one (in a way that doesn't "export" the variable to the environment):
$ CUDA_VISIBLE_DEVICES=0 vak predict your_config.toml
Since we're seeing it in train + predict, this means I need to fix this bug sooner rather than later. I've been needing to do this for my own experiments anyway.
The fix will be something like adding a `gpus` option that gets passed directly to `lightning.Trainer`, and then defaulting to a single device.
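For concreteness, here is a minimal sketch of that kind of fix, assuming Lightning 2.x (where the `Trainer` argument is `devices` rather than the older `gpus`); the helper name below is hypothetical, not vak's API:

```python
# Hedged sketch, not the actual patch: default Lightning to a single device
# instead of letting it auto-select a distributed strategy.
import lightning

def get_default_trainer(accelerator="auto", devices=1, **kwargs):
    """Build a Trainer pinned to one device unless the caller overrides it."""
    return lightning.Trainer(accelerator=accelerator, devices=devices, **kwargs)

# devices=1 keeps Lightning out of distributed mode, the same effect as
# setting CUDA_VISIBLE_DEVICES=0 by hand.
trainer = get_default_trainer()
```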
I will raise a separate issue with a fix.
Fixed by #752
Facing problems with the `vak predict` step. Tried setting `save_net_output = true` as well as `= false` and got the same error. I don't know if this belongs on vocalpy or here, but I have been having problems during the `vak predict predict.toml` step. It always gives this error:
It is not file-specific - I changed the dataset and got the same result.
Working on a cluster with multiple GPUs. Possible solution suggested already: `CUDA_VISIBLE_DEVICES=0 vak predict my_config` seems to work for now.