atc3 opened this issue 1 month ago
The `distribute_workers` flag is also accessible from the `dacapo.yaml` file by adding this bit to it:

```yaml
compute_context:
  type: LocalTorch
  config:
    distribute_workers: True
```
Or maybe that's what you did?
Haha, I totally forgot about configuring with the yaml file. I changed the default value in the `LocalTorch` class, but I think the end result was the same, in that setting `distribute_workers` to `True` stopped the crash from happening.
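Roughly, what that amounts to is something like this (a rough sketch, assuming `LocalTorch` exposes `distribute_workers` as a constructor argument rather than only as a class default; the import path may differ between dacapo versions):

```python
# Rough sketch: assumes LocalTorch accepts distribute_workers as a keyword
# argument, mirroring the dacapo.yaml config above; the import path may
# differ between dacapo versions.
from dacapo.compute_context import LocalTorch

compute_context = LocalTorch(distribute_workers=True)
```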
What do you think about just changing the default, though? If `distribute_workers` is false, what is the intended behavior on local machines?
@rhoadesScholar?
Describe the bug
When running `cosem_example.ipynb` on a local workstation with GPUs, the validation step during training throws the following error:
If I directly call `validate_run` outside of `train_run`, I get the same error. Happy to provide a full stack trace if it helps.
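The direct call was something along these lines (a sketch only: the run name here is a placeholder, and the exact `validate_run` and config-store signatures are assumptions that may not match the installed dacapo version):

```python
# Hypothetical reproduction sketch: the run name "cosem_example_run" and the
# exact signatures of validate_run and the config-store helpers are
# assumptions, not verified against the installed dacapo version.
from dacapo.experiments import Run
from dacapo.store.create_store import create_config_store
from dacapo.validate import validate_run

config_store = create_config_store()
run = Run(config_store.retrieve_run_config("cosem_example_run"))

# Validating a trained iteration directly triggers the same crash as the
# validation step inside train_run.
validate_run(run, iteration=2000)
```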
I tried to fix this issue by explicitly setting the torch multiprocessing start method to `spawn` (sketched below), but then I got a different error and decided not to go too deep into that hole. I then got around this error by enabling `distribute_workers` in the `LocalTorch` compute context, and this somehow fixes the issue.
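The spawn attempt looked roughly like this (where exactly this needs to be called relative to dacapo's worker startup is an assumption):

```python
# Sketch of the attempted workaround: switch torch multiprocessing from the
# default "fork" start method (on Linux) to "spawn" before any workers start.
import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
```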
To Reproduce
Just run `cosem_example.ipynb` on any local workstation with a GPU.
Versions: