janelia-cellmap / dacapo

A framework for easy application of established machine learning techniques on large, multi-dimensional images.
https://janelia-cellmap.github.io/dacapo/
BSD 3-Clause "New" or "Revised" License

Multiprocessing error during validation in LocalTorch compute context #298

Open atc3 opened 1 month ago

atc3 commented 1 month ago

Describe the bug

When running cosem_example.ipynb on a local workstation with GPUs, the validation step during training throws the following error:

...
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

If I directly call validate_run outside of train_run, I get the same error:

from dacapo import validate_run

validate_run("cosem_distance_run_4nm", 2000)
Creating FileConfigStore:
    path: /home/chena2@hhmi.org/dacapo/configs
Creating local weights store in directory /home/chena2@hhmi.org/dacapo
Retrieving weights for run cosem_distance_run_4nm, iteration 2000
Validating run cosem_distance_run_4nm at iteration 2000...
Creating FileStatsStore:
    path    : /home/chena2@hhmi.org/dacapo/stats
Validating run cosem_distance_run_4nm on dataset jrc_hela-2_recon-1/labels/groundtruth/crop6/[mito]_gt_jrc_hela-2_recon-1/labels/groundtruth/crop6/mito_s1_uint8_None_4nm
validation inputs already copied!
Predicting with input size (2304, 2304, 2304), output size (848, 848, 848)
Total input ROI: [11272:13728, 872:3328, 11352:13808] (2456, 2456, 2456), output ROI: [12000:13000, 1600:2600, 12080:13080] (1000, 1000, 1000)
Running blockwise prediction with worker_file:  /home/chena2@hhmi.org/dacapo-ml/dacapo/blockwise/predict_worker.py
Running blockwise with worker_file:  /home/chena2@hhmi.org/dacapo-ml/dacapo/blockwise/predict_worker.py
Using compute context: LocalTorch(distribute_workers=False, _device=None, oom_limit=4.2)
ERROR:daisy.worker:worker (hostname=10.101.50.108:port=35859:task_id=predict_worker2024-09-25_16-08-03:worker_id=2) received exception: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Happy to provide a full stack trace if it helps.

I tried to fix this by explicitly setting the torch multiprocessing start method to 'spawn', but that produced a different error and I decided not to go deeper down that hole. I then worked around the original error by enabling distribute_workers in the LocalTorch compute context, which somehow fixes the issue.
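
For reference, by "setting the torch multiprocessing start method" I mean something along these lines (a minimal sketch; the exact placement is presumably what matters, since it has to run before any worker processes exist, and this is not what dacapo does internally):

import torch.multiprocessing as mp

# Ask torch/multiprocessing to use 'spawn' instead of 'fork' so child
# processes start with a fresh interpreter and no inherited CUDA context.
# Must be called before any worker processes are created.
mp.set_start_method("spawn", force=True)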

To Reproduce

Just run cosem_example.ipynb on any local workstation with a GPU.

Versions:

vaxenburg commented 1 month ago

The distribute_workers flag is also accessible from the dacapo.yaml file by adding this bit to it:

compute_context:
  type: LocalTorch
  config:
    distribute_workers: True

Or maybe that's what you did?

atc3 commented 1 month ago

Haha I totally forgot about configuring with the yaml file. I changed the default value in the LocalTorch class, but I think the end result was the same: setting distribute_workers to True stopped the crash from happening.
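
For the record, what I changed is effectively the same flag the yaml snippet above sets; something like this illustrates it (a hypothetical sketch, with the import path assumed from the repo layout, and in practice dacapo builds the context from dacapo.yaml rather than from a locally constructed object):

from dacapo.compute_context import LocalTorch  # import path is an assumption

# The log above shows LocalTorch(distribute_workers=False, ...); the workaround
# is the same context with the flag flipped, however it ends up configured.
context = LocalTorch(distribute_workers=True)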

What do you think about just changing the default though? If distribute_workers is false, what is the intended behavior on local machines?

vaxenburg commented 1 month ago

@rhoadesScholar ?