Open · MagicBOTAlex opened this issue 3 months ago
Hi @MagicBOTAlex,
Thanks for reaching out. I tested the guide on a 2-GPU VM and observed that it is not working. Logs attached below.
With `start_method="fork"`:
(tf2.13) suryanarayanay@surya-ubuntu20:~$ python distributed_training_with_torch.py
Running on 2 GPUs
Traceback (most recent call last):
File "/home/suryanarayanay/distributed_training_with_torch.py", line 260, in <module>
torch.multiprocessing.start_processes(
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/suryanarayanay/distributed_training_with_torch.py", line 232, in per_device_launch_fn
setup_device(current_gpu_index, num_gpu)
File "/home/suryanarayanay/distributed_training_with_torch.py", line 207, in setup_device
torch.cuda.set_device(device)
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
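The traceback spells out the constraint: a forked worker inherits the parent's already-initialized CUDA context, so CUDA workers must be started with the "spawn" method. A minimal stdlib sketch of the start-method API (no CUDA or torch involved, purely to illustrate the difference):

```python
import multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # With "spawn", each child starts a fresh interpreter, so CUDA could be
    # initialized safely inside the worker; "fork" would inherit the parent's
    # already-initialized CUDA state and fail as in the traceback above.
    print(f"rank {rank} of {world_size}")


if __name__ == "__main__":
    # Request the "spawn" start method explicitly via a context object.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(i, 2)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # All workers should exit cleanly.
    assert all(p.exitcode == 0 for p in procs)
```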
With `start_method="spawn"`:
(tf2.13) suryanarayanay@surya-ubuntu20:~$ python distributed_training_with_torch.py
Running on 2 GPUs
Running on 2 GPUs
Running on 2 GPUs
x_train shape: (60000, 28, 28, 1)
Traceback (most recent call last):
File "/home/suryanarayanay/distributed_training_with_torch.py", line 260, in <module>
torch.multiprocessing.start_processes(
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 145, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
Hmm, you seem to have a different error compared to me. I don't know if it is the same or not.
The error "Functional has no attribute parameters" seems to suggest you're not using the torch backend. Try printing `keras.backend.backend()` and see what you get. Note that the original notebook uses the env variable `KERAS_BACKEND` to set the backend.
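For reference, a minimal way to verify which backend Keras 3 picked up (a sketch, assuming Keras 3 with the torch backend is installed; the variable must be set before the first `keras` import):

```python
import os

# KERAS_BACKEND is read once, at import time, so it must be set
# before keras is imported anywhere in the process.
os.environ["KERAS_BACKEND"] = "torch"

import keras  # noqa: E402

print(keras.backend.backend())  # should print "torch"
```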
I ran the demo again and called `keras.backend.backend()` at the end. I think I'm getting new errors now.
Using "fork":
Using "spawn":
The code I used, without the `keras.backend.backend()` call:
https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/distributed_training_with_torch.ipynb
It runs in a terminal but not in notebooks. Notebooks only support `"fork"` as the `start_method`, but `"fork"` is not supported by torch with CUDA. `"spawn"` is supported by torch but not by notebooks.
We will leave this issue open to see if there are new solutions coming up.
The current solution is just to download the notebook as a `.py` file and run it from the terminal.
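For completeness, the download-and-run workaround can also be done from the command line; this sketch assumes `jupyter`/`nbconvert` is installed and the notebook has been saved locally:

```shell
# Convert the notebook to a plain Python script (requires nbconvert)
jupyter nbconvert --to script distributed_training_with_torch.ipynb

# Run it from a regular terminal session, where "spawn" is available
python distributed_training_with_torch.py
```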
I'm running into the same issue, even in non-notebooks. Any updates so far @haifeng-jin?
As the title says, I'm having problems running the example code given here: Multi-GPU distributed training with PyTorch
I want to know if this error is only on my end or if it is reproducible. I have also tried setting the start method to "start", with no help.
Note: I'm kind of new to Keras and I'm trying to learn.