Open · MagicBOTAlex opened this issue 3 months ago
Hi @MagicBOTAlex,
Thanks for reaching out. I tested the guide on a 2-GPU VM and observed that it is not working. Logs attached below.
With `start_method="fork"`:
(tf2.13) suryanarayanay@surya-ubuntu20:~$ python distributed_training_with_torch.py
Running on 2 GPUs
Traceback (most recent call last):
File "/home/suryanarayanay/distributed_training_with_torch.py", line 260, in <module>
torch.multiprocessing.start_processes(
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/suryanarayanay/distributed_training_with_torch.py", line 232, in per_device_launch_fn
setup_device(current_gpu_index, num_gpu)
File "/home/suryanarayanay/distributed_training_with_torch.py", line 207, in setup_device
torch.cuda.set_device(device)
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
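The traceback spells out the constraint: a forked worker inherits the parent's already-initialized CUDA context, so CUDA workers must be started with the "spawn" method. A minimal stdlib sketch of the start-method API (no CUDA or torch involved, purely to illustrate the difference):

```python
import multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # With "spawn", each child starts a fresh interpreter, so CUDA could be
    # initialized safely inside the worker; "fork" would inherit the parent's
    # already-initialized CUDA state and fail as in the traceback above.
    print(f"rank {rank} of {world_size}")


if __name__ == "__main__":
    # Request the "spawn" start method explicitly via a context object.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(i, 2)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # All workers should exit cleanly.
    assert all(p.exitcode == 0 for p in procs)
```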
With `start_method="spawn"`:
(tf2.13) suryanarayanay@surya-ubuntu20:~$ python distributed_training_with_torch.py
Running on 2 GPUs
Running on 2 GPUs
Running on 2 GPUs
x_train shape: (60000, 28, 28, 1)
Traceback (most recent call last):
File "/home/suryanarayanay/distributed_training_with_torch.py", line 260, in <module>
torch.multiprocessing.start_processes(
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/home/suryanarayanay/miniconda3/envs/tf2.13/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 145, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
Hmm, you seem to have a different error compared to me. I don't know if it is the same or not.
The error "Functional has no attribute parameters" seems to suggest you're not using the torch backend. Try printing `keras.backend.backend()` and see what you get. Note that the original notebook uses the env variable `KERAS_BACKEND` to set the backend.
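For reference, a minimal way to verify which backend Keras 3 picked up (a sketch, assuming Keras 3 with the torch backend is installed; the variable must be set before the first `keras` import):

```python
import os

# KERAS_BACKEND is read once, at import time, so it must be set
# before keras is imported anywhere in the process.
os.environ["KERAS_BACKEND"] = "torch"

import keras  # noqa: E402

print(keras.backend.backend())  # should print "torch"
```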
I ran the demo again and called `keras.backend.backend()` at the end. I think I'm getting new errors now.
Using "fork":
Using "spawn":
The code I used, without the `keras.backend.backend()` call:
https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/distributed_training_with_torch.ipynb
It runs in a terminal but not in notebooks. Notebooks only support `"fork"` as the `start_method`, but `"fork"` is not supported by torch with CUDA. `"spawn"` is supported by torch but not by notebooks.
We will leave this issue open to see if there are new solutions coming up.
The current solution is just to download the notebook as a `.py` file and run it from the terminal.
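For completeness, the download-and-run workaround can also be done from the command line; this sketch assumes `jupyter`/`nbconvert` is installed and the notebook has been saved locally:

```shell
# Convert the notebook to a plain Python script (requires nbconvert)
jupyter nbconvert --to script distributed_training_with_torch.ipynb

# Run it from a regular terminal session, where "spawn" is available
python distributed_training_with_torch.py
```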
I'm running into the same issue, even in non-notebooks. Any updates so far @haifeng-jin?
As the title says, I'm having problems running the example code given here: Multi-GPU distributed training with PyTorch
I want to know if this error is only on my end or if it is reproducible. I have also tried setting the start method to "start", with no help.
Note: I'm kind of new to Keras and I'm trying to learn.