tunib-ai / parallelformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
https://tunib-ai.github.io/parallelformers
Apache License 2.0

RuntimeError: Cannot re-initialize CUDA in forked subprocess #32

Closed cabal-daniel closed 2 years ago

cabal-daniel commented 2 years ago

How to reproduce

I'm getting the following error while trying to run the example from the getting-started documentation:

Process ParallelProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
    engine = ParallelEngine(
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
    self.mp_group = self.create_process_group(backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 106, in create_process_group
    torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ParallelProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
    engine = ParallelEngine(
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
    self.mp_group = self.create_process_group(backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
    dist.init_process_group(backend=backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
    worker_count = store.add(store_key, 0)
RuntimeError: Connection reset by peer

This is my code. I'm running it on an AWS g5.12xlarge instance with 4 GPUs:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

from parallelformers import parallelize

parallelize(model, num_gpus=2, fp16=True, verbose='detail')

inputs = tokenizer("Parallelformers is", return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)

print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
|  0%   29C    P8    19W / 300W |      2MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
|  0%   29C    P8    16W / 300W |      2MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         On   | 00000000:00:1D.0 Off |                    0 |
|  0%   29C    P8    16W / 300W |      2MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   30C    P8    15W / 300W |      2MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I pip-installed multiprocess (https://pypi.org/project/multiprocess/) because initially import multiprocess as mp kept failing with "multiprocess not found". Then I noticed there was a PR by @Oaklight that removed torch.multiprocessing. Maybe I'm not using the right multiprocessing library? Reverting back to torch.multiprocessing caused the same error that @Oaklight noticed.
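
For reference, the difference between the two modules as I understand it (a minimal sketch, not from the parallelformers code): multiprocess is a third-party fork that has to be pip-installed, while multiprocessing ships with Python.

# Minimal sketch: fall back to the standard library if the third-party
# fork is not installed. Both expose essentially the same API.
try:
    import multiprocess as mp  # third-party: https://pypi.org/project/multiprocess/
except ImportError:
    import multiprocessing as mp  # standard library, always available

print("using:", mp.__name__, "| start method:", mp.get_start_method(allow_none=True))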

Environment

hyunwoongko commented 2 years ago

Hmm, it's weird; the default value of init_method is already spawn. https://github.com/tunib-ai/parallelformers/blob/main/parallelformers/parallelize.py#L81
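
For reference, this is roughly what the 'spawn' start method from the error message looks like with plain Python multiprocessing (a minimal sketch assuming at least two CUDA devices; not the parallelformers code itself):

import multiprocessing as mp

import torch

def worker(rank):
    # Safe: the child was spawned, not forked, so it does not inherit a
    # half-initialized CUDA context from the parent process.
    torch.cuda.set_device(rank)
    print(f"rank {rank} is using", torch.cuda.get_device_name(rank))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # force 'spawn' instead of Linux's default 'fork'
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()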

Which GPU are you using?

hyunwoongko commented 2 years ago

And you don't need to install multiprocessing, because it's a built-in Python module. How about removing that package?

cabal-daniel commented 2 years ago

I'm using NVIDIA A10Gs; I posted my nvidia-smi output above.

I guess I don't have the default multiprocessing in my Python, but it's Python 3.9.5.

Also, I got it back to work by reverting @Oaklight's code and just adding the __name__ == '__main__' guard, like so:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

from parallelformers import parallelize

if __name__ == '__main__':
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')

    inputs = tokenizer("Parallelformers is", return_tensors="pt")

    outputs = model.generate(
        **inputs,
        num_beams=5,
        no_repeat_ngram_size=4,
        max_length=15,
    )

    print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
hyunwoongko commented 2 years ago

torch.multiprocessing is almost the same as Python's multiprocessing. How about trying the latest version of the library with __main__?
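
For context, torch.multiprocessing is a drop-in wrapper around the standard-library module that mainly adds support for sharing tensors between processes; a minimal sketch:

import torch.multiprocessing as mp

# Same Process/Queue API as the standard library; torch only registers
# extra reducers so tensors can be shared between processes.
if __name__ == "__main__":
    p = mp.Process(target=print, args=("same API as multiprocessing.Process",))
    p.start()
    p.join()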

hyunwoongko commented 2 years ago

Ah, @Oaklight changed multiprocessing to multiprocess, which is a different library. I didn't catch that. Can you try multiprocess with __main__?

cabal-daniel commented 2 years ago

The multiprocess library isn't found, and I get the error above. Sorry, I had originally typed multiprocessing, but it was the multiprocess library that was not found.

I will try running with the native multiprocessing library and the __main__ guard in a minute.

hyunwoongko commented 2 years ago

How about installing the package using pip install multiprocess and trying with __main__?

cabal-daniel commented 2 years ago

That's what I initially did, and it failed with the error I listed. I will try with import multiprocessing, just as shown in the main branch, once my inference job finishes.

hyunwoongko commented 2 years ago

I just updated the library to 1.2.6. Please try it. Thanks!

cabal-daniel commented 2 years ago

OK, the multiprocess library is no good:

(.venv) ubuntu@ip-172-31-7-113:~$ python opt_multi.py 
Process ParallelProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
    engine = ParallelEngine(
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
    self.mp_group = self.create_process_group(backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 106, in create_process_group
    torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ParallelProcess-3:
Process ParallelProcess-4:
Process ParallelProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
    engine = ParallelEngine(
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
    self.mp_group = self.create_process_group(backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
    dist.init_process_group(backend=backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
    worker_count = store.add(store_key, 0)
RuntimeError: Connection reset by peer
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
    engine = ParallelEngine(
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
    engine = ParallelEngine(
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
    self.mp_group = self.create_process_group(backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
    self.mp_group = self.create_process_group(backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
    dist.init_process_group(backend=backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
    dist.init_process_group(backend=backend)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
    worker_count = store.add(store_key, 0)
  File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
    worker_count = store.add(store_key, 0)
RuntimeError: Connection reset by peer
RuntimeError: Connection reset by peer

Now if I do

# import torch.multiprocessing as mp
# import multiprocess as mp
import multiprocessing as mp

and put the __main__ guard in place, it works fine.

Suggestion: replace import multiprocess as mp with import multiprocessing as mp, and in the instructions tell users to wrap the parallelize call in an if __name__ == '__main__': guard.
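
For what it's worth, a minimal sketch (my own illustration, not from the docs) of why the guard matters with spawn-style multiprocessing: each child re-imports the script, so unguarded top-level code runs again in every process.

import multiprocessing as mp

# Under 'spawn', every child process re-imports this module from scratch,
# so this top-level line prints once in the parent and once per child.
print("module imported by:", mp.current_process().name)

if __name__ == "__main__":
    # The guard keeps process creation (and heavy work such as model
    # loading) out of the children's re-import of the module.
    mp.set_start_method("spawn", force=True)
    p = mp.Process(target=print, args=("hello from the child",))
    p.start()
    p.join()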

hyunwoongko commented 2 years ago

Thanks for reporting. I'll apply your suggestion!

cabal-daniel commented 2 years ago

Awesome, thanks for your help. This is an amazing library!