pytorch / torchrec

Pytorch domain library for recommendation systems
https://pytorch.org/torchrec/

NCCL error while instantiating DistributedModelParallel #328

Closed getchebarne closed 2 years ago

getchebarne commented 2 years ago

Hello,

I'm trying to train a TorchRec model on a single node with two NVIDIA A100 GPUs.

(torchrec) jupyter@ctr-model-gpu-a100-2:~/hft/ctr-model$ nvidia-smi
Fri May 13 15:04:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    58W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(torchrec) jupyter@ctr-model-gpu:~$ python -c "import torch; print(torch.version.cuda)"
11.3

I installed TorchRec and FBGEMM from source. My TorchRec version:

(torchrec) jupyter@ctr-model-gpu:~/hft/ctr-model$ pip freeze | grep torchrec
torchrec==0.1.0

Below is the script I'm trying to run. I've replaced the model with a very simple EmbeddingBagCollection (EBC) to rule out issues with my model's architecture.

import json
import os

import torch
import torchrec
import torch.distributed as dist
import torch.multiprocessing as mp
from sklearn.metrics import roc_auc_score
from torch.utils.data import DataLoader
from torchrec.distributed import DistributedModelParallel as DMP
from tqdm import tqdm

import datasets
import models
from config_parser import ConfigParser

def main(rank: int, config: ConfigParser) -> None:
    # initialize process group
    dist.init_process_group(                                   
        backend="nccl",                                         
        init_method="env://",
        world_size=2,
        rank=rank
    )    
    # model
    model = torchrec.EmbeddingBagCollection(
        device=torch.device("meta"),
        tables=[
            torchrec.EmbeddingBagConfig(
                name="product_table",
                embedding_dim=64,
                num_embeddings=4096,
                feature_names=["product"],
                pooling=torchrec.PoolingType.SUM,
            )
        ]
    )
    dmp_model = DMP(
        # module=config.init_object("model", models).to(torch.device("meta")),
        module=model,
        device=torch.device(f"cuda:{rank}")
    )

if __name__ == "__main__":
    # instantiate config parser
    config = ConfigParser()

    # distributed config
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    mp.spawn(main, nprocs=2, args=(config,))

When I try to run this code, I get the following error. I used export NCCL_DEBUG=INFO to capture NCCL's logs.

Traceback (most recent call last):
  File "main_dist.py", line 165, in <module>
    mp.spawn(main, nprocs=2, args=(config,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/jupyter/hft/ctr-model/main_dist.py", line 29, in main
    module=model,
UnboundLocalError: local variable 'model' referenced before assignment

(base) jupyter@ctr-model-gpu:~/hft/ctr-model$ python main_dist.py 
Log directory 'runs/test-torchrec-dist' already exists. Overwrite? [y / n] y
ctr-model-gpu:30732:30732 [0] NCCL INFO Bootstrap : Using ens6:10.138.0.12<0>
ctr-model-gpu:30732:30732 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ctr-model-gpu:30732:30732 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
ctr-model-gpu:30732:30732 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
ctr-model-gpu:30732:30732 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
ctr-model-gpu:30732:30732 [0] NCCL INFO NET/FastSocket : queue skip: 0
ctr-model-gpu:30732:30732 [0] NCCL INFO NET/FastSocket : Using [0]ens6:10.138.0.12<0>
ctr-model-gpu:30732:30732 [0] NCCL INFO NET/FastSocket plugin initialized
ctr-model-gpu:30732:30732 [0] NCCL INFO Using network FastSocket
NCCL version 2.10.3+cuda11.3
ctr-model-gpu:30733:30733 [0] NCCL INFO Bootstrap : Using ens6:10.138.0.12<0>
ctr-model-gpu:30733:30733 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ctr-model-gpu:30733:30733 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
ctr-model-gpu:30733:30733 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
ctr-model-gpu:30733:30733 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
ctr-model-gpu:30733:30733 [0] NCCL INFO NET/FastSocket : queue skip: 0
ctr-model-gpu:30733:30733 [0] NCCL INFO NET/FastSocket : Using [0]ens6:10.138.0.12<0>
ctr-model-gpu:30733:30733 [0] NCCL INFO NET/FastSocket plugin initialized
ctr-model-gpu:30733:30733 [0] NCCL INFO Using network FastSocket

ctr-model-gpu:30733:30847 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 40
ctr-model-gpu:30732:30846 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 40
ctr-model-gpu:30733:30847 [0] NCCL INFO init.cc:904 -> 5
ctr-model-gpu:30732:30846 [0] NCCL INFO init.cc:904 -> 5
ctr-model-gpu:30733:30847 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
ctr-model-gpu:30732:30846 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
Traceback (most recent call last):
  File "main_dist.py", line 166, in <module>
    mp.spawn(main, nprocs=2, args=(config,))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/jupyter/hft/ctr-model/main_dist.py", line 42, in main
    device=torch.device(f"cuda:{rank}")
  File "/opt/conda/lib/python3.7/site-packages/torchrec/distributed/model_parallel.py", line 187, in __init__
    plan = planner.collective_plan(module, sharders, pg)
  File "/opt/conda/lib/python3.7/site-packages/torchrec/distributed/planner/planners.py", line 188, in collective_plan
    sharders,
  File "/opt/conda/lib/python3.7/site-packages/torchrec/distributed/collective_utils.py", line 60, in invoke_on_rank_and_broadcast_result
    dist.broadcast_object_list(object_list, rank, group=pg)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1869, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755953518/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

Reading the NCCL logs, I noticed these two lines:

ctr-model-gpu:30733:30847 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 40
ctr-model-gpu:30732:30846 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 40

Could this be the cause of the issue? If so, how do I solve it? I ran a regular PyTorch script with DistributedDataParallel on the same machine and had no NCCL issues; that script ran fine.
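
For reference, here is a minimal check of the default-device behaviour (a hypothetical helper, not part of the script above): unless a process calls torch.cuda.set_device, its current CUDA device stays at 0, so both spawned ranks hand NCCL the same GPU.

import torch
import torch.multiprocessing as mp

def show_device(rank: int) -> None:
    # Without torch.cuda.set_device(rank), every spawned process reports
    # device 0 here, which is exactly what NCCL flags as a duplicate GPU.
    print(f"rank {rank}: current CUDA device = {torch.cuda.current_device()}")

if __name__ == "__main__":
    mp.spawn(show_device, nprocs=2)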

getchebarne commented 2 years ago

Update: setting torch.cuda.set_device(rank) before initializing the process group seems to fix this issue. Still, this is not needed with DistributedDataParallel.
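
A minimal sketch of the pattern that works for me (the ConfigParser and model-building details from the full script are elided):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main(rank: int, world_size: int) -> None:
    # Pin this process to its own GPU *before* creating the NCCL process group;
    # otherwise both ranks default to cuda:0 and NCCL reports "Duplicate GPU detected".
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )

    # ... build the EmbeddingBagCollection and wrap it in DistributedModelParallel
    # exactly as in the script above ...

    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    world_size = 2
    mp.spawn(main, nprocs=world_size, args=(world_size,))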

bigning commented 2 years ago

torch.cuda.set_device(rank) is needed before calling DistributedModelParallel. I think the same is true for DistributedDataParallel; please see the documentation here: https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#distributeddataparallel
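
For example, a minimal DDP sketch following that documentation (toy model, not the one from this issue):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def main(rank: int, world_size: int) -> None:
    # Same pattern: bind the process to its GPU before the NCCL process group is created.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl", init_method="env://", world_size=world_size, rank=rank
    )

    model = torch.nn.Linear(8, 1).to(rank)
    # device_ids tells DDP which single GPU this replica owns.
    ddp_model = DDP(model, device_ids=[rank])

    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    mp.spawn(main, nprocs=2, args=(2,))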

lucy9527 commented 1 year ago

> Update: setting torch.cuda.set_device(rank) before initializing the process group seems to fix this issue. Still, this is not needed with DistributedDataParallel.

This is effective, thanks!