huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Model Parallelism for Bert Models #10151

Closed: saichandrapandraju closed this issue 3 years ago

saichandrapandraju commented 3 years ago

Hi,

I'm trying to implement model parallelism for BERT models by splitting and assigning layers across GPUs, and I took DeBERTa as an example. For DeBERTa, I'm able to split the entire model into the 'embedding', 'encoder', 'pooler', 'classifier' and 'dropout' layers, as shown in the pic below.

(screenshot: the DeBERTa model split into its embeddings, encoder, pooler, classifier and dropout modules)

With this approach, I trained on the IMDB classification task by assigning the 'encoder' to the second GPU and everything else to the first GPU. By the end of training, the second GPU had consumed a lot more memory than the first, which amounted to roughly a 20-80 split of the model across the two GPUs.

So I tried splitting the encoder layers as well, as shown below, but I'm getting this error: "TypeError: forward() takes 1 positional argument but 2 were given"

embed = dberta.deberta.embeddings.to('cuda:0')
f6e = dberta.deberta.encoder.layer[:6].to('cuda:0')
l6e = dberta.deberta.encoder.layer[6:].to('cuda:1')
pooler = dberta.pooler.to('cuda:0')
classifier = dberta.classifier.to('cuda:0')
dropout = dberta.dropout.to('cuda:0')

test = "this is to test deberta"
inp_ids = tok_dberta(test, return_tensors='pt').input_ids
att_mask = tok_dberta(test, return_tensors='pt').attention_mask

emb_out = embed(inp_ids.to('cuda:0'))
first_6_enc_lay_out = f6e(emb_out)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-379d948e5ba5> in <module>
----> 1 first_6_enc_lay_out = f6e(emb_out)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

TypeError: forward() takes 1 positional argument but 2 were given

Please suggest how to proceed further.
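For context on the error: encoder.layer[:6] is an nn.ModuleList, i.e. a plain container of blocks with no forward of its own, so it most likely cannot be called on the embedding output directly; the blocks have to be applied one by one, moving the hidden states to whichever device the next block lives on. A minimal sketch of that control flow, using toy layers as stand-ins for the real DeBERTa blocks (which also take at least an attention mask, so this is not the exact DeBERTa call signature):

import torch
import torch.nn as nn

# Toy stand-in for a stack of transformer blocks split across two GPUs.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(12)])
first_half = layers[:6].to('cuda:0')    # slicing still returns an nn.ModuleList
second_half = layers[6:].to('cuda:1')

hidden = torch.randn(1, 16, device='cuda:0')

# first_half(hidden) would fail: a ModuleList is not callable like a module.
for block in first_half:
    hidden = block(hidden)

hidden = hidden.to('cuda:1')            # move activations to the second GPU
for block in second_half:
    hidden = block(hidden)

hidden = hidden.to('cuda:0')            # back to the first GPU for pooler/classifier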

stas00 commented 3 years ago

We already have naive vertical MP implemented in t5 and gpt, and there is a much easier version of Bart MP - but it's not merged (https://github.com/huggingface/transformers/pull/9384).

The problem with naive MP is that it's very inefficient. That's why at the moment the rest of transformers isn't being ported.

Until then try HF Trainer DeepSpeed integration: https://huggingface.co/blog/zero-deepspeed-fairscale

Pipeline is the next in line, but it's very complicated.

Naive vertical MP is Pipeline with chunks=1.

See my work in progress notes on Parallelism: https://github.com/huggingface/transformers/issues/9766
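To illustrate the "chunks" idea on a toy two-stage split (a sketch of the scheduling concept only, not an actual pipeline implementation):

import torch
import torch.nn as nn

# Two stages of a vertically split toy model, one per GPU.
stage1 = nn.Sequential(*[nn.Linear(16, 16) for _ in range(6)]).to('cuda:0')
stage2 = nn.Sequential(*[nn.Linear(16, 16) for _ in range(6)]).to('cuda:1')

batch = torch.randn(32, 16, device='cuda:0')

# chunks=1 (naive vertical MP): cuda:1 is idle until cuda:0 has finished the whole batch.
out = stage2(stage1(batch).to('cuda:1'))

# chunks=4 (pipeline idea): the batch is split into micro-batches so that, with a proper
# scheduler, cuda:0 can already work on micro-batch i+1 while cuda:1 works on micro-batch i.
outs = [stage2(stage1(mb).to('cuda:1')) for mb in batch.chunk(4)]
out = torch.cat(outs)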

saichandrapandraju commented 3 years ago

Thanks @stas00 for sharing your work. I'll try the DeepSpeed integration with HF Trainer.

saichandrapandraju commented 3 years ago

Hi @stas00 ,

As mentioned above, I installed deepspeed and used the HF Trainer to train instead of native pytorch. Without DeepSpeed I'm able to complete the training, but with DeepSpeed the execution gets stuck at - [2021-02-17 15:05:24,441] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl

The complete log is -

[2021-02-17 15:05:06,621] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-02-17 15:05:06,736] [INFO] [runner.py:355:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 ./Deepspeed.py --output_dir test1 --overwrite_output_dir --do_train --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --learning_rate 3e-5 --weight_decay 0.01 --num_train_epochs 1 --load_best_model_at_end --deepspeed ds_config.json
[2021-02-17 15:05:08,344] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0]}
[2021-02-17 15:05:08,344] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-02-17 15:05:08,345] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-02-17 15:05:08,345] [INFO] [launch.py:100:main] dist_world_size=1
[2021-02-17 15:05:08,345] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0
2021-02-17 15:05:10.792753: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at /home/jovyan/models/roberta-large/ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /home/jovyan/models/roberta-large/ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
loaded df
Encoding done
parser created
[2021-02-17 15:05:24,441] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl

The command I'm passing is -

!deepspeed ./Deepspeed.py --output_dir test1 --overwrite_output_dir --do_train \
--per_device_train_batch_size 8 --per_device_eval_batch_size 8 --learning_rate 3e-5 --weight_decay 0.01 --num_train_epochs 1 \
--load_best_model_at_end --deepspeed ds_config.json
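The ds_config.json itself isn't shown here. Purely as an illustration (this is not the actual file used in this run), a minimal ZeRO stage-2 style config along the lines of the blog post linked above could be generated like this:

import json

# Hypothetical minimal config - check the DeepSpeed / HF Trainer docs for the keys
# required by your versions; values should stay consistent with the Trainer args.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": 8,   # matches --per_device_train_batch_size 8
    "gradient_accumulation_steps": 1,
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)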

Here's my simple script -

from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments, HfArgumentParser
import pandas as pd
import numpy as np
import torch
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

tok = RobertaTokenizerFast.from_pretrained('/home/jovyan/models/roberta-large/')
model = RobertaForSequenceClassification.from_pretrained('/home/jovyan/models/roberta-large/', num_labels=2)

df_full = pd.read_csv('IMDB_Dataset.csv')
print("loaded df")
df_full = df_full.sample(frac=1).reset_index(drop=True)
df_req =  df_full.head(1000)
df_train = df_req.head(800)
df_eval = df_req.tail(200)

train_text, train_labels_raw, val_text, val_labels_raw = df_train.review.values.tolist(), df_train.sentiment.values.tolist(), df_eval.review.values.tolist(), df_eval.sentiment.values.tolist()

train_encodings = tok(train_text, padding=True, truncation=True, max_length=512)
val_encodings = tok(val_text, padding=True, truncation=True, max_length=512)
train_labels = [1 if i=='positive' else 0 for i in train_labels_raw]
val_labels = [1 if i=='positive' else 0 for i in val_labels_raw]

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
print("Encoding done")

parser = HfArgumentParser(TrainingArguments)
print('parser created')
train_args = parser.parse_args_into_dataclasses()

print('got training')
print(train_args[0])

trainer = Trainer(
             model=model,
             args=train_args[0],
             train_dataset=train_dataset,
             eval_dataset=val_dataset
             )

print('------------TRAINING-------------')
trainer.train()

Please let me know if I missed anything.

stas00 commented 3 years ago

This looks like a pytorch distributed issue. Can you launch your script as follows?

python -m torch.distributed.launch --nproc_per_node=1 ./Deepspeed.py --output_dir test1 --overwrite_output_dir --do_train \
--per_device_train_batch_size 8 --per_device_eval_batch_size 8 --learning_rate 3e-5 --weight_decay 0.01 --num_train_epochs 1 \
--load_best_model_at_end

DeepSpeed requires a distributed env even with one gpu. So in this experiment we remove DeepSpeed completely but launch a similar distributed environment for a single process.

What's the output of python -m torch.utils.collect_env on that system? Are you running a recent pytorch version? I'm noticing that I have a different distributed.py, since the logger reports a different line number on my side:

[2021-02-17 09:36:01,176] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl

Also, I notice you're trying to run it from a notebook, which could be related as well. Any reason why you're not using a normal console? Are you on colab or some restricted environment?

Though I checked that I can launch deepspeed just fine from a notebook, via !deepspeed or a %%bash cell.

Alternatively, you can run the Trainer natively from the notebook, i.e. with no script at all, using this: https://huggingface.co/transformers/master/main_classes/trainer.html#deployment-in-notebooks
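For reference, that doc section essentially amounts to emulating the launcher by hand: setting the usual torch.distributed environment variables in the notebook before TrainingArguments/Trainer are created, roughly along these lines (port and values are just an example for a single process on one gpu):

import os

# Emulate what torch.distributed.launch / the deepspeed launcher would set up
# for a single process; must run before TrainingArguments is instantiated.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"   # any free port
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"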

But let's see if we can resolve the distributed hanging by first ensuring you are on a recent pytorch. I see bug reports for this in older pytorch versions (from 2018-2019).

saichandrapandraju commented 3 years ago

Hi @stas00 , thanks for getting back to me. Here are the results of the above experiments -

  1. !python -m torch.distributed.launch --nproc_per_node=1 ./Deepspeed.py --output_dir test1 --overwrite_output_dir --do_train \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --learning_rate 3e-5 --weight_decay 0.01 --num_train_epochs 1 \
    --load_best_model_at_end

    with the above command, the execution hung and below is the output -

    2021-02-18 01:29:23.513697: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
    Some weights of the model checkpoint at /home/jovyan/models/roberta-large/ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
    - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /home/jovyan/models/roberta-large/ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    loaded df
    Encoding done
    parser created
  2. I'm using transformers-4.3.0 and below is the detailed output for !python -m torch.utils.collect_env -

    
    Collecting environment information...
    PyTorch version: 1.7.1
    Is debug build: False
    CUDA used to build PyTorch: 10.2
    ROCM used to build PyTorch: N/A

    OS: Ubuntu 18.04.5 LTS (x86_64)
    GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    Clang version: Could not collect
    CMake version: Could not collect

    Python version: 3.6 (64-bit runtime)
    Is CUDA available: True
    CUDA runtime version: 10.1.243
    GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
    Nvidia driver version: 450.51.06
    cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
    HIP runtime version: N/A
    MIOpen runtime version: N/A

    Versions of relevant libraries:
    [pip3] kubeflow-pytorchjob==0.1.3
    [pip3] numpy==1.18.5
    [pip3] torch==1.7.1
    [pip3] torchvision==0.8.2
    [conda] Could not collect


  3. I am using kubeflow notebook servers provided by my company, which is why I'm running the commands in the notebook itself.

  4. I tried setting the env variables as mentioned in https://huggingface.co/transformers/master/main_classes/trainer.html#deployment-in-notebooks and the execution hung at the cell below:

![image](https://user-images.githubusercontent.com/41769919/108292042-58f0da00-71b9-11eb-84be-fa9718909a66.png)

stas00 commented 3 years ago

Thank you for your detailed answers, @saichandrapandraju

It feels like your environment can't run pytorch distributed. Here is a very simple test to check that the launcher + dist init works:

%%bash
echo 'import os, torch; print(os.environ["LOCAL_RANK"]); torch.distributed.init_process_group("nccl")' >  test.py
python -m torch.distributed.launch --nproc_per_node=1 test.py

You can copy-and-paste it as-is into a new cell, including the bash magic, and then run it.

It should print 0 and not fail.

And if it fails, perhaps try a different backend instead of nccl - what happens with gloo? But I don't think it'd do much good even if gloo works, as gloo doesn't support the same ops as nccl: https://pytorch.org/docs/stable/distributed.html#backends

If this test fails let me know and I will ask if Deepspeed can support any other way. Normally distributed isn't needed for 1 gpu, but since the cpu acts as a sort of another gpu, they use the distributed environment to communicate between the two units.
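For reference, the gloo variant of the same probe just swaps the backend string:

%%bash
echo 'import os, torch; print(os.environ["LOCAL_RANK"]); torch.distributed.init_process_group("gloo")' >  test.py
python -m torch.distributed.launch --nproc_per_node=1 test.py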

stas00 commented 3 years ago

This looks like a potential thread to explore for the hanging " Initializing torch distributed with backend: nccl ":

https://discuss.pytorch.org/t/unexpected-hang-up-when-using-distributeddataparallel-on-two-machines/92262

See if you have any luck identifying the problem with the suggestions in that thread.

saichandrapandraju commented 3 years ago

Hi @stas00 ,

With the below command it hung again -

%%bash
echo 'import os, torch; print(os.environ["LOCAL_RANK"]); torch.distributed.init_process_group("nccl")' >  test.py
python -m torch.distributed.launch --nproc_per_node=1 test.py

But it returned 0 with gloo.

Same after trying the suggestions in https://discuss.pytorch.org/t/unexpected-hang-up-when-using-distributeddataparallel-on-two-machines/92262

The below versions are different. Is that fine?

CUDA runtime version: 10.1.243
CUDA used to build PyTorch: 10.2
stas00 commented 3 years ago

So this is a pure pytorch issue, you may want to file an Issue with pytorch: https://github.com/pytorch/pytorch/issues

If you can't launch distributed then DeepSpeed won't work for you.

Also I'd try pytorch-nightly - I read in one thread that they have been tweaking this functionality since the last release: https://pytorch.org/get-started/locally/ - you should be able to install that locally.

Below versions are different. Is it fine?

CUDA runtime version: 10.1.243
CUDA used to build PyTorch: 10.2

Shouldn't be a problem. Pytorch comes with its own toolkit.

This system-wide entry matters when building pytorch CPP extensions (which, incidentally, DeepSpeed is). There you ideally want the same version for both, but sometimes a minor version difference is not a problem.
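A quick way to see which CUDA a given pytorch build was compiled against (standard torch attributes, shown just for convenience; the system-wide toolkit version comes from nvcc --version):

import torch

print(torch.__version__)         # e.g. 1.7.1
print(torch.version.cuda)        # CUDA version pytorch was built with, e.g. 10.2
print(torch.cuda.is_available())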

saichandrapandraju commented 3 years ago

Thanks @stas00 ,

Raised an issue at https://github.com/pytorch/pytorch/issues/52433 and a forum thread at https://discuss.pytorch.org/t/hanging-torch-distributed-init-process-group/112223

I'm also thinking of the nightly build. Will give it a try...

saichandrapandraju commented 3 years ago

Once this is sorted out, I hope HF Trainer and DeepSpeed will work in both single- and multi-GPU settings.

stas00 commented 3 years ago

It'd help to augment your pytorch Issue with the information they request - at the very least the output of python -m torch.utils.collect_env - and to mention that you're running from a notebook and in a kubeflow container. As you presented it now, they won't know what to do with it, since such code works just fine on a normal setup.

saichandrapandraju commented 3 years ago

Thanks @stas00 ,

I installed 1.7.1+cu101 and the below returned 0 -

%%bash
echo 'import os, torch; print(os.environ["LOCAL_RANK"]); torch.distributed.init_process_group("nccl")' >  test.py
python -m torch.distributed.launch --nproc_per_node=1 test.py

But it hung again with the script, and below are the logs -

2021-02-18 19:00:28.946359: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at /home/jovyan/models/roberta-large/ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /home/jovyan/models/roberta-large/ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
loaded df
Encoding done
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:13993:13993 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

fastai-c2-0:13993:13993 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:13993:13993 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:13993:13993 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1

I also tried with the nightly build (1.9.0.dev20210218+cu101) and got 0 for that bash command, but now it hangs at trainer.train() and below are the logs -

2021-02-18 19:28:13.170701: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at /home/jovyan/models/roberta-large/ were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /home/jovyan/models/roberta-large/ and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
loaded df
Encoding done
parser and args created
------------TRAINING-------------
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:14431:14431 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

fastai-c2-0:14431:14431 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0
fastai-c2-0:14431:14431 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:14431:14431 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1

I used the same script for both -

from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments, HfArgumentParser
import pandas as pd
import numpy as np
import torch
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['NCCL_DEBUG']='INFO'
os.environ['NCCL_DEBUG_SUBSYS']='ALL'
os.environ['NCCL_IB_DISABLE']='1'
os.environ['NCCL_SOCKET_IFNAME']='eth0'

tok = RobertaTokenizerFast.from_pretrained('/home/jovyan/models/roberta-large/')
model = RobertaForSequenceClassification.from_pretrained('/home/jovyan/models/roberta-large/', num_labels=2)

df_full = pd.read_csv('IMDB_Dataset.csv')
print("loaded df")
df_full = df_full.sample(frac=1).reset_index(drop=True)
df_req =  df_full.head(1000)
df_train = df_req.head(800)
df_eval = df_req.tail(200)
train_text, train_labels_raw, val_text, val_labels_raw = df_train.review.values.tolist(), df_train.sentiment.values.tolist(), df_eval.review.values.tolist(), df_eval.sentiment.values.tolist()

train_encodings = tok(train_text, padding=True, truncation=True, max_length=512)
val_encodings = tok(val_text, padding=True, truncation=True, max_length=512)
train_labels = [1 if i=='positive' else 0 for i in train_labels_raw]
val_labels = [1 if i=='positive' else 0 for i in val_labels_raw]

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)

print("Encoding done")

parser = HfArgumentParser(TrainingArguments)
train_args = parser.parse_args_into_dataclasses()
print('parser and args created')

trainer = Trainer(
             model=model,
             args=train_args[0],
             train_dataset=train_dataset,
             eval_dataset=val_dataset
             )
if train_args[0].do_train:
    print('------------TRAINING-------------')
    trainer.train() 
if train_args[0].do_eval:
    print('------------EVALUATING-------------')
    trainer.evaluate()

I updated the same in the pytorch issue and forum as well... just wanted to let you know about the progress.

stas00 commented 3 years ago

I installed 1.7.1+cu101 and below returned 0

%%bash
echo 'import os, torch; print(os.environ["LOCAL_RANK"]); torch.distributed.init_process_group("nccl")' >  test.py
python -m torch.distributed.launch --nproc_per_node=1 test.py

That's a good step forward, I'm glad it worked. From what I understand the system-wide cuda shouldn't have an impact on whether distributed works or not, but clearly in your case it did.

How can I reproduce your setup? I don't know where you got your dataset from. As suggested earlier, if you want to save my time, please set up a public google colab notebook (free) so that I and others can easily look at the situation without needing to figure out how to set up our own.

saichandrapandraju commented 3 years ago

Hi @stas00 ,

Here is the colab version of my script. Locally I used the IMDB dataset from kaggle, but in colab I pointed to a downloadable and extractable version. Also, I included the torch and transformers versions that I'm using.

stas00 commented 3 years ago

Thank you, but have you tried running it? It fails in many cells. Perhaps I wasn't clear, but the idea was to give us a working notebook so it's easier to spend the time trying to understand the problem, rather than trying to figure out how to make it run - does that make sense?

stas00 commented 3 years ago

Hmm, you're running on a system with multiple gpus, correct? In one thread I found that if a vm is used, NVLink may not work unless properly configured, and that person solved the problem with:

export NCCL_P2P_DISABLE=1

which disables NVLink between the 2 cards and switches to the slower PCIe bridge connection.

Could you try and check that this is not your case?

saichandrapandraju commented 3 years ago

So sorry about that. But in colab everything works just fine with the same library versions that I'm using. Here is the updated one, along with outputs.

I have 3 VMs, one with 2 GPUs and the rest with a single GPU each. Currently I'm trying this on one of the single-GPU VMs, and if everything works we'll replicate it on the 2-GPU VM or combine all 4 V100-32GB GPUs for bigger models. That is the high-level roadmap.

  1. With deepspeed: I tried the exact colab that I shared, in my notebook server, and it is hanging here: (screenshot of the hanging cell)

  2. Normal torch.distributed: same with the script using torch.distributed.launch - it also hangs at trainer.train() with the below log -
parser and args created
fastai-c2-0:22177:22177 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:22177:22177 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

fastai-c2-0:22177:22177 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
fastai-c2-0:22177:22177 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.2.134<0>
fastai-c2-0:22177:22177 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1

Same with export NCCL_P2P_DISABLE=1.

But now it's not hanging at 'Initializing torch distributed with backend: nccl' anymore: (screenshot)

saichandrapandraju commented 3 years ago

Could there be some potential configuration issue on my side? I think everything should work with 1 GPU, though - correct me if I'm wrong.

saichandrapandraju commented 3 years ago

Hi @stas00 ,

It's working with NCCL_SOCKET_IFNAME=lo from this thread.

Both of the below are working now -

!NCCL_SOCKET_IFNAME=lo python -m torch.distributed.launch --nproc_per_node=1 ./Seq2Seq.py --output_dir ./out_dir/results --overwrite_output_dir --do_train \
--do_eval --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --learning_rate 3e-5 --weight_decay 0.01 \
--num_train_epochs 1 --load_best_model_at_end --local_rank 0

and

!NCCL_SOCKET_IFNAME=lo deepspeed ./Seq2Seq.py --output_dir ./out_dir/results --overwrite_output_dir --do_train \
--do_eval --per_device_train_batch_size 12 --per_device_eval_batch_size 12 --learning_rate 3e-5 --weight_decay 0.01 \
--num_train_epochs 1 --load_best_model_at_end --local_rank 0 --deepspeed ds_config.json

I'm not sure exactly what it's doing internally. I will check other scenarios like multi-GPU and let you know...
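Equivalently, the interface can be pinned from inside the script, mirroring the NCCL_* variables already set at the top of the script above, as long as it happens before torch.distributed / DeepSpeed initializes NCCL:

import os

# Must be set before any distributed/NCCL initialization.
os.environ["NCCL_SOCKET_IFNAME"] = "lo"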

stas00 commented 3 years ago

Yay, so glad to hear you found a solution, @saichandrapandraju!

Thank you for updating the notebook too!

If the issue has been fully resolved for you please don't hesitate to close this Issue.

If some new problem occurs, please open a new dedicated issue. Thank you.

saichandrapandraju commented 3 years ago

Tested DeepSpeed on multi-GPU as well and it worked !!

By setting NCCL_SOCKET_IFNAME=lo, everything worked as expected.

Thanks a lot @stas00