huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

int8 quantization doesn't work with accelerate on multi-GPUs #875

Closed · giulio98 closed this issue 1 year ago

giulio98 commented 1 year ago

System Info

python 3.8
pytorch 1.12
openmpi 4.1.0
cuda 11.3
cudnn8
ubuntu 20.04
accelerate==0.14.0
transformers==4.24.0
bitsandbytes==0.35.4

1 node with 4xT4 GPUs

Reproduction

import os
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from torch.utils.data.dataset import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = 'facebook/opt-1.3b'

accelerator = Accelerator()
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
input_list = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you"
]
class CustomDataset(Dataset):

    def __init__(self, txt_list, tokenizer):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer(txt, padding=True)

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

tokenized_dataset = CustomDataset(input_list, tokenizer)

dataloader = DataLoader(tokenized_dataset, batch_size=1)
model, dataloader = accelerator.prepare(model, dataloader)
for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        output = accelerator.unwrap_model(model).generate(batch[0], min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))

Expected behavior

accelerator.unwrap_model(model).generate(...) should work fine; instead it fails with the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

Full backtrace:

/bin/bash: /azureml-envs/pytorch-1.12/lib/libtinfo.so.6: no version information available (required by /bin/bash)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Plugin Path : /usr/local/nccl-rdma-sharp-plugins/lib/libnccl-net.so
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO P2P plugin IBext
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/IB : No device found.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 0(=402400000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 3(=c45a00000) and dev 2(=71a000000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 1(=6f9100000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 2(=71a000000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 00/02 : 0 1 2 3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 01/02 : 0 1 2 3
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Setting affinity for GPU 0 to ffff
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 3(=c45a00000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 00 : 0[402400000] -> 1[6f9100000] via direct shared memory
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Channel 01 : 0[402400000] -> 1[6f9100000] via direct shared memory
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Connected all rings
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Could not enable P2P between dev 0(=402400000) and dev 1(=6f9100000)
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO Connected all trees
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3e8feffd8a7f4157a9116efab3d0ff63000001:78:378 [0] NCCL INFO comm 0x7fd570002fb0 rank 0 nranks 4 cudaDev 0 busId 402400000 - Init COMPLETE
3e8feffd8a7f4157a9116efab3d0ff63000001:78:78 [0] NCCL INFO Launch mode Parallel

0it [00:00, ?it/s]
0it [00:02, ?it/s]
Traceback (most recent call last):
  File "test_8bit.py", line 49, in <module>
    output = accelerator.unwrap_model(model).generate(batch[0], min_length=30, max_length=30, do_sample=True)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/generation_utils.py", line 1543, in generate
    return self.sample(
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/generation_utils.py", line 2482, in sample
    outputs = self(
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 929, in forward
    outputs = self.model.decoder(
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 321, in forward
    hidden_states = self.self_attn_layer_norm(hidden_states)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/azureml-envs/pytorch-1.12/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

giulio98 commented 1 year ago

The script works fine with a single T4 GPU; the error only occurs with multiple GPUs.

sgugger commented 1 year ago

The problem is that you are sending your model to Accelerator.prepare, which puts it on GPU 0 and destroys the work done by device_map="auto". If you don't send the model to this method, it will work fine (you will also be able to remove the unwrap).
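
(For reference, a minimal single-process sketch of this suggestion: load the sharded int8 model, never pass it to prepare, and feed the inputs to the GPU that holds the first shard. The prompt, variable names, and the assumption that the first shard lands on cuda:0 are mine, not from the original script.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
# device_map="auto" spreads the int8 weights over all visible GPUs; do not prepare() this model
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)

inputs = tokenizer("Hello my name is", return_tensors="pt").to("cuda:0")  # first shard assumed on cuda:0
with torch.no_grad():
    output = model.generate(inputs["input_ids"], min_length=30, max_length=30, do_sample=True)
print(tokenizer.decode(output[0].tolist()))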

giulio98 commented 1 year ago

Thanks for the response. I would like to use Accelerator.prepare to split the dataset across all the available GPUs. What you are suggesting works only for the first 4 sentences on 4 GPUs; after that the execution hangs.

Reproduction

import os
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from torch.utils.data.dataset import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint = 'facebook/opt-1.3b'

accelerator = Accelerator()
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
input_list = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "Hello world",
    "Hello my name is",
    "Happy to see you",
    "This sentence will not run",
    "This sentence will not run",
    "This sentence will not run",
    "This sentence will not run"
]
class CustomDataset(Dataset):

    def __init__(self, txt_list, tokenizer):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer(txt, padding=True)

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

tokenized_dataset = CustomDataset(input_list, tokenizer)

dataloader = DataLoader(tokenized_dataset, batch_size=1)
dataloader = accelerator.prepare(dataloader)
for step, batch in tqdm(enumerate(dataloader)):
    with torch.no_grad():
        output = model.generate(batch[0], min_length=30, max_length=30, do_sample=True)
        print(tokenizer.decode(output[0].tolist()))

sgugger commented 1 year ago

You can't use data parallelism with device_map="auto": the model expects its inputs on GPU 0, and each forward pass is then computed on GPU 0, then 1, 2, 3 in sequence.
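
(To make that concrete: when a device_map is used, transformers records the layer placement in model.hf_device_map, and the first entry is where the input ids have to start. hf_device_map is a real attribute; the surrounding snippet is only an illustration and assumes a single process with the checkpoint from this issue.)

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", device_map="auto", load_in_8bit=True
)

# dict mapping submodule names to GPU indices: the embeddings and first decoder
# layers typically on GPU 0, later layers on GPUs 1-3
print(model.hf_device_map)

# inputs must live on the device of the first submodule; each forward pass then
# hops 0 -> 1 -> 2 -> 3, so the GPUs run sequentially, not in parallel
first_device = next(iter(model.hf_device_map.values()))
print(first_device)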

giulio98 commented 1 year ago

Hello, this behavior is indeed quite strange: if the script above works for the first batch, I don't see why it shouldn't work for the second batch. In any case, is it possible to find somewhere a list of the libraries supported by Accelerate and the ones that are not? For the moment it is not very clear how to use this library with int8 quantization and deepspeed_for_inference.

pacman100 commented 1 year ago

Hello @giulio98, https://github.com/huggingface/accelerate#supported-integrations has the list of all the integrations supported by Accelerate. For more details and guidance on how to use these, please refer to the How-To Guides in the docs: https://huggingface.co/docs/accelerate/index

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Pacific-wide commented 1 year ago

Is there any solution for using data parallelism with an int8-quantized model?
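
(Not confirmed in this thread, but a commonly used pattern when the int8 model fits on a single GPU is to give each data-parallel process its own full copy by pointing device_map at that process's GPU index, and to prepare only the dataloader. A rough sketch under that assumption; the prompts and dataset construction are placeholders.)

import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()
checkpoint = "facebook/opt-1.3b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
# one full int8 copy per process, placed entirely on that process's GPU
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    load_in_8bit=True,
    device_map={"": accelerator.process_index},
)

texts = ["Hello world", "Hello my name is", "Happy to see you", "Hugging Face"]
dataset = [tokenizer(t, return_tensors="pt").input_ids.squeeze(0) for t in texts]
# only the dataloader is sharded across processes
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=1))

for batch in dataloader:
    with torch.no_grad():
        output = model.generate(batch.to(accelerator.device), min_length=30, max_length=30, do_sample=True)
    print(tokenizer.decode(output[0].tolist()))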

giulio98 commented 6 months ago

Hi, I'm reopening this issue to inquire whether it's currently feasible to perform inference across multiple GPUs (by distributing the weights on multiple GPUs) while employing data parallelism. Specifically, is it viable to utilize PyTorch's Fully Sharded Data Parallel (FSDP) for this purpose?

muellerzr commented 6 months ago

@giulio98 we're working on that, stay tuned :) https://github.com/huggingface/accelerate/pull/2345

(via fsdp, the answer is still the same)