huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

modeling_t5 incompatible with multiprocessing #30280

Open · rangehow opened this issue 2 months ago

rangehow commented 2 months ago

System Info

Who can help?

Hi @ArthurZucker and @younesbelkada. I'm trying to automatically split a dataset across multiple GPUs (a bit like data parallelism) for inference. But strange things happen when using a T5 model from HF, while other models (e.g. BART) work correctly, so I suspect there is a problem in the T5 implementation. Would you mind helping to check it out? :)

Although it has been mentioned online that the error below may be related to OOM, I am certain that is not the case here. With the code below, only rank 0 produces normal output; every other rank raises the following error.

Traceback (most recent call last):
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/ruanjh/NiuInference/NiuInference.py", line 97, in get_pred
    output = model.generate(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/generation/utils.py", line 1388, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/generation/utils.py", line 503, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1115, in forward
    layer_outputs = layer_module(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 695, in forward
    self_attention_outputs = self.layer[0](
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 602, in forward
    attention_output = self.SelfAttention(
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 521, in forward
    query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/ruanjh/miniconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
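
A quick sanity check (only a guess on my side, not a confirmed diagnosis): CUBLAS_STATUS_NOT_INITIALIZED in child processes can show up when the parent process already initialized CUDA before spawning workers. torch.cuda.is_initialized can verify that the parent is still clean:

import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    # The parent must not have touched CUDA before spawning workers.
    assert not torch.cuda.is_initialized(), 'CUDA initialized before spawn'
    mp.set_start_method('spawn')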

Information

Tasks

Reproduction

The following code should be quite easy to reproduce with. All you need to do is replace model_dir in the main function with a concrete checkpoint, such as google/t5-v1_1-large, and make sure CUDA_VISIBLE_DEVICES exposes more than one GPU.

import logging

import torch
from torch import bfloat16
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorWithPadding,
)
from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class DefaultDataset(Dataset):
    """Tokenizes a list of strings up front and serves one example per index."""

    def __init__(self, data, tokenizer):
        self.data = tokenizer(data, return_tensors='pt', padding=True)

    def __getitem__(self, idx):
        return {'input_ids': self.data['input_ids'][idx]}

    def __len__(self):
        return self.data['input_ids'].size(0)

class NiuInference:
    def __init__(self, model_dir, data, dtype=bfloat16, dataset=None, data_collator=None,
                 output_path='niuinference.out', auto_batch_size=True, batch_size=1,
                 generation_config=None):
        self.model_dir = model_dir
        self.dtype = dtype
        self.data = data
        self.dataset = dataset
        self.data_collator = data_collator
        self.output_path = output_path
        self.batch_size = batch_size
        self.auto_batch_size = auto_batch_size
        self.generation_config = generation_config

    def _load_model_and_tokenizer(self, device):
        # Pick the auto class from the config so both decoder-only and
        # encoder-decoder checkpoints work.
        config = AutoConfig.from_pretrained(self.model_dir)
        if config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:
            model = AutoModelForCausalLM.from_pretrained(self.model_dir, torch_dtype=self.dtype)
        else:
            model = AutoModelForSeq2SeqLM.from_pretrained(self.model_dir, torch_dtype=self.dtype)
        model.to(device)
        tokenizer = AutoTokenizer.from_pretrained(self.model_dir)
        return model, tokenizer

    def get_pred(self, rank, out_path, data, record_dict):
        batch_size = 2  # hardcoded for this repro

        try:
            device = torch.device(f'cuda:{rank}')
            model, tokenizer = self._load_model_and_tokenizer(device)
            if self.dataset is not None:
                dataset = self.dataset(data=data, tokenizer=tokenizer)
            else:
                dataset = DefaultDataset(data=data, tokenizer=tokenizer)

            if self.data_collator is not None:
                collator = self.data_collator(tokenizer, model=model, padding=True)
            else:
                collator = DataCollatorWithPadding(tokenizer)
            dataloader = DataLoader(dataset, batch_size, collate_fn=collator, pin_memory=True, num_workers=0)
            result = []
            for batch in tqdm(dataloader):
                batch = batch.to(device)
                output = model.generate(
                    input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    num_beams=5,
                    do_sample=False,
                    temperature=1.0,
                    max_new_tokens=512,
                )
                pred = tokenizer.batch_decode(output, skip_special_tokens=True)
                print(pred)  # show per-rank decoded output
                result += pred
            record_dict[f'{rank}'] = result
        except Exception:
            print(f'error on rank {rank}')
            raise

    def split_list(self, lst, n):
        # Split lst into n contiguous chunks of (nearly) equal size.
        avg = len(lst) / float(n)
        return [lst[int(avg * i):int(avg * (i + 1))] for i in range(n)]

    def run(self):
        # Corner case: there may be fewer examples than available GPUs.
        world_size = min(torch.cuda.device_count(), len(self.data))

        data_subsets = self.split_list(self.data, world_size)
        processes = []
        manager = mp.Manager()
        record_dict = manager.dict()
        for rank in range(world_size):
            p = mp.Process(target=self.get_pred, args=(rank, self.output_path, data_subsets[rank], record_dict))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        with open(self.output_path, "w", encoding="utf-8") as f:
            for rank in range(world_size):
                for r in record_dict[f'{rank}']:
                    f.write(r.replace('\n','\\n')+'\n')

if __name__ == '__main__':
    mp.set_start_method('spawn')
    i = NiuInference(
        model_dir='<replace with a T5 or BART checkpoint>',
        data=['hello,how is your day', 'my wish is that you happy', 'from scratch'],
    )
    i.run()
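
With at least two GPUs visible, the script can be launched like this (the file name is a placeholder):

CUDA_VISIBLE_DEVICES=0,1 python repro_t5_mp.py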

Expected behavior

The T5 model should run inference under multiprocessing, just as BART does.

ArthurZucker commented 2 months ago

t5 is a fairly old model, this is probably expected? If you find a fix feel free to open a PR! 🤗

rangehow commented 2 months ago

t5 is a fairly old model, this is probably expected? If you find a fix feel free to open a PR! 🤗

Yes, but strangely enough, BART supports it. I would be happy to give it a try, but before that I would like to know whether this issue can be reproduced on your side? That would help me narrow down the scope of the investigation.

ArthurZucker commented 2 months ago

To be honest, multiprocessing is outside the scope of transformers, and we usually recommend using accelerate 😉. FSDP is also a possible solution, as is DeepSpeed. Maybe making the tutorials about that more discoverable would be the best solution.
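
For reference, a rough sketch of that accelerate-based approach (a minimal sketch, assuming accelerate's PartialState.split_between_processes API; the checkpoint path is a placeholder):

from accelerate import PartialState
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Launch with: accelerate launch --num_processes 2 infer.py
state = PartialState()  # one process per GPU under `accelerate launch`
model = AutoModelForSeq2SeqLM.from_pretrained('<t5-checkpoint>').to(state.device)
tokenizer = AutoTokenizer.from_pretrained('<t5-checkpoint>')

data = ['hello, how is your day', 'my wish is that you happy', 'from scratch']
with state.split_between_processes(data) as shard:  # each rank gets its own slice
    inputs = tokenizer(shard, return_tensors='pt', padding=True).to(state.device)
    outputs = model.generate(**inputs, num_beams=5, max_new_tokens=512)
    print(state.process_index, tokenizer.batch_decode(outputs, skip_special_tokens=True))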

rangehow commented 2 months ago

To be honest, multiprocessing is outside the scope of transformers, and we usually recommend using accelerate 😉. FSDP is also a possible solution, as is DeepSpeed. Maybe making the tutorials about that more discoverable would be the best solution.

I think there is a real use case here: even with sufficient GPU memory, we want to distribute data across many cards so as to use multiple GPUs for parallel inference. This behavior is somewhat similar to DDP, but does not involve partitioning parameters or optimizer states. Multiprocessing is one building block of DDP, and I have essentially extracted the smallest part needed to achieve this. In 2023 I saw a 🤗 staff member on the forum mention plans to support this, and since I haven't seen any relevant feature yet, I tried to implement it myself. Currently it runs correctly on many models on huggingface, with only T5 exhibiting this issue. At present it may be a bit beyond my technical stack; I hope friends in the community can work together to improve this 😃

rangehow commented 2 months ago

The most difficult thing for me may be that debugging in a multi-process setting is very complex, and pdb cannot set breakpoints properly. 😟
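
One workaround sketch (a hedged suggestion, not a transformers feature): CUDA errors are reported asynchronously, so running with CUDA_LAUNCH_BLOCKING=1 makes the traceback point at the actual failing kernel, and per-rank log files are usually more practical than pdb inside spawned workers. setup_rank_logger below is a hypothetical helper:

import logging
import os

# Must be set before any CUDA call in the process for traces to be exact.
os.environ.setdefault('CUDA_LAUNCH_BLOCKING', '1')

def setup_rank_logger(rank):
    # One log file per spawned worker instead of attaching pdb.
    logger = logging.getLogger(f'rank{rank}')
    logger.addHandler(logging.FileHandler(f'rank{rank}.log'))
    logger.setLevel(logging.DEBUG)
    return logger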

hackpk commented 2 months ago

@ArthurZucker and @rangehow can I try it out?

rangehow commented 2 months ago

@ArthurZucker and @rangehow can I try it out?

Of course! Just go for it 🎉