huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Cannot run Deepspeed inference of GPT-Neo with low_cpu_mem_usage enabled #14581

Closed Jiyeon1230 closed 2 years ago

Jiyeon1230 commented 2 years ago

Environment info

Who can help

@stas00

Information

Model I am using (Bert, XLNet ...): EleutherAI/gpt-neo-1.3B

The problem arises when using:

* [x] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)

The tasks I am working on is:

* [ ] an official GLUE/SQUaD task: (give the name)
* [x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

1. Write the following code, which is adapted from https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference

```python
import os
import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Rank/world size are set by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Load the model with low_cpu_mem_usage to reduce peak host RAM.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B", low_cpu_mem_usage=True)
#model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", low_cpu_mem_usage=True)
tokenizer_i = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
#tokenizer_i = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

generator = pipeline('text-generation', model=model, device=local_rank, tokenizer=tokenizer_i)

# Wrap the model with Deepspeed Inference (kernel injection / tensor parallelism).
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
```

2. Execute the code with the Deepspeed launcher:

```shell
deepspeed --num_gpus 1 test.py
```

3. Execution fails:

```log
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    replace_method='auto')
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/__init__.py", line 285, in init_inference
    quantization_setting)
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/inference/engine.py", line 70, in __init__
    self._apply_injection_policy()
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/inference/engine.py", line 148, in _apply_injection_policy
    self.quantize_groups))
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 308, in replace_transformer_layer
    _replace_policy=policy)
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 404, in replace_module
    replaced_module, _ = _replace_module(model, policy)
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 429, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 429, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 425, in _replace_module
    layer_id))
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 301, in replace_fn
    layer_id=layer_id)
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 224, in replace_with_policy
    dense_w = transpose(dense_w)
  File "/home/ubuntu/env/lib/python3.6/site-packages/deepspeed/module_inject/replace_module.py", line 218, in transpose
    data.view(-1).copy_(data.transpose(-1, -2).contiguous().view(-1))
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
```
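For reference, the final RuntimeError is PyTorch's generic error for calling `.view()` on a tensor whose strides are not view-compatible; a minimal standalone sketch (independent of Deepspeed and of this model) triggers the same message:

```python
import torch

w = torch.randn(4, 8).t()       # transposing makes the tensor non-contiguous
try:
    w.view(-1)                  # same call pattern as deepspeed's transpose() helper
except RuntimeError as e:
    print(e)                    # "view size is not compatible ... Use .reshape(...) instead."
print(w.reshape(-1).shape)      # reshape copies when necessary, so it succeeds
```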

Expected behavior

Deepspeed inference should run successfully without any failure.

Comment

Hi all,

I'm trying to run GPT-Neo inference using Deepspeed. Because of my system environment, I need to reduce the peak RAM usage, so I added the argument low_cpu_mem_usage=True to from_pretrained. But it fails as described above. I'm filing this issue with HF because it runs successfully when I remove low_cpu_mem_usage or switch the model to gpt-j-6B. Could you advise on this problem? If the low_cpu_mem_usage feature doesn't support GPT-Neo, I would appreciate it if you could say so.

Thanks,

stas00 commented 2 years ago

Deepspeed Inference is not a completed product AFAIK, and it's not yet integrated into Transformers because of that.

As you can see from the trace, the transformers library is not being used. So please re-file this issue with Deepspeed and tag @RezaYazdaniAminabadi.

Any reason why you are not using Deepspeed ZeRO Inference? https://huggingface.co/transformers/master/main_classes/deepspeed.html#deepspeed-zero-inference

Deepspeed Inference and Deepspeed ZeRO Inference are two completely different things. The former uses tensor parallelism; the latter uses ZeRO sharding.
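Roughly, the non-Trainer ZeRO Inference path looks like the sketch below (adapted from the linked doc; the config values are illustrative and not tuned for this model):

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

local_rank = int(os.getenv("LOCAL_RANK", "0"))

# Illustrative ZeRO stage-3 config; fp16, buffer sizes, and batch size
# should be tuned for the actual model and hardware.
ds_config = {
    "fp16": {"enabled": False},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created *before* from_pretrained so the weights are loaded straight
# into the ZeRO-3 sharded layout; keep this object alive.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    # With more than one GPU, also pass synced_gpus=True to generate().
    outputs = engine.module.generate(**inputs, do_sample=True, min_length=50)
print(tokenizer.decode(outputs[0]))
```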

Jiyeon1230 commented 2 years ago

@stas00 Thanks for your comment. I filed the issue against transformers because it doesn't reproduce with low_cpu_mem_usage disabled. Do you think it should be handled by Deepspeed even so?

I'm eventually trying to load a bigger GPT-Neo-like model which doesn't fit on one GPU. That's why Deepspeed Inference is used. I appreciate your advice though.

Jiyeon1230 commented 2 years ago

@stas00 I have no idea which change fixed the issue, but it went away after I updated the Deepspeed Inference code to the latest version. I really appreciate your advice, since I had been focusing only on transformers. Thanks!!!

stas00 commented 2 years ago

It's an actively developed new product, so someone must have reported this issue recently and it got fixed.

I'm glad this is now working for you, @Jiyeon1230

I'm eventually trying to load a bigger GPT-Neo like model which doesn't fit to one GPU. That's why Deepspeed Inference is used.

And I repeat: Deepspeed ZeRO is an already well-tested scalability solution that you can use today to run models larger than one GPU, and it's fully integrated into Transformers. It has additional features like CPU Offload, which scales better and which I don't think Deepspeed Inference supports at the moment. See the doc link in my last comment.
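For reference, parameter CPU offload is just a ZeRO-3 config switch; a hedged sketch of the relevant section, building on the illustrative ds_config from the earlier sketch:

```python
# Illustrative ZeRO-3 section with parameters offloaded to host RAM.
# Exact thresholds and buffer sizes should be tuned for the model and machine.
ds_config["zero_optimization"] = {
    "stage": 3,
    "offload_param": {"device": "cpu", "pin_memory": True},
}
```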

But, of course, it's up to you what you use.

Jiyeon1230 commented 2 years ago

@stas00 Oh, sure! I may have misunderstood Deepspeed ZeRO. I'll definitely look into it. Thanks for your advice.