microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

[BUG] OPT, GPT-neo accuracy dropped when using kernel injection #3511

Open yuchen2580 opened 1 year ago

yuchen2580 commented 1 year ago

**Describe the bug**
I'm conducting experiments with OPT-1.3B and GPT-Neo 2.7B on wikitext-2, using the official example from Hugging Face together with DeepSpeed. What I observe is that accuracy drops and perplexity increases significantly with kernel injection, yet the generated tokens are almost the same, which is very strange. So far I haven't had the time and resources to test other models.

Without DeepSpeed I get ppl = 29.76; with DeepSpeed I get ppl = 9190.3371.
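For scale, run_clm.py (as far as I can tell) reports perplexity as exp(eval_loss), so this gap corresponds to a jump from roughly 3.4 to 9.1 nats of mean per-token cross-entropy, i.e. the model outputs themselves must be off, not just the reporting:

```python
import math

# Back out the mean per-token cross-entropy from the reported perplexities,
# assuming ppl = exp(eval_loss) as in run_clm.py's evaluation step.
loss_hf = math.log(29.76)       # ~3.39 nats/token without DeepSpeed
loss_ds = math.log(9190.3371)   # ~9.13 nats/token with kernel injection
print(loss_hf, loss_ds)
```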

### Tasks
- [ ] OPT-1.3B inference
- [ ] GPT-Neo 2.7B inference

**To Reproduce**
Steps to reproduce the behavior:

1. Simple inference script to reproduce: https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/language-modeling/run_clm.py

   and add the DeepSpeed init in place:

   ```python
   ds_engine = deepspeed.init_inference(
       model,
       mp_size=world_size,
       dtype=torch.float,
       max_out_tokens=4096,
       replace_with_kernel_inject=True,
       replace_method="auto",
   )
   model = ds_engine.module
   ```

The output includes the ppl and accuracy summary, so the regression should be easy to spot.
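For convenience, here is a condensed sketch of what I am measuring, outside of run_clm.py. The model id, truncation length, and chunking below are placeholders/simplifications (I actually load the model from a local path), not exactly what the script does:

```python
# Minimal sketch of the perplexity comparison, assuming a single GPU and the
# placeholder model id "facebook/opt-1.3b".
import math
import torch
import deepspeed
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; I load a local copy offline
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[:, :4096].cuda()

def perplexity(m, ids, stride=1024):
    # Mean cross-entropy over non-overlapping chunks, then exponentiate.
    losses = []
    with torch.no_grad():
        for i in range(0, ids.size(1) - 1, stride):
            chunk = ids[:, i : i + stride]
            losses.append(m(chunk, labels=chunk).loss.item())
    return math.exp(sum(losses) / len(losses))

print("HF baseline ppl:", perplexity(model, ids))

# Same model after kernel injection, with the same arguments as in the report.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float,
    max_out_tokens=4096,
    replace_with_kernel_inject=True,
    replace_method="auto",
)
print("DeepSpeed injected ppl:", perplexity(ds_engine.module, ids))
```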

2. Required packages and their versions: transformers==4.28.1, datasets==2.11.0, evaluate==0.4.0, pytorch==1.12.0. No accelerator is used when running.

3. How to run the script:

   ```bash
   python run_clm.py \
     --model_name_or_path ../resource_opt13b \
     --dataset_name wikitext \
     --dataset_config_name wikitext-2-raw-v1 \
     --per_device_eval_batch_size 1 \
     --do_eval \
     --output_dir ./tmp
   ```

   `../resource_opt13b` can be replaced by a Hugging Face model name (e.g. `opt-1.3b`); I downloaded the model and load it offline.


**Expected behavior**
I expect the accuracy and ppl to be the same, or at least similar, with and without kernel injection.

**ds_report output**

```
DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

 [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxxxx/deepspeed/lib/python3.7/site-packages/torch']
torch version .................... 1.12.0
deepspeed install path ........... ['/xxxxx/deepspeed/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
```


**Docker context**: None

**Additional context**: None

yuchen2580 commented 1 year ago

Here is my benchmark for the several models I tested. Experiments were done on a V100 with deepspeed 4.30:

| Model | PPL (Hugging Face / PyTorch) | PPL (DeepSpeed kernel injection) |
| --- | --- | --- |
| BLOOM 1.7b | 18.992 | 18.979 |
| BLOOM 560M | 26.012 | 26.043 |
| GPT-neo 1.3B | 15.402 | 238.091 |
| llama 1.7b | 8.995 | 8.895 |
| OPT 1.3B | 29.760 | 9181.997 |

Numbers in the left column are ppl from plain Hugging Face (PyTorch); numbers in the right column are ppl with DeepSpeed kernel injection. It seems to me that BLOOM and llama are fine, while GPT-Neo and OPT are not. Is it possible that some value triggers a boundary issue in the kernel implementation?
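Since greedy generations look almost identical while the loss explodes, one quick check (a sketch, using the same placeholder model id as above) would be to compare raw logits of the baseline and the injected model on a single batch; argmax can agree even when the absolute values drift badly, which would match what I am seeing:

```python
# Sketch: compare logits of the baseline vs. kernel-injected model on one batch.
import copy
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; the real run uses a local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
baseline = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()
injected = deepspeed.init_inference(
    copy.deepcopy(baseline),
    mp_size=1,
    dtype=torch.float,
    replace_with_kernel_inject=True,
    replace_method="auto",
).module

ids = tokenizer("DeepSpeed kernel injection test.", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    ref = baseline(ids).logits
    out = injected(ids).logits

# Greedy next-token choices can still agree even if the logits are far apart,
# which would explain identical generations but a very different cross-entropy.
print("max abs diff:", (ref - out).abs().max().item())
print("argmax agreement:", (ref.argmax(-1) == out.argmax(-1)).float().mean().item())
```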