microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

[BUG] OPT, GPT-neo accuracy dropped when using kernel injection #3511

Open yuchen2580 opened 1 year ago

yuchen2580 commented 1 year ago

**Describe the bug**
I'm conducting experiments with OPT-1.3B and GPT-Neo 2.7B on wikitext-2, using the official example from Hugging Face together with DeepSpeed. What I observe is that accuracy drops and perplexity increases significantly with kernel injection, yet the generated tokens are almost the same, which is very strange. So far I haven't had the time and resources to test other models.

Without DeepSpeed I get ppl = 29.76; with DeepSpeed I get ppl = 9190.3371.
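For scale, run_clm.py (as far as I can tell) reports perplexity as exp(eval_loss), so this gap corresponds to a jump from roughly 3.4 to 9.1 nats of mean per-token cross-entropy, i.e. the model outputs themselves must be off, not just the reporting:

```python
import math

# Back out the mean per-token cross-entropy from the reported perplexities,
# assuming ppl = exp(eval_loss) as in run_clm.py's evaluation step.
loss_hf = math.log(29.76)       # ~3.39 nats/token without DeepSpeed
loss_ds = math.log(9190.3371)   # ~9.13 nats/token with kernel injection
print(loss_hf, loss_ds)
```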

### Tasks
- [ ] OPT-1.3B inference
- [ ] GPT-Neo 2.7B inference

**To Reproduce**
Steps to reproduce the behavior:

1. Simple inference script to reproduce: https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/language-modeling/run_clm.py

   and add the DeepSpeed init in place:

   ```python
   ds_engine = deepspeed.init_inference(
       model,
       mp_size=world_size,
       dtype=torch.float,
       max_out_tokens=4096,
       replace_with_kernel_inject=True,
       replace_method="auto",
   )
   model = ds_engine.module
   ```

The output includes the ppl and accuracy summary, so the regression should be easy to spot.
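For convenience, here is a condensed sketch of what I am measuring, outside of run_clm.py. The model id, truncation length, and chunking below are placeholders/simplifications (I actually load the model from a local path), not exactly what the script does:

```python
# Minimal sketch of the perplexity comparison, assuming a single GPU and the
# placeholder model id "facebook/opt-1.3b".
import math
import torch
import deepspeed
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; I load a local copy offline
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[:, :4096].cuda()

def perplexity(m, ids, stride=1024):
    # Mean cross-entropy over non-overlapping chunks, then exponentiate.
    losses = []
    with torch.no_grad():
        for i in range(0, ids.size(1) - 1, stride):
            chunk = ids[:, i : i + stride]
            losses.append(m(chunk, labels=chunk).loss.item())
    return math.exp(sum(losses) / len(losses))

print("HF baseline ppl:", perplexity(model, ids))

# Same model after kernel injection, with the same arguments as in the report.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float,
    max_out_tokens=4096,
    replace_with_kernel_inject=True,
    replace_method="auto",
)
print("DeepSpeed injected ppl:", perplexity(ds_engine.module, ids))
```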

2. Required packages and their versions: transformers==4.28.1, datasets==2.11.0, evaluate==0.4.0, pytorch==1.12.0. No accelerator is used when running.

3. How to run the script:

   ```bash
   python run_clm.py \
     --model_name_or_path ../resource_opt13b \
     --dataset_name wikitext \
     --dataset_config_name wikitext-2-raw-v1 \
     --per_device_eval_batch_size 1 \
     --do_eval \
     --output_dir ./tmp
   ```

   `../resource_opt13b` can be replaced by a Hugging Face model name (e.g. `opt-1.3b`); I downloaded the model and load it offline.


**Expected behavior**
I expect the accuracy and ppl to be the same, or at least similar, with and without kernel injection.

**ds_report output**

```
DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

 [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/xxxxx/deepspeed/lib/python3.7/site-packages/torch']
torch version .................... 1.12.0
deepspeed install path ........... ['/xxxxx/deepspeed/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
```


**Docker context**: None

**Additional context**: None

yuchen2580 commented 1 year ago

Here is my benchmark for the several models I tested. Experiments were done on a V100 with deepspeed 4.30:

| Model | PPL (Hugging Face / PyTorch) | PPL (DeepSpeed kernel injection) |
| --- | --- | --- |
| BLOOM 1.7b | 18.992 | 18.979 |
| BLOOM 560M | 26.012 | 26.043 |
| GPT-neo 1.3B | 15.402 | 238.091 |
| llama 1.7b | 8.995 | 8.895 |
| OPT 1.3B | 29.760 | 9181.997 |

Numbers in the left column are ppl from plain Hugging Face (PyTorch); numbers in the right column are ppl with DeepSpeed kernel injection. It seems to me that BLOOM and llama are fine, while GPT-Neo and OPT are not. Is it possible that some value triggers a boundary issue in the kernel implementation?
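Since greedy generations look almost identical while the loss explodes, one quick check (a sketch, using the same placeholder model id as above) would be to compare raw logits of the baseline and the injected model on a single batch; argmax can agree even when the absolute values drift badly, which would match what I am seeing:

```python
# Sketch: compare logits of the baseline vs. kernel-injected model on one batch.
import copy
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; the real run uses a local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
baseline = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()
injected = deepspeed.init_inference(
    copy.deepcopy(baseline),
    mp_size=1,
    dtype=torch.float,
    replace_with_kernel_inject=True,
    replace_method="auto",
).module

ids = tokenizer("DeepSpeed kernel injection test.", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    ref = baseline(ids).logits
    out = injected(ids).logits

# Greedy next-token choices can still agree even if the logits are far apart,
# which would explain identical generations but a very different cross-entropy.
print("max abs diff:", (ref - out).abs().max().item())
print("argmax agreement:", (ref.argmax(-1) == out.argmax(-1)).float().mean().item())
```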