microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

self.qkv_gemm_func returns ValueError: The deleter and context arguments are mutually exclusive. #3284

Open publicstaticvo opened 1 year ago

publicstaticvo commented 1 year ago

Describe the bug
I am getting the following error while attempting to run DeepSpeed-Chat step 3 with the actor model CarperAI/openai_summarize_tldr_sft (GPT-J 6B), the critic model CarperAI/openai_summarize_tldr_rm_checkpoint (GPT-J 6B), and ZeRO stage 2.

Traceback (most recent call last):
  File "main.py", line 523, in <module>
    main()
  File "main.py", line 430, in main
    out = trainer.generate_experience(prompts)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
    seq = self._generate_sequence(prompts)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/ppo_trainer.py", line 75, in _generate_sequence
    seq = self.actor_model.module.generate(prompts, max_length=max_min_length, min_length=max_min_length)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 254, in generate
    generate_ret_vals = self._generate(*inputs, **kwargs)
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.greedy_search(
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py", line 2248, in greedy_search
    outputs = self(
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 852, in forward
    transformer_outputs = self.transformer(
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 687, in forward
    outputs = block(
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/model_implementations/transformers/ds_transformer.py", line 147, in forward
    self.attention(input,
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/ops/transformer/inference/ds_attention.py", line 152, in forward
    qkv_out = self.qkv_func(input=input,
  File "/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/ops/transformer/inference/op_binding/qkv_gemm.py", line 35, in forward
    output = self.qkv_gemm_func(input, weight, q_scale, bias, gamma, beta, self.config.epsilon, add_bias,
ValueError: The deleter and context arguments are mutually exclusive.

ds_report output


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.10.0+cu113
deepspeed install path ........... ['/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.1+cc67f22f, cc67f22f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3

System info (please complete the following information):

Additional context
I would like to know whether the pull request in https://github.com/microsoft/DeepSpeed/pull/3256, or a similar fix, could help with this issue.

cmikeh2 commented 1 year ago

Hi @publicstaticvo, thank you for reporting this issue. Currently, the Hybrid Engine is only supported for the OPT family of models, but additional model support (including GPT-J) is on our roadmap and in development. I will make sure to update this issue here when support for GPT-J has been added and validated. Thanks!
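For anyone hitting this before GPT-J support lands, one pragmatic interim approach is to only enable the Hybrid Engine path for model families it currently handles. This is a minimal sketch, not a DeepSpeed API: the supported-family set below is an assumption based solely on the maintainer comment above (OPT only at the time of writing), and `hybrid_engine_supported` is a hypothetical helper.

```python
# Hypothetical guard: gate the Hybrid Engine on the model family.
# The set below is an assumption taken from the maintainer comment
# above (OPT-only at the time of writing), not an official DeepSpeed list.
SUPPORTED_HYBRID_ENGINE_FAMILIES = {"opt"}

def hybrid_engine_supported(model_type: str) -> bool:
    """Return True if the Hybrid Engine is believed to support this
    HF config.model_type (e.g. "opt", "gptj")."""
    return model_type.lower() in SUPPORTED_HYBRID_ENGINE_FAMILIES

# Example: GPT-J is not yet supported, so a script could fall back to
# running step 3 without the Hybrid Engine for such models.
enable_hybrid_engine = hybrid_engine_supported("gptj")
```

In DeepSpeed-Chat this would translate to simply not passing the Hybrid Engine option for GPT-J actors until support is validated.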