microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Can the program support longer answer_seq and prompt_seq lengths? #3238

Open lljjgg opened 1 year ago

lljjgg commented 1 year ago

I ran the test program with "python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8" and it runs normally. But after I changed the parameters max_answer_seq_len=1024 and max_prompt_seq_len=1024 in run_1.3b.sh, the program reported an error:

    Time to load utils op: 0.00037217140197753906 seconds
    Traceback (most recent call last):
      File "/data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
        main()
      File "/data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 425, in main
        out = trainer.generate_experience(prompts)
      File "/data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
        seq = self._generate_sequence(prompts)
      File "/data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
        seq = self.actor_model.module.generate(prompts,
                                               max_length=max_min_length,
                                               min_length=max_min_length)
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 258, in generate
        self.unfuse_lora_weight()
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 144, in unfuse_lora_weight
        self._unfuse_lora(self.layer_params[layer_id], self.lora_params[layer_id])
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 140, in _unfuse_lora
        weight.data -= lora_scaling * torch.matmul(lora_left_weight.t(), lora_ri
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

Is this a bug? Or are there other ways for the program to support longer answer_seq and prompt_seq lengths? We look forward to your reply.
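For context on why memory pressure is a likely culprit: the traceback shows that `_generate_sequence` asks the actor model to generate `max_answer_seq_len + prompts.shape[1]` tokens, so raising both lengths from the script defaults (256 each, if I recall the defaults correctly) to 1024 quadruples the generated sequence length. The sketch below is a rough, back-of-the-envelope estimate of KV-cache growth for an OPT-13B-sized actor; the layer count, hidden size, and batch size are assumptions for illustration, not values taken from the issue.

```python
# Back-of-the-envelope only: the OPT-13B shape (40 layers, hidden size 5120),
# fp16 precision, and batch_size=4 are assumptions, not values from the issue.
def kv_cache_bytes(batch_size, seq_len, num_layers=40, hidden_size=5120, bytes_per_elem=2):
    # Each decoder layer caches keys and values: two fp16 tensors of shape
    # [batch_size, seq_len, hidden_size].
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem

for prompt_len, answer_len in ((256, 256), (1024, 1024)):
    seq_len = prompt_len + answer_len  # mirrors max_min_length in _generate_sequence
    gib = kv_cache_bytes(batch_size=4, seq_len=seq_len) / 2**30
    print(f"prompt={prompt_len}, answer={answer_len}: ~{gib:.2f} GiB of KV cache per GPU")
```

This ignores activations, optimizer state, and the hybrid engine's LoRA buffers, so the real footprint is larger; the point is only that the 1024/1024 setting roughly quadruples generation-time memory per sample, which can surface as a cuBLAS execution failure rather than an explicit out-of-memory error.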

hemangjoshi37a commented 1 year ago

The issue you are facing is a RuntimeError related to CUDA when attempting to increase the max_answer_seq_len and max_prompt_seq_len to 1024 in the run_1.3b.sh script. The error seems to originate from the cublasGemmEx function call.

This error might be due to a lack of GPU memory when handling larger sequence lengths. Here are some suggestions to resolve the issue:

  1. Reduce the batch size: Decreasing the batch size can help manage GPU memory usage when working with longer sequences. Update the per-device train/eval batch-size arguments in the run_1.3b.sh script accordingly (a configuration sketch covering suggestions 1–3 follows this list).

  2. Gradient accumulation: If reducing the batch size is not sufficient or hurts model quality, consider gradient accumulation. It updates the model weights less frequently, effectively simulating a larger batch size while keeping per-step memory low. Add or adjust the --gradient_accumulation_steps argument in the run_1.3b.sh script.

  3. Model parallelism / ZeRO: If neither of the above is sufficient, you may need to shard the model across multiple GPUs. This can be done by enabling DeepSpeed's memory-partitioning and model-parallel features, such as ZeRO stage 3 (which partitions parameters, gradients, and optimizer states across GPUs) or Megatron-style tensor parallelism. Check the DeepSpeed documentation and examples for guidance.

  4. Optimize CUDA settings: The RuntimeError you encountered is related to the cublasGemmEx function call. It might be possible to optimize CUDA settings to avoid the error. However, this approach requires a deeper understanding of CUDA and DeepSpeed internals and might not be the most straightforward solution.

  5. Before attempting any of these solutions, ensure that your GPU drivers and CUDA toolkit are up-to-date and compatible with the DeepSpeed library. Additionally, double-check that your GPU has sufficient memory to handle the increased sequence lengths.
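To make suggestions 1–3 concrete, here is a minimal DeepSpeed configuration sketch. It is illustrative only: the numbers are placeholders, and DeepSpeed-Chat assembles a similar dictionary internally from the script's command-line arguments, so in practice you would tune the corresponding arguments in run_1.3b.sh rather than hand-write this. The keys themselves (train_micro_batch_size_per_gpu, gradient_accumulation_steps, zero_optimization with CPU offload) are standard DeepSpeed config options.

```python
# Sketch only: values are placeholders, not recommendations from the DeepSpeed team.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # suggestion 1: smaller per-GPU batch
    "gradient_accumulation_steps": 8,      # suggestion 2: preserve the effective batch size
    "fp16": {"enabled": True},
    "zero_optimization": {                 # suggestion 3: ZeRO stage 3 partitioning
        "stage": 3,
        "offload_param": {"device": "cpu"},      # optional: move parameters to CPU memory
        "offload_optimizer": {"device": "cpu"},  # optional: move optimizer state to CPU memory
    },
}

# model and optimizer come from the training script:
# engine, optimizer, _, _ = deepspeed.initialize(model=model,
#                                                optimizer=optimizer,
#                                                config=ds_config)
```

ZeRO stage 3 with CPU offload trades throughput for memory headroom, which is usually the relevant trade-off when 2048-token generation is what pushes the GPUs over the edge.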

Keep in mind that increasing sequence lengths will likely result in higher memory consumption and longer training times. If the issue persists after trying the above suggestions, consider providing more information in the GitHub issue, such as your GPU model, GPU memory, and any other relevant context to help the DeepSpeed maintainers identify the root cause and offer a more specific solution.
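If you do attach hardware details to the issue, a quick way to collect them is a short script using standard torch.cuda queries (a convenience sketch, not a DeepSpeed utility):

```python
# Environment report to attach to the issue; uses only standard torch.cuda queries,
# so it reflects what the training process itself would see.
import torch

print("torch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB total memory")
```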