huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[bloomz] attn_mask return bool, but Deepspeed softmax input needs int #23280

Closed: shenzhuo closed this issue 1 year ago

shenzhuo commented 1 year ago

System Info

Who can help?

@thomasw21 @patrickvonplaten @sgugger

Information

Tasks

Reproduction

# Using the DeepSpeed-Chat example, but swapping the OPT model for bloomz-1b7
# DeepSpeed-Chat repo: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/README.md

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH="bigscience/bloomz-1b7"
CRITIC_MODEL_PATH="bigscience/bloomz-1b7"
ACTOR_ZERO_STAGE=${3:-2}
CRITIC_ZERO_STAGE=${4:-2}
OUTPUT=${5:-'./output'}
NUM_GPUS=${6:-8}
NUM_NODES=${7:-1}
mkdir -p $OUTPUT

Num_Padding_at_Beginning=0 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6
hostname='localhost'

export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export TOKENIZERS_PARALLELISM=false

deepspeed --master_port 25303 --master_addr ${hostname} --num_gpus ${NUM_GPUS} --num_nodes ${NUM_NODES} --hostfile 'deepspeed_hostfile' main.py \
  --data_path Dahoas/rm-static \
  --data_split 2,4,4 \
  --actor_model_name_or_path $ACTOR_MODEL_PATH \
  --critic_model_name_or_path $CRITIC_MODEL_PATH \
  --num_padding_at_beginning 1 \
  --per_device_train_batch_size 1 \
  --per_device_mini_train_batch_size 1 \
  --generation_batch_numbers 1 \
  --ppo_epochs 1 \
  --max_answer_seq_len 256 \
  --max_prompt_seq_len 256 \
  --actor_learning_rate ${Actor_Lr} \
  --critic_learning_rate ${Critic_Lr} \
  --disable_actor_dropout \
  --num_train_epochs 1 \
  --lr_scheduler_type cosine \
  --gradient_accumulation_steps 1 \
  --num_warmup_steps 100 \
  --deepspeed --seed 1234 \
  --enable_hybrid_engine \
  --inference_tp_size ${NUM_NODES} \
  --tp_gather_partition_size ${NUM_GPUS} \
  --actor_zero_stage $ACTOR_ZERO_STAGE \
  --critic_zero_stage $CRITIC_ZERO_STAGE \
  --actor_gradient_checkpointing \
  --critic_gradient_checkpointing \
  --output_dir $OUTPUT |&
  tee $OUTPUT/training.log

The error is:

Traceback (most recent call last):
  File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 562, in <module>
    main()
  File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 471, in main
    out = trainer.generate_experience(prompts)
  File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
    seq = self._generate_sequence(prompts)
  File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/dcv/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 245, in generate
    generate_ret_vals = self._generate(*inputs, **kwargs)
  File "/dcv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/dcv/lib/python3.9/site-packages/transformers/generation/utils.py", line 1437, in generate
    return self.greedy_search(
  File "/dcv/lib/python3.9/site-packages/transformers/generation/utils.py", line 2248, in greedy_search
    outputs = self(
  File "/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/dcv/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
    transformer_outputs = self.transformer(
  File "/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/dcv/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
    outputs = block(
  File "/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/dcv/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 147, in forward
    self.attention(input,
  File "/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dcv/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 160, in forward
    context_layer, key_layer, value_layer = self.compute_attention(qkv_out=qkv_out,
  File "/dcv/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 253, in compute_attention
    attn_mask=((1 - input_mask).half() * minus_inf),
  File "/dcv/lib/python3.9/site-packages/torch/_tensor.py", line 39, in wrapped
    return f(*args, **kwargs)
  File "/dcv/lib/python3.9/site-packages/torch/_tensor.py", line 833, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported. If you are trying to invert a mask, use the `~` or `logical_not()` operator instead.
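
For reference, the failure can be reproduced outside DeepSpeed with a few lines of PyTorch. This is a minimal sketch (the mask values and the minus_inf constant are assumed for illustration, not copied from ds_attention.py); it only shows why 1 - mask fails on a bool tensor while ~mask does not:

import torch

# Minimal sketch of the failing operation (values assumed for illustration).
# modeling_bloom now hands DeepSpeed a torch.bool mask, but ds_attention.py
# builds its additive mask with integer arithmetic.
input_mask = torch.tensor([[False, False, True]])   # bool mask, as BLOOM now produces
minus_inf = -10000.0                                 # placeholder for DeepSpeed's minus_inf constant

try:
    attn_mask = (1 - input_mask).half() * minus_inf  # what compute_attention does -> RuntimeError
except RuntimeError as err:
    print(err)

# The same inversion written for bool tensors works fine:
attn_mask = (~input_mask).half() * minus_inf
print(attn_mask)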

I want to know why this PR, https://github.com/huggingface/transformers/pull/18141/files, changed the following code:

expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask

to:

expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask

Because of that change, causal_mask is now a torch.bool tensor instead of torch.int64.
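
Here is a rough illustration of the dtype difference (shapes and values are made up; this is not the actual mask-preparation code in modeling_bloom): combining two bool masks with | keeps the result bool, while the old additive combination of numeric masks keeps a numeric dtype.

import torch

# Toy masks for a sequence of length 4 (illustrative only).
expanded_attn_mask = torch.zeros(1, 1, 4, 4, dtype=torch.bool)          # padding mask
combined_attention_mask = torch.ones(4, 4).triu(1).bool()[None, None]   # causal mask

# After the change: boolean OR, so the combined mask stays torch.bool
causal_mask = expanded_attn_mask | combined_attention_mask
print(causal_mask.dtype)        # torch.bool

# Before the change the masks were numeric and combined with +, so the dtype stayed numeric
old_causal_mask = expanded_attn_mask.long() + combined_attention_mask.long()
print(old_causal_mask.dtype)    # torch.int64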

Expected behavior

causal_mask should be a torch.int64 tensor, not torch.bool.

amyeroberts commented 1 year ago

Hi @shenzhuo,

The linked PR was closed and its commits were never merged; the PR that actually introduced the change was #18344. From that PR's description, converting causal_mask to bool was intentional and not a side effect. I'll let @thomasw21 explain why this change was made :)

thomasw21 commented 1 year ago

Yeah, so there's no reason for attention_mask to be int64, since it basically stores boolean values. I think the reason this is breaking is DeepSpeed: the forward function is overridden by custom operations on the DeepSpeed side: https://github.com/microsoft/DeepSpeed/blame/194053bd58947ac6a45363ba780c9dfb127d3064/deepspeed/ops/transformer/inference/ds_attention.py#L168

I would suggest fixing this on the DeepSpeed side, i.e. probably changing (1 - input_mask).to(target_dtype) * minus_inf to something like (~input_mask).to(target_dtype) * minus_inf
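
As a sketch of that suggestion (not DeepSpeed's actual code; the helper name, target_dtype and minus_inf values are assumptions), inverting with ~ after casting to bool would accept both the new bool masks and legacy int64 masks, provided the incoming mask follows the convention DeepSpeed already assumes (1 / True marks positions to keep):

import torch

# Hedged sketch of the proposed fix. The original line,
#     (1 - input_mask).to(target_dtype) * minus_inf
# fails on bool masks; ~ works on bool, and casting to bool first also keeps
# legacy int64 masks working. Masked-out positions end up with minus_inf.
def build_additive_mask(input_mask, target_dtype=torch.float16, minus_inf=-10000.0):
    return (~input_mask.bool()).to(target_dtype) * minus_inf

bool_mask = torch.tensor([[True, True, False]])  # new-style bool mask
int_mask = torch.tensor([[1, 1, 0]])             # legacy int64 mask
print(build_additive_mask(bool_mask))
print(build_additive_mask(int_mask))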

shenzhuo commented 1 year ago

Yeah, so there's no reason for attention_mask to be int64, since it basically stores boolean values. I think the reason this is breaking is DeepSpeed: the forward function is overridden by custom operations on the DeepSpeed side: https://github.com/microsoft/DeepSpeed/blame/194053bd58947ac6a45363ba780c9dfb127d3064/deepspeed/ops/transformer/inference/ds_attention.py#L168

I would suggest fixing this on the DeepSpeed side, i.e. probably changing (1 - input_mask).to(target_dtype) * minus_inf to something like (~input_mask).to(target_dtype) * minus_inf

I think DeepSpeed uses (1 - input_mask).to(target_dtype) * minus_inf because their framework was tested against the OPT model. At the same time, many modeling_*.py files in transformers still return an int64 mask.

thomasw21 commented 1 year ago

Hmm, the specific module is called BloomSelfAttention: https://github.com/microsoft/DeepSpeed/blob/194053bd58947ac6a45363ba780c9dfb127d3064/deepspeed/ops/transformer/inference/ds_attention.py#L171

shenzhuo commented 1 year ago

Hmm, the specific module is called BloomSelfAttention: https://github.com/microsoft/DeepSpeed/blob/194053bd58947ac6a45363ba780c9dfb127d3064/deepspeed/ops/transformer/inference/ds_attention.py#L171

It's a bug. I think...

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.