Hi @shenzhuo,

The linked PR was closed and the commits not added; the PR introducing the change was #18344. From the PR description, it seems converting `causal_mask` to `bool` was intentional and not a side effect. I'll let @thomasw21 explain why this change was made :)
Yeah, so there's no reason for `attention_mask` to be int64, since it basically stores boolean values. I think the reason this is breaking is DeepSpeed: the forward function is overridden by custom operations on the DeepSpeed side: https://github.com/microsoft/DeepSpeed/blame/194053bd58947ac6a45363ba780c9dfb127d3064/deepspeed/ops/transformer/inference/ds_attention.py#L168

I would suggest fixing this on the DS side, i.e. probably changing `(1 - input_mask).to(target_dtype) * minus_inf` to something like `(~input_mask).to(target_dtype) * minus_inf`.
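A minimal sketch of why this breaks, reusing the `input_mask` and `minus_inf` names from the snippets above (the float16 dtype is an assumption for illustration):

```python
import torch

# A boolean attention mask: True = attend, False = padded/masked out.
input_mask = torch.tensor([[True, True, False]])
minus_inf = torch.finfo(torch.float16).min  # stand-in for -inf

# The original DeepSpeed expression assumes an integer mask. On a bool
# tensor, integer subtraction raises:
#   RuntimeError: Subtraction, the `-` operator, with a bool tensor is
#   not supported. If you are trying to invert a mask, use the `~` or
#   `logical_not()` operator instead.
try:
    bias = (1 - input_mask).to(torch.float16) * minus_inf
except RuntimeError as err:
    print(err)

# Inverting with `~` is valid on bool tensors and gives the same
# additive bias as before:
bias = (~input_mask).to(torch.float16) * minus_inf
print(bias)  # ~0 where attention is allowed, -65504. where masked
```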
I think DeepSpeed uses `(1 - input_mask).to(target_dtype) * minus_inf` because their framework is tested against the OPT model. At the same time, many `modeling_x.py` files in `transformers` still return an int64 mask.
Hum, the specific module is called `BloomSelfAttention`: https://github.com/microsoft/DeepSpeed/blob/194053bd58947ac6a45363ba780c9dfb127d3064/deepspeed/ops/transformer/inference/ds_attention.py#L171
It's a bug. I think...
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

`transformers` version: 4.27.1

Who can help?

@thomasw21 @patrickvonplaten @sgugger

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
The error is:

I want to know why this PR, https://github.com/huggingface/transformers/pull/18141/files, changed the following code:

`expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask`

to:

`expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask`
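A minimal sketch of the difference (the mask contents and the True-means-masked convention here are illustrative assumptions):

```python
import torch

# Bool masks where True marks a position that must NOT be attended to.
causal_mask = torch.tensor([[False, True,  True],
                            [False, False, True],
                            [False, False, False]])  # hide future tokens
padding_mask = torch.tensor([[False, False, True]]).expand(3, 3)  # hide a pad token

# With bool masks, `|` takes the union of masked positions: a position
# is masked if either mask masks it, and the result stays a bool mask.
combined = causal_mask | padding_mask
print(combined)

# With the old int64 0/1 masks, `+` played the same role, but positions
# masked by both masks sum to 2, so the result is no longer a clean
# 0/1 mask.
print(causal_mask.long() + padding_mask.long())
```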
Because of the change, `causal_mask` is a `torch.bool` tensor, not `torch.int64`.

Expected behavior

`causal_mask` is a `torch.int64` tensor, not `torch.bool`.