microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

about attention mask #97

Closed hichoe95 closed 8 months ago

hichoe95 commented 8 months ago

In the official BEiT3 GitHub repository (https://github.com/microsoft/unilm/tree/master/beit3), they utilize a tokenizer from Hugging Face transformers. When conducting batch inference, it's necessary to pad the input texts and provide an attention mask to the model.
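For context, here is a minimal sketch of the padding convention in question, following the tokenizer setup described in the BEiT3 repository; the sentencepiece model path and the example captions are placeholders, not taken from this issue:

```python
# Sketch of how the Hugging Face tokenizer used by BEiT3 pads a batch.
# The path to the sentencepiece model below is a placeholder.
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("/path/to/beit3.spm")

batch = tokenizer(
    ["a short caption", "a much longer caption about two dogs playing"],
    padding=True,
    return_tensors="pt",
)

print(batch["attention_mask"])
# 1 marks real tokens, 0 marks padded positions; the shorter caption is
# padded out to the length of the longer one.
```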

However, I noticed an issue in your torchscale code at https://github.com/microsoft/torchscale/blob/d51f10354d57e67be82dc660505f18322e82d4af/torchscale/architecture/encoder.py#L122. I believe the condition should be inverted, like this: attn_mask = attn_mask.masked_fill(~attn_mask.to(torch.bool), -1e8).

This is because the attention_mask returned by the transformers tokenizer assigns the value 0 to padded positions (and 1 to real tokens).
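A small standalone sketch of the difference, assuming the line at encoder.py#L122 is the same call without the ~ (the values below are illustrative and the rest of the torchscale forward pass is not reproduced):

```python
import torch

# attention_mask as produced by the transformers tokenizer:
# 1 = real token, 0 = padding
attn_mask = torch.tensor([[1., 1., 1., 0., 0.],
                          [1., 1., 1., 1., 1.]])

# The existing line (as described in this issue) fills positions where the
# mask is non-zero, i.e. it puts -1e8 on the *real* tokens:
current = attn_mask.masked_fill(attn_mask.to(torch.bool), -1e8)

# Proposed fix: invert the condition so only the padded (0) positions
# receive the large negative value:
proposed = attn_mask.masked_fill(~attn_mask.to(torch.bool), -1e8)

print(current)   # -1e8 at real-token positions (wrong for this mask convention)
print(proposed)  # -1e8 at padded positions only
```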