In the official BEiT3 GitHub repository (https://github.com/microsoft/unilm/tree/master/beit3), they use a tokenizer from Hugging Face transformers. When running batch inference, the input texts need to be padded and an attention mask provided to the model.
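For reference, here is a minimal sketch of that batched tokenization step. It assumes the XLM-R sentencepiece tokenizer that the BEiT3 README loads; the `beit3.spm` path is illustrative:

```python
from transformers import XLMRobertaTokenizer

# Path to the sentencepiece model shipped with BEiT3 (placeholder path)
tokenizer = XLMRobertaTokenizer("beit3.spm")

batch = tokenizer(
    ["a photo of a cat", "a photo of a dog sitting on a couch"],
    padding=True,          # pad shorter texts to the longest in the batch
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # (2, max_len)
print(batch["attention_mask"])    # 1 = real token, 0 = padding
```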
However, I noticed an issue in your torchscale code at https://github.com/microsoft/torchscale/blob/d51f10354d57e67be82dc660505f18322e82d4af/torchscale/architecture/encoder.py#L122. I believe the mask condition should be inverted, like this: attn_mask = attn_mask.masked_fill(~attn_mask.to(torch.bool), -1e8).
This is because the attention_mask produced by the transformers tokenizer assigns the value 0 to padded indices (and 1 to real tokens), so without the inversion the -1e8 bias lands on the real tokens instead of the padding.
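To make the convention mismatch concrete, here is a small runnable sketch. The tensor values are made up and the variable names are mine, not torchscale's; the "current" line mirrors the masked_fill at L122 without the inversion, as described above:

```python
import torch

# Hypothetical mask for two sequences; the first is padded to length 5.
# Hugging Face tokenizers use 1 for real tokens and 0 for padding.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

bias = torch.zeros(attention_mask.shape, dtype=torch.float)

# Without inversion: fills where the mask is True, i.e. at the
# real-token positions, while the padding stays attendable.
current = bias.masked_fill(attention_mask.to(torch.bool), -1e8)

# With inversion: -1e8 lands on the padded positions instead,
# matching the tokenizer's 0-means-padding convention.
proposed = bias.masked_fill(~attention_mask.to(torch.bool), -1e8)

print(current)   # -1e8 on real tokens  (wrong)
print(proposed)  # -1e8 on padded slots (intended)
```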