In line 154, the tokenized text is truncated to at most 512 tokens when it exceeds that limit, but the corresponding `tokenized_len` computed in line 149 is not updated to match. The `segment_ids` built in the subsequent lines therefore use the stale `tokenized_len`, which causes a size-mismatch error in the forward pass:
```
  File "/lusnlsas/paramsiddhi/iitm/vinodg/glue_data_generation/plm/TinyBERT/transformer/modeling.py", line 361, in forward
    embeddings = words_embeddings + position_embeddings + token_type_embeddings
RuntimeError: The size of tensor a (512) must match the size of tensor b (763) at non-singleton dimension 1
```
This bug occurs when I try to generate data augmentations with the `bert-base-cased` model.
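A minimal, self-contained sketch of the fix described above. The helper name `build_segment_ids` and the surrounding structure are hypothetical (the actual code in `data_augmentation.py` differs); the point is only that `tokenized_len` must be recomputed after the truncation on line 154:

```python
MAX_LEN = 512  # BERT's maximum sequence length

def build_segment_ids(tokens, max_len=MAX_LEN):
    """Truncate the token list and keep tokenized_len consistent with it."""
    tokenized_len = len(tokens)          # analogous to line 149
    if tokenized_len > max_len:
        tokens = tokens[:max_len]        # analogous to the slice in line 154
        tokenized_len = len(tokens)      # proposed fix: update after truncation
    segment_ids = [0] * tokenized_len    # now matches the (possibly truncated) text
    return tokens, segment_ids

# A 763-token input (the length from the traceback) now yields 512 segment ids.
tokens, segment_ids = build_segment_ids(["tok"] * 763)
print(len(tokens), len(segment_ids))  # 512 512
```

Without the `tokenized_len = len(tokens)` line inside the `if`, `segment_ids` would still have 763 entries, reproducing the tensor-size mismatch from the traceback.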
Relevant code:
https://github.com/huawei-noah/Pretrained-Language-Model/blob/54ca698e4f907f32a108de371a42b76f92e7686d/TinyBERT/data_augmentation.py#L147-L154