huawei-noah / Pretrained-Language-Model

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Bug in TinyBERT data augmentation? #141

Closed gowtham1997 closed 3 years ago

gowtham1997 commented 3 years ago

Hello,

https://github.com/huawei-noah/Pretrained-Language-Model/blob/54ca698e4f907f32a108de371a42b76f92e7686d/TinyBERT/data_augmentation.py#L147-L154

In line 154, the tokenized text is truncated to at most 512 tokens if it exceeds that length, but the corresponding tokenized_len computed in line 149 is not updated.

The segment_ids built in the subsequent lines therefore use the stale tokenized_len, which causes a shape mismatch in the forward pass.

 File "/lusnlsas/paramsiddhi/iitm/vinodg/glue_data_generation/plm/TinyBERT/transformer/modeling.py", line 361, in forward
    embeddings = words_embeddings + position_embeddings + token_type_embeddings
RuntimeError: The size of tensor a (512) must match the size of tensor b (763) at non-singleton dimension 1

The error above occurs when I try to generate data augmentations with the bert-base-cased model.
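For illustration, here is a minimal sketch of the kind of fix I have in mind. The function build_inputs and its arguments are hypothetical stand-ins for the variables in data_augmentation.py, not the actual upstream code; the point is only that every length derived from the pre-truncation sequence must be recomputed after truncation.

```python
# Minimal sketch (hypothetical names, not the exact upstream code):
# truncate first, then recompute every length that segment_ids depends on.

MAX_SEQ_LEN = 512  # BERT's maximum sequence length


def build_inputs(tokenized_text, tokenized_len):
    """tokenized_text: full token list; tokenized_len: length of the first segment."""
    # Truncate the token sequence to the model's limit ...
    if len(tokenized_text) > MAX_SEQ_LEN:
        tokenized_text = tokenized_text[:MAX_SEQ_LEN]
        # ... and clamp the first-segment length too; otherwise the stale
        # tokenized_len yields segment_ids longer than 512 tokens and the
        # embedding sum fails with the size-mismatch error shown above.
        tokenized_len = min(tokenized_len, MAX_SEQ_LEN)

    segment_ids = [0] * tokenized_len + [1] * (len(tokenized_text) - tokenized_len)
    assert len(segment_ids) == len(tokenized_text) <= MAX_SEQ_LEN
    return tokenized_text, segment_ids
```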

zwjyyc commented 3 years ago

Thanks! We agree with your comment and will fix this bug. A pull request is also welcome.

gowtham1997 commented 3 years ago

@zwjyyc I've submitted a pull request for this fix. Could you please review it?