jingtaozhan / DRhard

SIGIR'21: Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track.
BSD 3-Clause "New" or "Revised" License

A possible bug in Data Process #23

Closed Lancelot39 closed 2 years ago

Lancelot39 commented 2 years ago

Thanks for releasing this beautiful and easy-to-follow code! It is very helpful to dense retrieval researchers like me.

However, I may have found a bug in preprocess.py. Since the model is initialized from RoBERTa-base, the native pad token id is 1, not 0. Yet in this file the pad token is set to 0 (the [CLS] token id), which has kept the pre-trained model checkpoint from achieving the same results as reported in your paper. https://github.com/jingtaozhan/DRhard/blob/main/preprocess.py#L23
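
For reference, a minimal sketch of the mismatch, assuming the HuggingFace transformers tokenizer (the loading call here is my assumption, not code from this repo):

```python
from transformers import RobertaTokenizer

# Assumption: the tokenizer is loaded from the standard "roberta-base" checkpoint.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.pad_token_id)  # 1: RoBERTa's native <pad> id
print(tokenizer.cls_token_id)  # 0: the <s>/[CLS] id that preprocess.py pads with instead
```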

jingtaozhan commented 2 years ago

Thanks for the nice words and for raising the issue.

This is indeed a "bug". Nevertheless, the pad_token_id is actually not important because the attention_mask will mask out these pad tokens. Given the attention_mask, the pad tokens can be any tokens. I wonder whether this actually causes a performance loss. What are your reproduced results?
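
A minimal sketch of that argument, with made-up lengths (not code from this repo): as long as the attention mask is built from the true sequence length rather than from the token ids, the value stored at padded positions never enters attention.

```python
import torch

max_len = 8
seq_len = 5                                          # hypothetical true length of the sequence
input_ids = torch.zeros(max_len, dtype=torch.long)   # padded with 0, as preprocess.py does
# (the first seq_len positions would hold the real token ids)
attention_mask = torch.zeros(max_len, dtype=torch.long)
attention_mask[:seq_len] = 1                         # pad positions are masked whatever id they hold
```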

Best

Lancelot39 commented 2 years ago

Sorry for the late response.

I agree that the pad token can be any token. But if one rebuilds the attention mask with `mask = input_ids == pad_token`, a possible bug appears: the [CLS] position will also be masked, since its id equals the pad id. I have successfully reproduced the results reported in the paper, but noticed this problem while revising the code to use other dense retrieval methods. Anyway, it is not a serious problem, just a possible pitfall for users who are not familiar with this task. I raise this issue only as a notification. :)
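
A minimal sketch of the pitfall with made-up ids (0 = `<s>`/[CLS] and 2 = `</s>` for RoBERTa; the middle ids are placeholders): rebuilding the mask from the token ids wrongly flags the leading [CLS] as padding.

```python
import torch

pad_token = 0                                        # the value preprocess.py pads with
input_ids = torch.tensor([0, 4783, 7325, 2, 0, 0])   # <s> ... </s> plus padding (middle ids made up)
mask = input_ids == pad_token                        # intended to be True only at pad positions
print(mask)  # tensor([ True, False, False, False,  True,  True]): [CLS] is wrongly flagged too

# Padding with the tokenizer's real pad id (1 for RoBERTa) avoids the collision:
# mask = input_ids == tokenizer.pad_token_id
```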

Best