Closed Lancelot39 closed 2 years ago
Thanks for the nice words and for raising the issue.
This is indeed a `bug`. Nevertheless, the pad_token_id actually does not matter here, because the attention_mask masks out the pad positions. Given a correct attention_mask, the pad tokens can be any token ids. I wonder whether this actually causes a performance loss. What results did you reproduce?
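The point above can be illustrated with a minimal sketch of additive-mask softmax attention (pure Python, not code from this repo; the score values are made up): with the same attention_mask, sequences that differ only in the ids sitting at padded positions produce identical attention weights, so the pad id itself never influences the output.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attn_weights(scores, mask):
    # mask: 1 = attend, 0 = ignore; masked scores are pushed to -inf
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    return softmax(masked)

# Two score vectors identical except at the two padded slots;
# with the same attention_mask, the weights come out the same.
mask = [1, 1, 1, 0, 0]
w1 = attn_weights([0.5, 1.0, -0.2, 9.9, 9.9], mask)  # junk scores at pads
w2 = attn_weights([0.5, 1.0, -0.2, 0.0, 0.0], mask)
print(w1 == w2)  # True: masked positions get weight 0.0 either way
```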
Best
Sorry for the late response.
I agree that the pad token can be any token. But once the following code is used to build the attention mask, a possible bug appears: the CLS position (token id 0 in RoBERTa) is also masked.
```python
mask = input_ids == pad_token
```
I have successfully reproduced the results reported in the paper; I only noticed this problem while revising the code to use other dense retrieval methods. Anyway, it is not a serious problem, just a possible pitfall for users who are not familiar with this task. I raised this issue as a notification. :)
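The pitfall above can be sketched in a few lines of plain Python (assumed RoBERTa-base ids, `<s>`/CLS = 0 and `<pad>` = 1; this is an illustration, not code from the repo). If padding is written as 0 and the mask is built by comparing against that value, the CLS position is masked along with the pads:

```python
# Assumed RoBERTa-base special token ids (not from the repo's code)
CLS_ID, PAD_ID = 0, 1

def attention_mask(input_ids, pad_token):
    # 1 = attend, 0 = masked; the inverse of `mask = input_ids == pad_token`
    return [0 if tok == pad_token else 1 for tok in input_ids]

# Correct: pad with RoBERTa's real pad id (1) and compare against it
ok_seq = [CLS_ID, 100, 200, PAD_ID, PAD_ID]
print(attention_mask(ok_seq, pad_token=1))    # [1, 1, 1, 0, 0]

# Buggy: pad with 0 (as in preprocess.py) and compare against 0;
# the CLS token (id 0) gets masked as well
buggy_seq = [CLS_ID, 100, 200, 0, 0]
print(attention_mask(buggy_seq, pad_token=0))  # [0, 1, 1, 0, 0]
```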
Best
Thanks for releasing this beautiful and easy-to-follow code! It is very helpful to dense retrieval researchers like me.
However, I may have found a bug in preprocess.py. Because the model is initialized from RoBERTa-base, the native pad token id is 1, not 0. Yet in this file the pad token is set to 0 (the [CLS] token id), which has prevented the pre-trained model checkpoint from reaching the results reported in your paper. https://github.com/jingtaozhan/DRhard/blob/main/preprocess.py#L23