Closed: jordane95 closed this issue 1 year ago
Hey! Given how big the reproduction script is, I'm going to say this is probably related to the way you are wrapping the transformers models, and I would recommend asking on the forum to see if anyone in the community can help you with this! I won't have time to dive into this; maybe @younesbelkada can.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @jordane95 @ArthurZucker Sadly I won't have time to dig into that :/ @jordane95 do you still face the issue on the main branch of transformers?
Yeah, this seems to be a problem related to the siamese architecture. Although I can avoid this error by moving the loss computation from the compute_loss function of the Trainer class to the forward function of the model class, I'm still curious why this error occurs.
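For anyone hitting the same thing, here is a minimal sketch of that workaround, i.e. returning the in-batch contrastive loss from the model's forward so the default Trainer.compute_loss can simply pick up outputs["loss"]. All class, argument, and checkpoint names here are placeholders, not the original training script:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BiEncoderWithLoss(nn.Module):
    """Siamese bi-encoder that computes the in-batch contrastive loss inside
    forward, so the whole computation is visible to the DDP-wrapped module."""

    def __init__(self, model_name="bert-base-uncased", temperature=0.05):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.temperature = temperature

    def encode(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # [CLS] pooling

    def forward(self, q_input_ids, q_attention_mask, a_input_ids, a_attention_mask):
        q = self.encode(q_input_ids, q_attention_mask)    # (B, H)
        a = self.encode(a_input_ids, a_attention_mask)    # (B, H)
        scores = q @ a.t() / self.temperature              # (B, B): in-batch negatives
        labels = torch.arange(scores.size(0), device=scores.device)
        loss = nn.functional.cross_entropy(scores, labels)
        return {"loss": loss, "scores": scores}
```

Because forward returns a dict containing "loss", the stock Trainer uses it directly and no custom compute_loss override is needed (assuming the data collator produces the matching input keys).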
@jordane95 any idea what caused this error? Thanks
@anaivebird If you are using the BertModel class, try passing position_ids and token_type_ids in the model input manually (typically torch.arange(input_ids.size(1)).unsqueeze(0).expand_as(input_ids) and torch.zeros_like(input_ids)).
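For reference, a small sketch of that workaround (the checkpoint and example sentences are just placeholders):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["how are you?", "fine, thanks"], padding=True, return_tensors="pt")
input_ids = batch["input_ids"]                                    # (batch, seq_len)

# Build the ids explicitly instead of letting BertModel create them internally.
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0).expand_as(input_ids)
token_type_ids = torch.zeros_like(input_ids)

outputs = model(
    input_ids=input_ids,
    attention_mask=batch["attention_mask"],
    position_ids=position_ids,
    token_type_ids=token_type_ids,
)
```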
Thank you for your suggestion. I encountered the same issue as well, while trying to use SimCLR or another contrastive learning framework on top of the BERT class, and I was able to resolve the problem using the method you provided.
System Info
transformers version: 4.25.1

Who can help?
@sgugger @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I want to train an embedding-based retrieval QA system by minimizing the contrastive loss of correct (q, a) pairs against in-batch negatives, and I also want it to run on multiple GPUs. But I run into a problem with backward propagation in the position embedding layer of BERT (which I infer from the error log) when running in distributed mode. I don't know where the problem lies (Trainer? BertModel? PyTorch?).
By the way, the code works fine in the single-GPU setting.
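To make the setup concrete, here is a rough, hypothetical sketch of the kind of Trainer.compute_loss override described above; it is not the actual retrieval_qa.py, and the input keys and pooling choice are assumptions. As discussed earlier in the thread, moving this computation into the model's forward avoided the multi-GPU error:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class ContrastiveTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Encode queries and answers with the same (siamese) encoder.
        q_out = model(input_ids=inputs["q_input_ids"],
                      attention_mask=inputs["q_attention_mask"])
        a_out = model(input_ids=inputs["a_input_ids"],
                      attention_mask=inputs["a_attention_mask"])
        q = q_out.last_hidden_state[:, 0]    # (B, H), [CLS] pooling
        a = a_out.last_hidden_state[:, 0]    # (B, H)

        # Every other answer in the batch acts as a negative.
        scores = q @ a.t()                    # (B, B)
        labels = torch.arange(scores.size(0), device=scores.device)
        loss = F.cross_entropy(scores, labels)
        return (loss, {"scores": scores}) if return_outputs else loss
```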
Command that I ran:
Error details:
Source code of retrieval_qa.py:
Expected behavior
Currently there is no problem on a single GPU; I want this code to also run normally on multiple GPUs, but something seems to be broken somewhere. It's hard to find where the problem is because I'm not very familiar with how PyTorch, the Trainer, and BertModel work in distributed mode. Could you help me? Thanks!