Open liqi6811 opened 1 year ago
I would be very surprised if this famous BERT model had such an issue.
Could you provide your system environment, including the PyTorch version? You can run the command transformers-cli env and copy-paste its output.
Actually @ydshieh, I think this is pretty valid, and we have a bunch of issues with in-place operations preventing FSDP training. This is not limited to the embeddings; I have seen other places where the code fails. See the linked issue for more details.
@ArthurZucker Thanks. I know this problem exists; I was involved in #24525.
My main concern here is whether this issue (for BERT) only happens with TorchDistributor (or FSDP, as you said).
In #24525, it seems to happen without these other tools. And BERT has existed for so long that I am somewhat confused about what exactly triggers this error.
@ydshieh The system environment is below:
transformers version: 4.29.2
@ydshieh @ArthurZucker I am working in Azure Databricks. I used Horovod for distributed training, and the in-place operation does not cause any issue there, but Horovod on 4 GPUs is only 1.6 times faster than on 1 GPU. TorchDistributor can be nearly 4 times faster. However, TorchDistributor does not work because of the in-place operation. I tried subclassing to remove the in-place operations, but it is not easy :). Hopefully you can release an update. Thanks a lot.
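For readers wondering what the subclassing workaround might look like, here is a minimal sketch. It uses a toy module, ToyEmbeddings, which is a hypothetical stand-in and not the real BertEmbeddings; the class names, sizes, and forward logic are assumptions made for illustration only.

```python
import torch
from torch import nn


class ToyEmbeddings(nn.Module):
    """Hypothetical stand-in for an embeddings layer (NOT the real BertEmbeddings)."""

    def __init__(self, vocab_size=10, max_positions=8, hidden_size=4):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_positions, hidden_size)

    def _position_ids(self, input_ids):
        # Shape (1, seq_len); broadcasts over the batch dimension.
        return torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)

    def forward(self, input_ids):
        embeddings = self.word_embeddings(input_ids)
        # In-place add: mutates the tensor returned by word_embeddings.
        embeddings += self.position_embeddings(self._position_ids(input_ids))
        return embeddings


class OutOfPlaceToyEmbeddings(ToyEmbeddings):
    """Subclass that repeats forward() with the in-place add replaced."""

    def forward(self, input_ids):
        embeddings = self.word_embeddings(input_ids)
        # Out-of-place add: allocates a new tensor instead of mutating one.
        embeddings = embeddings + self.position_embeddings(self._position_ids(input_ids))
        return embeddings


torch.manual_seed(0)
original = ToyEmbeddings()
patched = OutOfPlaceToyEmbeddings()
patched.load_state_dict(original.state_dict())  # share identical weights

input_ids = torch.randint(0, 10, (2, 5))
original_out = original(input_ids)
patched_out = patched(input_ids)
outputs_match = torch.allclose(original_out, patched_out)
```

The subclass has to restate the whole forward method just to change one line, which is probably why this approach gets tedious for a real model with many layers.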
@ydshieh @ArthurZucker I would suggest doing a thorough check for all in-place operations and removing all of them :).
System Info
databricks
Who can help?
@ArthurZucker @younesbelkada
Hi team,
I got an error message when using TorchDistributor.
I checked the class BertEmbeddings (URL below). At line 238, embeddings += position_embeddings is an in-place operation. Would you be able to change it to embeddings = embeddings + position_embeddings, so that it works with TorchDistributor?
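To make the difference between the two forms concrete, here is a small standalone sketch in plain PyTorch. It does not use BertEmbeddings or TorchDistributor, and the helper names are hypothetical. The last part shows one generic way an in-place update can break autograd; it is an illustration only and is not claimed to be the exact failure from this report.

```python
import torch


def add_in_place(embeddings, position_embeddings):
    # Same form as `embeddings += position_embeddings`: mutates the input tensor.
    embeddings += position_embeddings
    return embeddings


def add_out_of_place(embeddings, position_embeddings):
    # Same form as `embeddings = embeddings + position_embeddings`: new tensor.
    embeddings = embeddings + position_embeddings
    return embeddings


positions = torch.full((2, 4), 0.5)

a = torch.ones(2, 4)
out_in_place = add_in_place(a, positions)
# The result reuses the memory of `a`, so `a` itself has changed.
same_storage_in_place = out_in_place.data_ptr() == a.data_ptr()

b = torch.ones(2, 4)
out_out_of_place = add_out_of_place(b, positions)
# The result lives in freshly allocated memory; `b` is untouched.
same_storage_out_of_place = out_out_of_place.data_ptr() == b.data_ptr()

values_match = torch.equal(out_in_place, out_out_of_place)

# Generic autograd illustration: exp() saves its output for the backward
# pass, so modifying that output in place invalidates the saved value.
x = torch.randn(3, requires_grad=True)
y = x.exp()
y += 1
try:
    y.sum().backward()
    in_place_backward_failed = False
except RuntimeError:
    in_place_backward_failed = True

# The out-of-place version leaves the saved output intact.
x2 = torch.randn(3, requires_grad=True)
y2 = x2.exp()
y2 = y2 + 1
y2.sum().backward()
out_of_place_grad_ok = x2.grad is not None
```

Both forms produce the same values; they differ only in whether an existing tensor is overwritten, which is what matters to tools that track or wrap tensors during distributed training.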
BertEmbeddings url: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
TorchDistributor sample code: https://docs.databricks.com/_extras/notebooks/source/deep-learning/torch-distributor-notebook.html
Thank you very much! Ling
Reproduction
Expected behavior
The error below should disappear.