Open · rangehow opened this issue 2 months ago
t5 is a fairly old model, this is probably expected? If you find a fix feel free to open a PR! 🤗
Yes, but strangely enough, Bart supports it. I would be happy to give it a try, but before that I would like to know whether this issue can be reproduced on your side; that would help me narrow the scope of the investigation.
To be honest, multiprocessing is outside the scope of transformers, and we usually recommend using accelerate 😉. FSDP is also a possible solution, as is DeepSpeed. Maybe making the tutorials about those more discoverable would be the best solution.
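The accelerate route suggested above can be sketched as follows. This is a minimal sketch, not an official recipe: `main()` assumes `accelerate` is installed and more than one GPU is visible, the model name `google/t5-v1_1-large` is taken from this issue, and `contiguous_shards` is a hypothetical helper included only to illustrate roughly how `PartialState.split_between_processes` divides the data.

```python
import math

def contiguous_shards(items, num_procs):
    # Roughly how accelerate's split_between_processes divides the input:
    # contiguous chunks, with the last process possibly receiving fewer items.
    chunk = math.ceil(len(items) / num_procs)
    return [items[i * chunk:(i + 1) * chunk] for i in range(num_procs)]

def main():
    # Run with: accelerate launch infer.py  (one process per visible GPU).
    from accelerate import PartialState
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    state = PartialState()
    tok = AutoTokenizer.from_pretrained("google/t5-v1_1-large")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-large").to(state.device)

    prompts = ["translate English to German: Hello.",
               "translate English to French: Good morning."]  # placeholder data

    # Each process transparently receives its own slice of `prompts`.
    with state.split_between_processes(prompts) as shard:
        inputs = tok(shard, return_tensors="pt", padding=True).to(state.device)
        out = model.generate(**inputs, max_new_tokens=32)
        print(state.process_index, tok.batch_decode(out, skip_special_tokens=True))

# main()  # uncomment and start via `accelerate launch` on a multi-GPU machine
```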
I think there is a real need here: even when there is sufficient GPU memory, we may want to distribute data across many cards so that multiple GPUs run inference in parallel. This behavior is similar to DDP, but does not involve any partitioning of parameters or optimizer state. Multiprocessing is one part of DDP, and I have essentially extracted the smallest piece needed to achieve this. In 2023 I saw a 🤗 staff member on the forum mention plans to support this, and since I haven't seen the feature land yet, I tried to implement it myself. It currently runs correctly with many models on the Hugging Face Hub; only T5 hits this issue. At this point it may be a bit beyond my technical stack, so I hope friends in the community can work together to improve it 😃
The hardest part for me is that debugging a multi-process program is very complex, and pdb cannot set breakpoints properly in the worker processes. 😟
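The DDP-style setup described above (shard the data, replicate the model, no parameter partitioning) can be sketched with `torch.multiprocessing`. This is a hedged sketch under the assumptions that at least two GPUs are visible and that `model_dir` points at a seq2seq checkpoint; `shard`, `worker`, and `main` are hypothetical names for illustration, not part of any library.

```python
import math

def shard(data, rank, world_size):
    # Contiguous slice of `data` owned by process `rank`; shards do not
    # overlap and together cover the whole dataset.
    per_rank = math.ceil(len(data) / world_size)
    return data[rank * per_rank:(rank + 1) * per_rank]

def worker(rank, world_size, model_dir, data, queue):
    # Child process: load the model on its own GPU, run its shard, report back.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = f"cuda:{rank}"
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()
    for text in shard(data, rank, world_size):
        inputs = tok(text, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=32)
        queue.put((rank, tok.decode(out[0], skip_special_tokens=True)))

def main(model_dir="google/t5-v1_1-large"):
    import torch
    import torch.multiprocessing as mp

    world_size = torch.cuda.device_count()  # assumed > 1 for this sketch
    data = ["translate English to German: Hello.",
            "translate English to German: Good night."]
    ctx = mp.get_context("spawn")           # CUDA requires the "spawn" start method
    queue = ctx.Queue()
    mp.spawn(worker, args=(world_size, model_dir, data, queue), nprocs=world_size)
    while not queue.empty():
        print(queue.get())

# main()  # uncomment on a multi-GPU machine
```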
@ArthurZucker and @rangehow can I try it out?
Of course! Just do it 🎉
System Info
transformers version: 4.39.0.dev0

Who can help?
Hi, @ArthurZucker and @younesbelkada. I'm trying to split a dataset automatically across multiple GPUs (a bit like data parallelism) for inference. But strange things happen when using the T5 model in HF, while other models (e.g. Bart) work correctly, so I suspect there is a problem in the T5 implementation. Would you mind helping check it out? :)
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The following code should be quite easy to reproduce. All you need to do is replace the model_dir in the main function with a specific model, such as google/t5-v1_1-large, and make sure CUDA_VISIBLE_DEVICES lists more than one GPU.
Expected behavior
The T5 model should run inference correctly under multiprocessing.