Closed geekifan closed 1 week ago
I think the GPU work blocks on the Python global interpreter lock (GIL).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
it is not stale
I apologize for my late reply. 😢 I have been busy with my project these days. I finally figured out today that the cause was my very small dataset: the dataloader loaded everything into memory, so it stopped loading during training. It was not easy to locate this until I happened to use a larger dataset. Thanks for your reply. @bghira
System Info

Information

Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I use a PyTorch dataloader with the Trainer from transformers. Since accelerate is the dataloader backend of the Trainer, I think the problem is caused by accelerate.
I use a simple training script to distill a CLIP model. Part of my code:

The code of my dataloader:
The dataloader loads data in bursts of about 1/100 of the total steps. If I train for 4500 steps, the dataloader first fetches enough data, then stops fetching while the GPUs start training. After 45 steps the GPUs hang, and the dataloader starts fetching again. GPU usage is very low because of this alternation, and I think it is a bug (or maybe a designed feature?).
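The reporter's actual snippets are not preserved above, but the alternation described here can be reproduced with a small stand-in: a bounded buffer that a loader thread fills in bursts while a trainer thread drains it. All names and sizes below are assumptions for illustration, not the reporter's code; the point is that when the producer pauses between bursts, the consumer (the GPU step) blocks on an empty buffer, which shows up as low, spiky GPU utilization.

```python
# Hypothetical simulation of the reported fetch/train alternation.
# A "loader" thread fills a bounded queue in bursts and pauses between
# bursts; the "trainer" thread drains the queue and blocks (GPU idles)
# whenever the queue runs empty.
import queue
import threading
import time

TOTAL_BATCHES = 90   # assumed small run, just for the demo
BURST_SIZE = 45      # loader produces ~1/100 of total steps per burst (per the report)

buffer = queue.Queue(maxsize=BURST_SIZE)
consumed = []

def loader():
    for i in range(TOTAL_BATCHES):
        buffer.put(i)                 # blocks once the buffer is full
        if (i + 1) % BURST_SIZE == 0:
            time.sleep(0.2)           # simulated stall between bursts

def trainer():
    for _ in range(TOTAL_BATCHES):
        consumed.append(buffer.get()) # blocks when the buffer is empty

t_load = threading.Thread(target=loader)
t_train = threading.Thread(target=trainer)
t_load.start(); t_train.start()
t_load.join(); t_train.join()

assert consumed == list(range(TOTAL_BATCHES))
```

The training loop still sees every batch in order, so nothing crashes; the only symptom is the idle time while `buffer.get()` waits, which matches the "GPUs hang, then the dataloader fetches again" pattern above.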
Expected behavior
The dataloader continues to fetch data while the GPUs are running, so the GPUs never stop to wait for data.
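The desired behavior can be sketched as a background prefetcher: a thread keeps reading from the underlying iterator into a bounded queue while the consumer computes, so fetching and compute overlap. This is a simplified stdlib illustration of the pattern, not accelerate's implementation; in real PyTorch the equivalent knobs are `DataLoader`'s `num_workers > 0`, `prefetch_factor`, and `persistent_workers=True`.

```python
# Minimal background-prefetch sketch (illustrative, not accelerate's code):
# a daemon thread fetches ahead into a bounded queue while the consumer runs.
import queue
import threading

_END = object()  # sentinel marking iterator exhaustion

class BackgroundPrefetcher:
    def __init__(self, iterable, buffer_size=8):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._fill, args=(iterable,), daemon=True
        )
        self._thread.start()

    def _fill(self, iterable):
        for item in iterable:
            self._queue.put(item)   # fetch ahead while the consumer is busy
        self._queue.put(_END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()
        if item is _END:
            raise StopIteration
        return item

batches = list(BackgroundPrefetcher(range(10)))
```

As long as the buffer never fully drains, the consumer's `get()` returns immediately and the GPUs stay busy; the buffer size bounds memory use the same way `prefetch_factor * num_workers` does in PyTorch.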