Open geekifan opened 1 month ago
The speed shouldn't be the same, no? You're working with images, which take much longer to load into RAM, especially on a single worker, unless I am mistaken. You could try pillow-simd, which speeds up Pillow somewhat.
Thanks for your reply!
Of course loading on a single worker is slower than loading on multiple workers. But the biggest problem is that when I load images on a single worker, the CPU usage is much higher than when loading on 4 workers, and at the same time the single worker is 20x slower than 4 workers.
I think the expected behavior should be: loading on a single worker uses ~4x less CPU than loading on 4 workers, and the single worker is ~4x slower than 4 workers. The CPU usage should MATCH the CPU time.
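To put a rough number on that mismatch, here is a back-of-the-envelope sketch using the figures reported in the issue body (CPU load around 1000 at 1-2 messages/sec with 0 workers, around 100 at 20+ messages/sec with 4 workers). The function name is hypothetical, 1.5/sec is an assumed midpoint of "1-2/sec", and 20/sec is only a lower bound, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
def cpu_per_item(cpu_load, items_per_sec):
    """CPU load spent per item processed per second (arbitrary units)."""
    return cpu_load / items_per_sec


# Figures as reported in the issue; 1.5 is an ASSUMED midpoint of "1-2/sec",
# and 20 is a lower bound taken from "above 20/sec".
zero_workers = cpu_per_item(1000, 1.5)
four_workers = cpu_per_item(100, 20)

# If CPU usage matched CPU time, this ratio would be close to 1.
ratio = zero_workers / four_workers
print(f"0 workers: {zero_workers:.1f}, 4 workers: {four_workers:.1f}, ratio: {ratio:.0f}x")
```

If the per-image work were the same in both setups, the two per-item costs should be roughly equal, which is why the gap looks suspicious.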
Besides, it seems that the dataloader is NOT PREFETCHING when loading on multiple workers.
It is really weird to find that the dataloader only starts to load data every 1/100 of the total steps. It doesn't load any data while the GPU is running. Shouldn't the dataloader load data while the GPU is training?
The loading speed is only ~4x lower when you set dataloader_num_workers=1, not 0. With dataloader_num_workers=1, the dataloader keeps processing data while the GPU is training.
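To illustrate why loading in the background matters, here is a minimal standard-library sketch that compares loading inline with loading in a background thread. It is not how the PyTorch DataLoader is implemented (that uses worker processes); the sleep durations, function names, and queue size are all made-up stand-ins for image decoding and a GPU step.

```python
import queue
import threading
import time

LOAD_S = 0.02    # hypothetical time to load one batch
TRAIN_S = 0.02   # hypothetical time for one training step
N_BATCHES = 10


def load_batch(i):
    time.sleep(LOAD_S)   # stands in for reading and decoding images
    return i


def train_step(batch):
    time.sleep(TRAIN_S)  # stands in for the GPU forward/backward pass


def run_inline(n=N_BATCHES):
    """Load and train in the same thread: training waits for every load."""
    start = time.perf_counter()
    for i in range(n):
        train_step(load_batch(i))
    return time.perf_counter() - start


def run_prefetched(n=N_BATCHES, prefetch=2):
    """Load in a background thread so loading overlaps with training."""
    batches = queue.Queue(maxsize=prefetch)

    def producer():
        for i in range(n):
            batches.put(load_batch(i))
        batches.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    start = time.perf_counter()
    while (batch := batches.get()) is not None:
        train_step(batch)
    return time.perf_counter() - start
```

Calling run_inline() and run_prefetched() back to back should show the prefetched loop finishing sooner, because the next batch is being loaded while the current step is still running.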
System Info
transformers version: 4.45.2
Who can help?
@muellerzr @SunMarc
Reproduction
My dataset:
Part of my training script:
The model is a simple CLIPModel.
If dataloader_num_workers=0 and dataloader_pin_memory=True, the CPU load is around 1000, but the debug message (see my code above) prints only about 1-2 times/sec. See the image below.
If dataloader_num_workers=4, dataloader_pin_memory=True, dataloader_prefetch_factor=2 and dataloader_persistent_workers=True, the CPU load is around 100 and the debug message (see my code above) prints more than 20 times/sec.
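For anyone trying to reproduce this, the two setups can be restated as keyword arguments, which makes it easy to see exactly what differs between the runs. This is only a sketch: the dict names are hypothetical, and it is not the original training script.

```python
# The two dataloader setups described above, written as keyword arguments
# that could be passed to transformers.TrainingArguments.
single_process = {
    "dataloader_num_workers": 0,
    "dataloader_pin_memory": True,
}
four_workers = {
    "dataloader_num_workers": 4,
    "dataloader_pin_memory": True,
    "dataloader_prefetch_factor": 2,
    "dataloader_persistent_workers": True,
}

# Settings that differ between the slow run and the fast run.
changed = {
    key: value
    for key, value in four_workers.items()
    if single_process.get(key) != value
}
print(changed)
```

Either dict could then be unpacked into the arguments, for example TrainingArguments(output_dir="out", **four_workers), where "out" is a placeholder path.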
Expected behavior