Closed FrancescoSaverioZuppichini closed 1 year ago
I think the result does show the perf boost from caching.
First iteration took 9s, all the others 4s. Why? Shouldn't it be cached?
Your pipeline is relatively simple, so I think the major overhead is the data being passed between the main process and the worker processes. So you won't observe that significant a perf gain compared to running everything in the main process.
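To make the tradeoff concrete, here is a stdlib-only sketch (names and numbers are illustrative, not from the original pipeline) that approximates the per-item worker-to-main-process transfer with a pickle round-trip, which is roughly what multiprocessing pays per batch even when the data itself is already cached:

```python
import pickle
import time

def cached_items(n=1000):
    # Pretend these items are already held by an in-memory cache.
    return [bytes(10_000) for _ in range(n)]

def iterate_in_main_process(items):
    # No IPC: items are consumed directly.
    return sum(len(x) for x in items)

def iterate_via_worker(items):
    # Approximate the worker -> main-process transfer with a pickle
    # round-trip per item; real multiprocessing adds this kind of
    # serialization cost on top of the cached read.
    total = 0
    for x in items:
        x = pickle.loads(pickle.dumps(x))
        total += len(x)
    return total

items = cached_items()

t0 = time.perf_counter()
iterate_in_main_process(items)
main_s = time.perf_counter() - t0

t0 = time.perf_counter()
iterate_via_worker(items)
worker_s = time.perf_counter() - t0

print(f"main process: {main_s:.4f}s, simulated worker IPC: {worker_s:.4f}s")
```

For a cheap-per-item pipeline like this one, the serialization overhead can easily exceed the cost of the work itself, which is why 0 workers can win.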
@ejguan any way for me to check it? Using ffcv resulted in a similar speed as the second snippet.
@FrancescoSaverioZuppichini Sorry, what do you mean? I feel your pipeline would be better off using DataLoader with 0 workers to get rid of the multiprocessing pieces.
Don't you need multiple workers to speed things up and preload batches to GPU?
It's all about tradeoffs, right? Your current pipeline suffers more from multiprocessing than it benefits from it. And multiple workers won't help preload batches to the GPU.
For DataLoader, turning on pin_memory=True would help: batches are first copied into page-locked (pinned) host memory, so the subsequent copy from host memory to the GPU has minimal cost.
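A minimal sketch of that pattern (the dataset and shapes are arbitrary placeholders): pin the batches on the host side, then move them to the device explicitly inside the loop. `non_blocking=True` only actually overlaps the copy with compute when the source tensor is pinned.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; shapes are arbitrary.
ds = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# pin_memory=True asks the DataLoader to copy each batch into page-locked
# host memory, making the later host-to-GPU copy cheaper and asynchronous.
# Only enable it when CUDA is actually available.
loader = DataLoader(ds, batch_size=32, pin_memory=torch.cuda.is_available())

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    # The DataLoader itself never moves data to the GPU; you do it here.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break
```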
@ejguan sure, I don't have strong opinions about best practices here; I'm just wondering what the best way to do things is. The PyTorch docs suggest DataLoader + multiple workers is the way to go, so I'd like to know if I should apply the same approach with torchdata.
I was wondering if you could give me a little more context about "multi-worker won't help to preload batches to GPU."
DataLoader won't move data to the GPU, regardless of whether multiple workers are used. And, for TorchData, we do have a plan to support GPU operations in the future, but it's still under discussion.
Thanks a lot for the highlights!
Closing it for now. For GPU operations, we already have an open issue: https://github.com/pytorch/data/issues/761
Please feel free to reopen it if you have further issues on the same topic.
🐛 Describe the bug
Weird behaviour of InMemoryCacheHolder not really speeding things up. First iteration took 9s, all the others 4s. Why? Shouldn't it be cached?
Output
If I set num_workers=1, the first iteration is faster, and then all the others are the same. If I use .batch(32) (useless in RL, since to my understanding I need more workers to prepare the next batches), I see a speed up. Thanks!
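A simple way to check the per-epoch timing pattern described above is a small harness that times several full passes; the cached source below is a stand-in (not torchdata's actual InMemoryCacheHolder) whose first pass does the expensive load:

```python
import time

def time_epochs(iterable_factory, epochs=3):
    """Time several full passes over a (possibly cached) iterable.

    iterable_factory: zero-arg callable returning a fresh iterable,
    e.g. `lambda: DataLoader(datapipe, num_workers=0)`.
    """
    timings = []
    for _ in range(epochs):
        start = time.perf_counter()
        for _ in iterable_factory():
            pass
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in for a pipeline with a one-time expensive load: only the
# first iteration populates the cache, later ones read from it.
class FakeCachedSource:
    def __init__(self):
        self.cache = None
    def __iter__(self):
        if self.cache is None:
            self.cache = list(range(1000))  # "expensive" first pass
        return iter(self.cache)

src = FakeCachedSource()
print(time_epochs(lambda: src))
```

If the first timing dominates and the rest are flat but nonzero, the cache is working and the residual per-epoch cost lies elsewhere (e.g. worker IPC).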
Versions