Closed PhilipMay closed 3 years ago
Preprocessing of the data is precomputed and cached; the tokenizer is only kept around afterwards to provide `pad_id` and other attributes, so the tokenizer might not be the problem.
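For illustration, a minimal sketch of that setup (the cache path and the tokenizer's `encode`/`pad_id` interface are assumptions, not taken from this repo): the expensive tokenization runs once and is cached to disk, and the tokenizer object is only consulted for attributes afterwards.

```python
import os
import torch

CACHE_PATH = "tokenized_corpus.pt"  # hypothetical cache file

def load_or_tokenize(tokenizer, texts):
    """Run the expensive tokenization once; later calls load the cache."""
    if os.path.exists(CACHE_PATH):
        return torch.load(CACHE_PATH)
    ids = [tokenizer.encode(t) for t in texts]  # assumes an `encode` method
    torch.save(ids, CACHE_PATH)
    return ids

# After caching, the tokenizer is only needed for attributes,
# e.g. padding batches with tokenizer.pad_id.
```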
I haven't tried data parallelism yet, so I have no good advice for you. But if I ran into a problem like this, I would manually measure the time of each part or use the PyTorch profiler.
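As a concrete starting point, a minimal profiling sketch (the model and batches are stand-ins for the real training loop, and it assumes a CUDA device is available):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(512, 512).cuda()             # stand-in for the real model
batches = [torch.randn(32, 512) for _ in range(4)]   # stand-in for the DataLoader

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for batch in batches:
        with record_function("data_to_gpu"):   # label the host-to-device copy
            batch = batch.cuda()
        with record_function("forward"):       # label the compute
            out = model(batch)

# Shows where time is spent, e.g. whether the gap is in data loading or compute.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```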
It might be that my corpus is very small. I think it is so small that it is processed in just 4 batches. After each epoch there is a pause because of the checks it runs to decide whether to save a checkpoint. I will report back when I know more.
Is `c.steps` the real number of epochs, or is it the number of steps (batches)?
It is the actual number of batches to compute.
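To make that concrete, a small arithmetic sketch (the numbers are illustrative, and the assumption that `c.steps` counts batches across all epochs is mine, not confirmed here):

```python
import math

n_examples = 128  # illustrative corpus size
batch_size = 32   # illustrative batch size
epochs = 10

batches_per_epoch = math.ceil(n_examples / batch_size)  # 4, as in the "4 batches" above
total_steps = epochs * batches_per_epoch                # what c.steps would then count
print(batches_per_epoch, total_steps)
```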
Ok thanks.
When I use more training data, the GPU utilization is much more stable.
Closing this.
Hi,
when using multiple GPUs with DataParallel, the GPU utilization drops back to 0% between batches. I think the tokenizer is the bottleneck. Do you see a way to improve performance?
Thanks, Philip