richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

GPU utilization falls back to 0% when training with multiple GPUs #21

Closed PhilipMay closed 3 years ago

PhilipMay commented 3 years ago

Hi,

when training with multiple GPUs and DataParallel, GPU utilization drops to 0% between batches. I suspect the tokenizer is the bottleneck. Do you see a way to improve performance?

Thanks Philip

richarddwang commented 3 years ago

Preprocessing of the data is precomputed and cached; afterwards the tokenizer is only used to provide pad_id and other attributes, so the tokenizer is probably not the problem.
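For context, a minimal sketch of this pattern, assuming the HuggingFace `datasets` library (dataset name and tokenizer checkpoint below are only placeholders, not the repo's actual config): tokenization happens once in `map`, whose output is cached on disk, so the tokenizer is never called inside the training loop.

```python
# Sketch: pre-tokenize and cache the corpus so no tokenization happens at train time.
from datasets import load_dataset
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-generator")
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # placeholder corpus

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

# `map` writes the tokenized Arrow file to the datasets cache; later runs reuse it,
# and the training loop only reads already-tokenized tensors plus tokenizer.pad_token_id.
tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
```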

I haven't tried data parallelism yet, so I don't have good advice for you. If I ran into a problem like this, I would manually measure the time of each part or use the PyTorch profiler.
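As an illustration of that suggestion, here is a minimal, self-contained sketch (not code from this repo) that wraps a few training steps in `torch.profiler` and labels each phase, which should show whether the stalls come from data loading, host-to-device copies, or the model itself:

```python
# Sketch: profile a handful of training steps to locate the GPU idle time.
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profile_train_steps(model, dataloader, optimizer, device, n_steps=4):
    model.train()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for step, batch in enumerate(dataloader):
            if step >= n_steps:
                break
            with record_function("data_to_device"):
                batch = {k: v.to(device) for k, v in batch.items()}
            with record_function("forward"):
                loss = model(**batch).loss
            with record_function("backward"):
                loss.backward()
            with record_function("optimizer_step"):
                optimizer.step()
                optimizer.zero_grad()
    # Phases that dominate CPU time while CUDA time is low are the likely bottleneck.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```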

PhilipMay commented 3 years ago

It might be that the reason is that my corpus is very small. I think it is so small that it is processed in only 4 batches. After each epoch it then needs some time for the checks it runs to decide whether to save a checkpoint. I will report back when I know more.

PhilipMay commented 3 years ago

Is c.steps the actual number of epochs, or is it steps (batches)?

richarddwang commented 3 years ago

It is the actual number of batches (training steps) to compute.
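So the number of passes over the corpus depends on the corpus size and batch size. A small back-of-the-envelope example, with all numbers purely hypothetical (not the repo's defaults):

```python
# Hypothetical figures: converting c.steps (batches) into approximate epochs.
num_examples = 100_000   # size of the pretraining corpus (assumed)
batch_size   = 128       # per-step batch size (assumed)
steps        = 10_000    # c.steps: total number of batches to run (assumed)

batches_per_epoch = num_examples / batch_size      # ~781 batches per pass over the data
approx_epochs     = steps / batches_per_epoch      # ~12.8 passes over the data
print(f"{steps} steps is roughly {approx_epochs:.1f} epochs")
```

With a very small corpus, the same number of steps translates into many more epochs, which matches the end-of-epoch stalls described above.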

PhilipMay commented 3 years ago

Ok thanks.

When I use more training data, the utilization is much more stable.

Closing this.