richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

multi-GPU training #5

Closed gchlodzinski closed 3 years ago

gchlodzinski commented 3 years ago

Hi,

First of all, the code in this repo does not work out of the box - it requires two edits (setting my_model to False and fixing the OWT dataset name). Anyway, the model does not seem to take advantage of multi-GPU setups. I tried to modify it (using DataParallel), but although memory gets allocated on the designated GPUs, the model is not training (maybe because my fastai knowledge is almost none). Could you give at least some hints on how to update the code to perform multi-GPU training?

richarddwang commented 3 years ago

Hi @gchlodzinski, thank you for pointing those out. I have fixed them in this release.

As for multi-GPU, it is indeed important. I gave a quick try to data parallel from here, but I also failed. I haven't had time to look into it, so please tell me if you figure out anything or have questions about the implementation, because it will be a while before I can try multi-GPU myself.

cooelf commented 3 years ago

Hi guys, did you find any possible solution?

I also tried to use DataParallel for multi-GPU training, following https://github.com/fastai/fastai/issues/1231.

learn.model = torch.nn.DataParallel(learn.model, device_ids=[0,1])

The loss turned out to be zero and the training ran quite fast. Could it be an issue with data loading when using multi-GPU?
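For reference, a quick standalone sanity check that DataParallel itself works on the machine (toy model, hypothetical shapes, assuming two visible GPUs) looks something like this:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Standalone sanity check (not from the repo): a toy module replicated over two GPUs
    # should produce a reasonable, nonzero loss if DataParallel itself is working.
    toy = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2)).cuda()
    toy = nn.DataParallel(toy, device_ids=[0, 1])

    x = torch.randn(32, 128, device="cuda:0")        # the batch is scattered across the GPUs
    y = torch.randint(0, 2, (32,), device="cuda:0")
    loss = F.cross_entropy(toy(x), y)
    print(loss.item())                               # expect roughly ln(2) ≈ 0.69, not 0.0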

btw, this is an awesome repo. Thanks for your work @richarddwang

gchlodzinski commented 3 years ago

Hi,

Unfortunately I do not have time to work on it at the moment, so no - I have not resolved it yet. I will resume working on this in mid-November and will keep you posted on my findings.

richarddwang commented 3 years ago

@cooelf That means something raised an error in the training loop. Because the RunSteps callback catches any exception in the training loop, it keeps swallowing the exception and simply continues to the next step. You may want to comment out RunSteps, or set a breakpoint just before the spot where things could go wrong, to debug the issue.
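For what it's worth, another way to surface the swallowed exception without touching the callbacks is to push a single batch through the wrapped model by hand, outside the training loop, where nothing catches the error. A rough sketch, with hypothetical names (`electra_model` and `train_dataloader` stand in for whatever pretrain.py actually builds):

    import torch

    # Hypothetical debugging sketch: run one forward pass manually so that a device-mismatch
    # or shape error shows up as a normal traceback instead of being silently skipped.
    model = torch.nn.DataParallel(electra_model, device_ids=[0, 1]).cuda()  # `electra_model` assumed
    batch = next(iter(train_dataloader))                                    # `train_dataloader` assumed
    with torch.no_grad():
        outputs = model(*[t.cuda(0) for t in batch])  # DataParallel scatters inputs from device 0
    print(type(outputs))  # reaching this line means the forward pass itself is not what crashes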

gchlodzinski commented 3 years ago

Hi,

Looks like the following new version of ELECTRAModel's sample method fixes the crash, at least in my case:

    def sample(self, logits):
        "Reimplement gumbel softmax because there is a bug in torch.nn.functional.gumbel_softmax when fp16 (https://github.com/pytorch/pytorch/issues/41663). Gumbel softmax is equal to what the official ELECTRA code does: standard gumbel dist. = -ln(-ln(standard uniform dist.))"
        if c.sampling == 'fp32_gumbel':
            # Sample Gumbel noise and move it onto the logits' device, so each DataParallel replica stays on its own GPU.
            sample = self.gumbel_dist.sample(logits.shape).to(logits.device)
            return (logits.float() + sample).argmax(dim=-1)
        elif c.sampling == 'fp16_gumbel':  # 5.06 ms
            sample = self.gumbel_dist.sample(logits.shape).to(logits.device)
            return (logits + sample).argmax(dim=-1)
        elif c.sampling == 'multinomial':  # 2.X ms
            return torch.multinomial(F.softmax(logits, dim=-1), 1).squeeze()

I only came back to this problem today and I am not sure this is all that needs to change, but I will continue working on multi-GPU pretraining this week and will let you know what I find.
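For context, the part that seems to matter for multi-GPU is keeping the sampled Gumbel noise on `logits.device`: under DataParallel each replica receives logits on a different GPU, so noise created on one fixed device would cause a device mismatch inside the replicas. A minimal standalone illustration (hypothetical shapes, not the repo's ELECTRAModel, needs at least two GPUs to run):

    import torch

    # Minimal illustration: the Gumbel noise follows logits.device, so sampling works
    # no matter which GPU a DataParallel replica happens to run on.
    gumbel = torch.distributions.gumbel.Gumbel(0., 1.)

    def sample_tokens(logits):
        noise = gumbel.sample(logits.shape).to(logits.device)
        return (logits.float() + noise).argmax(dim=-1)

    if torch.cuda.device_count() >= 2:
        for dev in ("cuda:0", "cuda:1"):
            logits = torch.randn(2, 8, 100, device=dev)  # (batch, seq_len, vocab_size)
            print(dev, sample_tokens(logits).device)     # sampled ids stay on each replica's GPU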

PhilipMay commented 3 years ago

> Looks like the following new version of ELECTRAModel's sample method fixes the crash, at least in my case: [...]

I can confirm that this fix works for multi-GPU training.

richarddwang commented 3 years ago

Thanks to all of you - it looks good. I've committed it to pretrain.py.

amritalok commented 3 years ago

I have been able to set up the multi-GPU code with the same setup; however, for the small++ model, a single GPU with batch size 16 is much faster than multi-GPU with batch size 32. The GPU utilization for the multi-GPU run is shown below for process #121398.

[screenshot: GPU utilization for process #121398]

PhilipMay commented 3 years ago

> The GPU utilization for the multi-GPU run is shown below for process #121398.

@amritalok I don't see any GPU utilization in your screenshot. Could you please explain that?

amritalok commented 3 years ago

@PhilipMay I have updated the screenshot.

PhilipMay commented 3 years ago

@amritalok I observed the same behavior when I trained with 8 V100 GPUs.