huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

Question/Bug about accelerator.gather (how to use accelerate/accelerator.gather for contrastive learning) #1154

Closed JWargrave closed 1 year ago

JWargrave commented 1 year ago

Hi, there.

I am new to accelerate and I've found that it really improves my development productivity. Thanks for your great work.

But I have some problems when using accelerator.gather.

I trained a simple resnet18 classifier on the CIFAR10 dataset. The training loop is:


for idx, (inputs, targets) in enumerate(train_loader):
    outputs = net(inputs)

    # ********************** loss plan 1 **********************
    loss = criterion(outputs, targets)
    # ********************** loss plan 1 **********************

    # ********************** loss plan 2 **********************
    # out_gather=accelerator.gather(outputs)
    # tar_gather=accelerator.gather(targets)
    # loss = criterion(out_gather, tar_gather)
    # ********************** loss plan 2 **********************

    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()

The code above works well and the training accuracy reaches about 70% after 10 epochs.

But there is a problem when I train as follows:


  for idx, (inputs, targets) in enumerate(train_loader):
      outputs = net(inputs)

      # ********************** loss plan 1 **********************
      # loss = criterion(outputs, targets)
      # ********************** loss plan 1 **********************

      # ********************** loss plan 2 **********************
      out_gather=accelerator.gather(outputs)
      tar_gather=accelerator.gather(targets)
      loss = criterion(out_gather, tar_gather)
      # ********************** loss plan 2 **********************

      optimizer.zero_grad()
      accelerator.backward(loss)
      optimizer.step()

With this version, the training loss barely changes and the training accuracy stays at about 10%, which is equivalent to random guessing.

The code above may look odd, but I don't see why it should be wrong; yet it clearly is.

(The reason I'm doing this is that I want to use accelerate for contrastive learning. In contrastive learning, the larger the batch size the better, because each sample in the batch uses every other sample in the batch as a negative example when computing its loss. For example, when I train on four GPUs with a per-GPU batch_size of 64, I want each sample to be contrasted against 64*4-1 = 255 negatives instead of 64-1 = 63. That is why I need accelerator.gather.)
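
To make the use case concrete, here is a rough sketch of the InfoNCE-style loss I have in mind (not the code I'm actually running; the function and argument names are just for illustration), assuming L2-normalized embeddings where row i of the two tensors forms a positive pair:

import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.07):
    # anchor_emb, positive_emb: (B, D) L2-normalized embeddings; row i of each forms a positive pair
    logits = anchor_emb @ positive_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # for row i, column i is the positive and the remaining B-1 columns are negatives
    return F.cross_entropy(logits, targets)

With a per-GPU batch of 64 on 4 GPUs, gathering the embeddings first gives B = 256, so each anchor is contrasted against 255 negatives instead of 63.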

The full code is as follows: (it works well for loss plan 1 but not for loss plan 2)


# main.py
# CUDA_VISIBLE_DEVICES="0,1,2,3" accelerate launch --multi_gpu main.py

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from accelerate import Accelerator

accelerator=Accelerator()

BATCH_SIZE = 256
EPOCHS = 10

if __name__ == "__main__":

    device = accelerator.device

    net = torchvision.models.resnet18(pretrained=False, num_classes=10)

    trainset = torchvision.datasets.CIFAR10(
        root="./data",
        train=True,
        download=True,
        transform=transforms.Compose(
            [
                transforms.RandomCrop(32, padding=4),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize(
                    (0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)
                ),
            ]
        ),
    )

    train_loader = torch.utils.data.DataLoader(
        trainset,
        batch_size=BATCH_SIZE,
        num_workers=4,
        pin_memory=True,
        shuffle=True
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        net.parameters(),
        lr=0.01 * 2,
        momentum=0.9,
        weight_decay=0.0001,
        nesterov=True,
    )

    net,optimizer,train_loader=accelerator.prepare(net,optimizer,train_loader)

    net.train()
    for ep in range(1, EPOCHS + 1):
        train_loss = correct = total = 0

        for idx, (inputs, targets) in enumerate(train_loader):
            outputs = net(inputs)

            # ********************** loss plan 1 **********************
            # loss = criterion(outputs, targets)
            # ********************** loss plan 1 **********************

            # ********************** loss plan 2 **********************
            out_gather=accelerator.gather(outputs)
            tar_gather=accelerator.gather(targets)
            loss = criterion(out_gather, tar_gather)
            # ********************** loss plan 2 **********************

            optimizer.zero_grad()
            accelerator.backward(loss)
            optimizer.step()

            train_loss += loss.item()
            total+=targets.size(0)
            correct += torch.eq(outputs.argmax(dim=1), targets).sum().item()

            print(
                "   == step: [{:3}/{}] [{}/{}] | loss: {:.3f} | acc: {:6.3f}%".format(
                    idx + 1,
                    len(train_loader),
                    ep,
                    EPOCHS,
                    train_loss / (idx + 1),
                    100.0 * correct / total,
                )
            )

I'm wondering where I'm going wrong with my code, or how I should use accelerator.gather correctly.

Thanks a lot.

sgugger commented 1 year ago

The problem is that doing the gather is not compatible with gradient propagation: it clones the tensors, so gradients don't flow backward through it. You can compute the loss on each process without the gather; the gradients will be averaged at the end of the backward pass.

JWargrave commented 1 year ago

@sgugger Thanks for your reply.

In contrastive learning, InfoNCE loss is often used as a loss function:

[screenshot of the InfoNCE loss formula]
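
In case the screenshot does not render, the standard form (per anchor $z_i$ with its positive $z_i^{+}$, temperature $\tau$, and the denominator summing over all $N$ samples in the batch) is roughly:

$$\mathcal{L}_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_i^{+}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$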

So simply averaging the gradients is not equivalent to increasing the effective batch size for contrastive learning, because the loss for each sample depends on every other sample in the global batch, not just on the samples that live on its own GPU.

I would like to know how to implement a gather function that preserves the gradient.

muellerzr commented 1 year ago

Quick Google sleuthing turned up the link below for you to try, @JWargrave, though I do think this may be better suited as a discussion post on the forums: https://discuss.huggingface.co/c/accelerate/18

https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8
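
In short, the trick described in that article (and used by several distributed contrastive-learning codebases) is to run a regular all_gather and then splice the local, gradient-carrying tensor back into its own slot, so gradients still flow for the local shard. A minimal, untested sketch along those lines (the helper name is mine, not part of accelerate's API), assuming torch.distributed is already initialized and all ranks hold tensors of equal shape:

import torch
import torch.distributed as dist

def gather_with_grad(local_tensor):
    # plain all_gather does not track gradients, so gather first...
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_tensor) for _ in range(world_size)]
    dist.all_gather(gathered, local_tensor)
    # ...then put the differentiable local copy back into its own slot
    gathered[dist.get_rank()] = local_tensor
    return torch.cat(gathered, dim=0)

One caveat the article discusses: with this trick only the local shard receives gradients on each rank, so gradient contributions from other ranks' loss terms are dropped (gradient averaging across ranks partially compensates). If I remember correctly, recent PyTorch versions also ship torch.distributed.nn.functional.all_gather, which implements a fully differentiable all-gather.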

JWargrave commented 1 year ago

@muellerzr Thanks a lot. I have also posted this as a discussion at https://discuss.huggingface.co/t/question-bug-about-accelerator-gather-how-to-use-accelerate-accelerator-gather-for-contrastive-learning/33177?u=jwargrave.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.