lukemelas / EfficientNet-PyTorch

A PyTorch implementation of EfficientNet
Apache License 2.0

Transfer Learning not working #189

Open gost-sniper opened 4 years ago

gost-sniper commented 4 years ago

I'm trying to use the pretrained EfficientNet-B1 model to train on Places365, but training plateaus at ~25% accuracy. I used the ImageNet auto-augment policy found here, with the code below.

Dataloaders:


def _get_train_data_loader(batch_size, training_dir, is_distributed, **kwargs):
    logger.info(str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ")) + "Get train data loader")
    base_dir = '/dev/shm/places365_standard/'
    defaults.device = torch.device('cuda')

    dataset = datasets.ImageFolder(base_dir+"train", transform=transforms.Compose(
                        [transforms.Resize(224, interpolation=PIL.Image.BICUBIC), 
                         ImageNetPolicy(), 
                         transforms.ToTensor(), 
                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))]))

    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset) 
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size, pin_memory=True, num_workers=8, sampler=train_sampler)

def _get_test_data_loader(test_batch_size, training_dir, **kwargs):
    logger.info(str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ")) + "Get test data loader")
    base_dir = '/dev/shm/places365_standard/'
    defaults.device = torch.device('cuda')

    dataset = datasets.ImageFolder(base_dir+"val", transform=transforms.Compose(
                        [transforms.Resize(224, interpolation=PIL.Image.BICUBIC), 
                         transforms.ToTensor(),
                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
                         ]))
    return torch.utils.data.DataLoader(dataset, batch_size=test_batch_size, num_workers=8, shuffle=True, pin_memory=True)

Training code:

    # Load the pretrained B1 model with a new 365-class classifier head
    model = EfficientNet.from_pretrained('efficientnet-b1', num_classes=365).to(device)

    # Freeze everything except the final '_fc' layer
    for n, p in model.named_parameters():
        if '_fc' not in n:
            p.requires_grad = False

    model = torch.nn.parallel.DistributedDataParallel(model)

    optimizer = optim.RMSprop(model.parameters(), lr=3e-2, alpha=0.99, 
                                                  eps=1e-08, weight_decay=1e-5, momentum=0.9)
    # Multiply the learning rate by 0.98739 after every epoch
    lmbda = lambda epoch: 0.98739
    scheduler = optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lmbda)
    criterion = nn.CrossEntropyLoss()

    best_loss = 10000000

    for epoch in range(1, args.epochs + 1):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            if is_distributed and not use_cuda:
                # average gradients manually for multi-machine cpu case only
                _average_gradients(model)
            optimizer.step()
            if batch_idx % (len(train_loader)-1) == 0 and batch_idx != 0:
                log = 'Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.sampler),
                    100. * batch_idx / len(train_loader), loss.item())
                logger.info(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ") + log)

        test_loss = test(model, test_loader, device)
        scheduler.step()
        if test_loss < best_loss:
            logger.info(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ") + "Best loss : Saving")
            save_model(model, args.model_dir)
            best_loss = test_loss

Test function:


def test(model, test_loader, device):
    model.eval()
    test_loss = 0
    correct = 0
    # size_average=False is deprecated; reduction='sum' sums the loss over the batch
    crit = nn.CrossEntropyLoss(reduction='sum')
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
            output = model(data)
            test_loss += crit(output, target).item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    logger.info(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ") + 'Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return test_loss

I don't know what I'm doing wrong. Any help?

ziliwang commented 4 years ago

Maybe you froze the '_fc' layer?

lukemelas commented 4 years ago

I'm not sure about this issue. In general, EfficientNets are very hard to train. For future reference, make sure you can:

Then return to trying to train EfficientNet on your full dataset.

gost-sniper commented 4 years ago

Maybe you froze the '_fc' layer?

No, I froze all but the '_fc' layer
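
To double-check, here is a minimal sanity snippet (using the repo's from_pretrained API and the same freezing loop as in my training code above) that prints which parameters are still trainable:

from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_pretrained('efficientnet-b1', num_classes=365)

# Freeze everything except the classifier head, as in the training code above
for n, p in model.named_parameters():
    if '_fc' not in n:
        p.requires_grad = False

# If the freeze worked, only '_fc.weight' and '_fc.bias' should be listed
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(len(trainable), 'trainable tensors:', trainable)

With this loop, only the two '_fc' tensors remain trainable.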

datduong commented 4 years ago

Same here. I used a different dataset and also see accuracy stuck around 25%.

teraoka-hiroshi commented 4 years ago

I'm also training EfficientNet-B2 on the Places365-Standard dataset. I'm training from the _swish of the last block (_blocks.22) and freezing the rest. I'm currently at about 40% top-1 accuracy on the validation data. Any good advice on this issue?

gost-sniper commented 3 years ago

@lukemelas's advice is very helpful; try it out.

teraoka-hiroshi commented 3 years ago

@gost-sniper @lukemelas In B2, the frozen layers were blocks 20, 21, 22 and the FC layer. As a result, accuracy increased to nearly 55% on the Places365-Standard data. Accuracy was further improved by decaying the learning rate at epochs 30, 60, and 90.
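
For reference, the decay at epochs 30, 60, and 90 can be written with MultiStepLR; a minimal sketch (the SGD optimizer, base learning rate of 0.01, and 100-epoch loop are placeholders here, not my exact settings):

from efficientnet_pytorch import EfficientNet
import torch.optim as optim

model = EfficientNet.from_pretrained('efficientnet-b2', num_classes=365)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Multiply the learning rate by 0.1 at epochs 30, 60 and 90
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    # ... one epoch of training goes here ...
    scheduler.step()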

gost-sniper commented 3 years ago

@aporo4000 Can you show the code you used for the training phase?

ANYMS-A commented 3 years ago

@gost-sniper @lukemelas In B2, the frozen layers were blocks 20, 21, 22 and the FC layer. As a result, accuracy increased to nearly 55% on the Places365-Standard data. Accuracy was further improved by decaying the learning rate at epochs 30, 60, and 90.

Why would freezing the FC layer work? That doesn't seem to make sense.

teraoka-hiroshi commented 3 years ago

@crissallan I made a mistake in my earlier comment. In B2, everything except blocks 20, 21, 22 and the FC layer was frozen.

teraoka-hiroshi commented 3 years ago

@gost-sniper It's not the most efficient way, but we set up the frozen layers like this:

model = EfficientNet.from_pretrained(args.arch, advprop=args.advprop, num_classes=365)

# Freeze all parameters first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze blocks 20, 21, 22 and the final FC layer
for name, module in model.named_modules():
    if name == '_blocks.20' or \
       name == '_blocks.21' or \
       name == '_blocks.22' or \
       name == '_fc':
        for param in module.parameters():
            param.requires_grad = True
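
An optional refinement (just a suggestion on my side, not something the repo requires): hand only the still-trainable parameters to the optimizer, so frozen weights are not touched by weight decay or momentum:

import torch

# Collect only the parameters that still require gradients after the freeze above
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.01, momentum=0.9)
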
alancarlosml commented 2 years ago

Hi @gost-sniper, did you fix the problem?

Could you please share with me your training code? (alancarlosml@outlook.com)

I am facing problems with some code I wrote here.

Thank you!