kuangliu / pytorch-cifar

95.47% on CIFAR10 with PyTorch
MIT License

Accuracy of Resnet50 is much higher than reported! #45

Open Erotemic opened 6 years ago

Erotemic commented 6 years ago

EDIT: I originally titled this issue "What epoch are reported results from?", but after further results came to light I renamed it to: "Accuracy of Resnet50 is much higher than reported!".

I've been reproducing some of these experiments and my output numbers don't exactly line up. At the moment I'm assuming it's due to different random number seeds for the Kaiming Normal initialization.

Is the accuracy you report always the accuracy of the test set on the 350th epoch? Or are you reporting the accuracy of the best epoch?

In my reproduction of the DPN92 experiment, my measured accuracy at the last epoch is 94.92%, but the highest overall accuracy was 95.10% at epoch 275.

Surprisingly, when I ran the Resnet50 example I got an accuracy of 95.72%! (but this is likely some issue on my end) (Edit: Actually it doesn't seem to be; see below)
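
The difference between "last epoch" and "best epoch" reporting can be made explicit in a few lines. This is a minimal sketch, assuming a list of per-epoch test accuracies is available; the function name `summarize_run` and the sample values are illustrative and not taken from any run in this thread.

```python
def summarize_run(test_acc_per_epoch):
    """Return (last_epoch_acc, best_acc, best_epoch) for one training run."""
    if not test_acc_per_epoch:
        raise ValueError("need at least one epoch of results")
    last_acc = test_acc_per_epoch[-1]
    # Index of the epoch with the highest test accuracy (first one on ties).
    best_epoch = max(range(len(test_acc_per_epoch)),
                     key=test_acc_per_epoch.__getitem__)
    return last_acc, test_acc_per_epoch[best_epoch], best_epoch


# Hypothetical accuracies for a 5-epoch run (illustrative values only).
accs = [91.2, 94.8, 95.1, 94.9, 94.92]
last_acc, best_acc, best_epoch = summarize_run(accs)
print(f"last={last_acc:.2f}%  best={best_acc:.2f}% (epoch {best_epoch})")
```

Note that picking the best epoch by test accuracy uses the test set for model selection, so the "best" number is an optimistic estimate compared with the last-epoch number.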

Erotemic commented 6 years ago

Even more surprising is that the resnet50 result is reproducible:

I get the following error rates for densenet: 5.33%, dpn92: 5.08%, resnet50: 4.46%. Subtract these from 100 to get the accuracy.

[Image: screenshot from 2018-06-13 09-25-51]

Erotemic commented 6 years ago

I've gone as far as to reproduce the Resnet50 results using (close to) the original training scripts. I only modified them to add logging, and so they used the same learning rate schedule as reported here without manually changing the learning rate.

Using essentially the original code I achieve an accuracy of 95.370% with ResNet50. AFAIK, this is a state-of-the-art result on CIFAR-10 for a single network trained from scratch with basic data augmentation. I'm interested in digging into this a bit further. I find it very strange (but not inconceivable) that a simple Resnet50 architecture outperforms DPN92. Why is this the case now, but not before? Did kuangliu make an error in measuring accuracy? Perhaps the manual learning rate schedule was different than what was reported? Did something in torch change? Before digging in too much, it would be useful if someone else can reproduce my findings.

This is the diff between master and the code I used. Most of it is boilerplate junk. The only important bit is at the end, which defines the non-manual learning rate schedule.

diff --git a/main.py b/main.py
index 26a4e98..44be9b7 100644
--- a/main.py
+++ b/main.py
@@ -15,14 +15,19 @@ import argparse

 from models import *
 from utils import progress_bar
+import ubelt as ub

+# parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training')
+# parser.add_argument('--lr', default=0.1, type=float, help='learning rate')
+# parser.add_argument('--resume', '-r', action='store_true', help='resume from checkpoint')
+# args = parser.parse_args()

-parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training')
-parser.add_argument('--lr', default=0.1, type=float, help='learning rate')
-parser.add_argument('--resume', '-r', action='store_true', help='resume from checkpoint')
-args = parser.parse_args()
+# device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+model = ub.argval('--model')
+gpu_num = int(ub.argval('--gpu', default=0))
+device = torch.device('cuda', gpu_num)

-device = 'cuda' if torch.cuda.is_available() else 'cpu'
 best_acc = 0  # best test accuracy
 start_epoch = 0  # start from epoch 0 or last checkpoint epoch

@@ -51,7 +56,18 @@ classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship'
 # Model
 print('==> Building model..')
 # net = VGG('VGG19')
-net = ResNet18()
+# net = ResNet18()
+
+model = ub.argval('--model')
+
+if model == 'resnet50':
+    net = ResNet50()
+elif model == 'dpn92':
+    net = DPN92()
+elif model == 'densenet121':
+    net = DenseNet121()
+else:
+    raise KeyError(model)
 # net = PreActResNet18()
 # net = GoogLeNet()
 # net = DenseNet121()
@@ -66,17 +82,21 @@ if device == 'cuda':
     net = torch.nn.DataParallel(net)
     cudnn.benchmark = True

-if args.resume:
-    # Load checkpoint.
-    print('==> Resuming from checkpoint..')
-    assert os.path.isdir('checkpoint'), 'Error: no checkpoint directory found!'
-    checkpoint = torch.load('./checkpoint/ckpt.t7')
-    net.load_state_dict(checkpoint['net'])
-    best_acc = checkpoint['acc']
-    start_epoch = checkpoint['epoch']
+# if args.resume:
+#     # Load checkpoint.
+#     print('==> Resuming from checkpoint..')
+#     assert os.path.isdir('checkpoint'), 'Error: no checkpoint directory found!'
+#     checkpoint = torch.load('./checkpoint/ckpt.t7')
+#     net.load_state_dict(checkpoint['net'])
+#     best_acc = checkpoint['acc']
+#     start_epoch = checkpoint['epoch']

 criterion = nn.CrossEntropyLoss()
-optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=0.9, weight_decay=5e-4)
+# optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=0.9, weight_decay=5e-4)
+optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
+
+logfile = open('./logfile_{}.txt'.format(model), 'a')
+

 # Training
 def train(epoch):
@@ -98,8 +118,10 @@ def train(epoch):
         total += targets.size(0)
         correct += predicted.eq(targets).sum().item()

-        progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
-            % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
+        msg = 'Loss: %.3f | Acc: %.3f%% (%d/%d)' % (train_loss/(batch_idx+1), 100.*correct/total, correct, total)
+        progress_bar(batch_idx, len(trainloader), msg)
+    logfile.write('Train epoch {}. {}'.format(epoch, msg))
+

 def test(epoch):
     global best_acc
@@ -118,8 +140,9 @@ def test(epoch):
             total += targets.size(0)
             correct += predicted.eq(targets).sum().item()

-            progress_bar(batch_idx, len(testloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
-                % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
+            msg = 'Loss: %.3f | Acc: %.3f%% (%d/%d)' % (test_loss/(batch_idx+1), 100.*correct/total, correct, total)
+            progress_bar(batch_idx, len(testloader), msg)
+        logfile.write('Test epoch {}. {}'.format(epoch, msg))

     # Save checkpoint.
     acc = 100.*correct/total
@@ -136,6 +159,20 @@ def test(epoch):
         best_acc = acc

-for epoch in range(start_epoch, start_epoch+200):
-    train(epoch)
-    test(epoch)
+schedule = [
+    (0, 150, .1),
+    (150, 250, .01),
+    (250, 350, .001),
+]
+
+for start, end, lr in schedule:
+
+    # Set learning rate
+    for param_group in optimizer.param_groups:
+        param_group['lr'] = lr
+
+    # Run for awhile
+    for epoch in range(start, end):
+        train(epoch)
+        if epoch > 200:
+            test(epoch)
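
The schedule at the end of the diff is a piecewise-constant learning rate over 350 epochs. As a hedged sketch, the same `(start, end, lr)` tuples can be turned into a lookup that is easy to check in isolation; the helper name `lr_for_epoch` is hypothetical and not part of the repo.

```python
# Same (start_epoch, end_epoch, lr) tuples as in the diff above.
SCHEDULE = [
    (0, 150, 0.1),
    (150, 250, 0.01),
    (250, 350, 0.001),
]


def lr_for_epoch(epoch, schedule=SCHEDULE):
    """Return the learning rate used during `epoch` (end bounds are exclusive)."""
    for start, end, lr in schedule:
        if start <= epoch < end:
            return lr
    raise ValueError(f"epoch {epoch} is outside the schedule")


# Print the learning rate on both sides of each boundary.
for epoch in (0, 149, 150, 249, 250, 349):
    print(epoch, lr_for_epoch(epoch))
```

Note that the loop in the diff only calls `test(epoch)` when `epoch > 200`, so no test accuracy is logged for the first 201 epochs.
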
Erotemic commented 6 years ago

I've finished runs of kuangliu's code for ResNet50, DenseNet121, and DPN92.

I did recently find a bug in netharn that would cause the learning rate schedule to be slightly off, which I assume accounts for the discrepancy between my netharn numbers and my reruns of kuangliu's scripts. However, the discrepancy between my runs of ResNet50 and kuangliu's is still very real.

Here is my results table:

          model |  kuangliu  | rerun-kuangliu  |  netharn |
    -------------------------------------------------------
    ResNet50    |    93.62%  |         95.370% |  95.72%  |  <- how did that happen?
    DenseNet121 |    95.04%  |         95.420% |  94.47%  |
    DPN92       |    95.16%  |         95.410% |  94.92%  |

The first column is kuangliu's reported accuracy, the second column is me running kuangliu's code, and the final column is using my own training harness (handles logging and whatnot) called netharn.

EDIT: I recently learned that CuDNN is non-deterministic by default, so that can also account for some minor discrepancy between results, but I don't think it's enough to explain away these findings.
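
For anyone who wants to reduce this source of run-to-run variation, here is a minimal sketch. It assumes PyTorch may or may not be installed, so the import is guarded; the function name `seed_everything` is hypothetical. The two `torch.backends.cudnn` flags are real PyTorch settings, and the repo's `main.py` sets `cudnn.benchmark = True`, which lets cuDNN pick algorithms freely.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python's RNG and, if PyTorch is available, torch and cuDNN settings."""
    random.seed(seed)
    try:
        import torch  # optional: only configured when PyTorch is installed
    except ImportError:
        return
    torch.manual_seed(seed)  # seeds the CPU (and CUDA) generators
    # Ask cuDNN for deterministic algorithms and disable its autotuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)  # the same seed gives the same draw
```

These settings can slow training down, and they do not guarantee identical results across different GPUs or library versions.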

AlphaQi commented 6 years ago

I trained with the author's code using a 20-layer ResNet and, to my shock, reached 95.04% accuracy, and that is without using any other tricks.

willyqin commented 6 years ago

Accuracy of ResNet18 is much higher than reported too.

willyqin commented 6 years ago

I trained with the author's code using a 20-layer ResNet and, to my shock, reached 95.04% accuracy, and that is without using any other tricks.

My ResNet18 run also exceeded 95%.

jingege315 commented 5 years ago

I trained with the author's code using a 20-layer ResNet and, to my shock, reached 95.04% accuracy, and that is without using any other tricks.

My ResNet18 run also exceeded 95%.

Me too. The val accuracy is 94.1% using ResNet18. But I get a much worse result (87%) using my own code with the same structure in torch.

willyqin commented 5 years ago

The ResNet structure in this code is not the standard structure. More channels are used.

Juicechen95 commented 5 years ago

In Torch7, the Resnet structure only has 16 channels in the first layer, but this has 64 channels.
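
To get a feel for what the wider layers cost, here is a rough sketch that counts the weights in a single convolution. It assumes 3x3 kernels with no bias term; the helper name `conv3x3_weights` is hypothetical, and this is not a full parameter count for either network.

```python
def conv3x3_weights(in_channels, out_channels):
    """Number of weights in a 3x3 convolution without bias."""
    return 3 * 3 * in_channels * out_channels


# First layer: 3 input (RGB) channels to 16 or 64 output channels.
print(conv3x3_weights(3, 16), conv3x3_weights(3, 64))

# A later layer that keeps the width: cost grows with the square of the width.
print(conv3x3_weights(16, 16), conv3x3_weights(64, 64))
```

Going from 16 to 64 channels multiplies the weights of a width-preserving 3x3 layer by 16, so the two "ResNets" being compared differ greatly in capacity.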

michaelrzhang commented 5 years ago

Yeah, you can get 95% with resnet18.

xxllp commented 5 years ago

I thought I was seeing a ghost. Is this accuracy the mean of the per-class accuracies, or the overall accuracy?
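
The training script in this thread computes `100.*correct/total`, which is the overall accuracy. The sketch below contrasts the two definitions; the function name and the counts are illustrative, not measured results. The CIFAR-10 test set has 1000 images per class, so on that set the two numbers coincide.

```python
def overall_and_mean_class_accuracy(correct_per_class, total_per_class):
    """Return (overall accuracy, mean of per-class accuracies), both in percent."""
    overall = 100.0 * sum(correct_per_class) / sum(total_per_class)
    per_class = [100.0 * c / t for c, t in zip(correct_per_class, total_per_class)]
    mean_class = sum(per_class) / len(per_class)
    return overall, mean_class


# Balanced case (like the CIFAR-10 test set): the two metrics agree.
balanced_correct = [950, 900, 980, 940, 960, 930, 970, 965, 955, 990]
print(overall_and_mean_class_accuracy(balanced_correct, [1000] * 10))

# Imbalanced, hypothetical case: the two metrics can differ a lot.
print(overall_and_mean_class_accuracy([890, 60], [900, 100]))
```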

wangyongjie-ntu commented 5 years ago

But in fact, I just cloned this repo and ran the code with "python main.py". I didn't modify anything in the original code. The accuracy is only about 80~85%. Quite confusing. Can anybody share their training details?

Erotemic commented 5 years ago

Odd, perhaps it's important to have the particular checkout I used (bf78d3b8b358c4be7a25f9f9438c842d837801fd) and maybe apply the diff that I specify in an earlier post?

Also note that cudnn is non-deterministic by default, so there could be issues there. So many little things to consider with these deep models that can cause large differences!

triangleCZH commented 5 years ago

My observation: he doesn't have the first conv with kernel size 7, or the max-pooling layer that follows it. In total the input is downsampled only three times, 32x32 -> 4x4, while for a normal ResNet it would be 32x32 -> 1x1. This increases FLOPs, reduces the parameter count a little, and improves the accuracy a lot.
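
The spatial sizes quoted above can be checked with the standard convolution output-size formula. This is a sketch under stated assumptions: each downsampling stage is modeled as a 3x3 convolution with padding 1, the stage strides are (1, 2, 2, 2), and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output width of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def final_map_size(size, stem, stage_strides=(1, 2, 2, 2)):
    """Push an input width through the stem layers and four residual stages."""
    for kernel, stride, padding in stem:
        size = conv_out(size, kernel, stride, padding)
    for stride in stage_strides:
        # Each stage is modeled by its strided 3x3 conv with padding 1.
        size = conv_out(size, 3, stride, 1)
    return size


# (kernel, stride, padding) for each stem layer.
IMAGENET_STEM = [(7, 2, 3), (3, 2, 1)]  # 7x7 conv stride 2, then 3x3 max-pool stride 2
CIFAR_STEM = [(3, 1, 1)]                # single 3x3 conv stride 1, no pooling

print(final_map_size(32, IMAGENET_STEM))  # ImageNet-style stem on a 32x32 input
print(final_map_size(32, CIFAR_STEM))     # stem as described for this repo
```

With a 32x32 input, the ImageNet-style stem leaves a 1x1 map before the final pooling, while the stride-1 stem leaves 4x4, which matches the numbers in the comment.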

xuw080 commented 5 years ago

But in fact, I just cloned this repo and ran the code with "python main.py". I didn't modify anything in the original code. The accuracy is only about 80~85%. Quite confusing. Can anybody share their training details?

You need to manually change the learning rate around epoch 80 and epoch 100; then you can get the reported results.

jizongFox commented 5 years ago

Accuracy of ResNet18 is much higher than reported too.

I had 95.5% for resnet 18

MaureenZOU commented 4 years ago

I also got 95% with ResNet18. This is extremely strange. The number is far from what I find in Kaiming's paper and in other people's papers. I also checked this implementation, https://github.com/weiaicunzai/pytorch-cifar100, which gives a top-1 accuracy of 94.91 with ResNet18, similar to this version. I just find it strange that these numbers are not in the paper.

fruffy commented 4 years ago

Can anyone that achieves 95% accuracy publish their exact configuration? Maybe just the raw code. That would be very helpful. I am unfortunately unable to reproduce these numbers, even with the methods named in this thread.

ZHUANGHP commented 4 years ago

I had over 95% with the model of this post with initial lr 0.1, momentum 0.9, weight decay 5e-4, and the lr divided by 10 at epochs 150, 225, and 270, for a total of 300 epochs.

MaureenZOU commented 4 years ago

Me too.....

fruffy commented 4 years ago

Okay, I was able to achieve 95% now, thanks! Not sure what is going on...

liuyao12 commented 4 years ago

Yeah, and it doesn't actually need 300 epochs; I just downgrade lr when it plateaus. It could very well be that 95% is the limit. Too bad I was hoping to test on an idea, which shows slightly faster improvement at the early stage but arrives at about the same final acc. If anyone is interested it is an easy add-on to ResNet: https://github.com/liuyao12/pytorch-cifar/blob/master/cifar10_with_PDE.ipynb

ZHUANGHP commented 4 years ago

@liuyao12 agreed. I think 200 epochs or less should suffice. But letting the learning curve plateau a bit longer seems to benefit the validation accuracy. This is from my experiments; it is not a very strong claim, though.

askerlee commented 4 years ago

@triangleCZH Thanks for pointing that out. We can also resize the input images to 64x64, and remove the first pooling layer. The final feature maps will also be 4x4, and 94%+ accuracy is easily achieved. 🍻

HtutLynn commented 4 years ago

I had over 95% with the model of this post with initial lr 0.1, momentum 0.9, weight decay 5e-4, and the lr divided by 10 at epochs 150, 225, and 270, for a total of 300 epochs.

I used a slightly different strategy from kuangliu's training code. Instead of manually tuning the learning rate, I used MultiStepLR to change the learning rate when training reaches those epochs. Still, the best test accuracy I got for ResNet18 was around 88%. Is it different from manually changing the learning rate?

ZHUANGHP commented 4 years ago

I used a slightly different strategy from kuangliu's training code. Instead of manually tuning the learning rate, I used MultiStepLR to change the learning rate when training reaches those epochs. Still, the best test accuracy I got for ResNet18 was around 88%. Is it different from manually changing the learning rate?

If the strategy is identical, there must be something wrong in the code.

HtutLynn commented 4 years ago

I used a slightly different strategy from kuangliu's training code. Instead of manually tuning the learning rate, I used MultiStepLR to change the learning rate when training reaches those epochs. Still, the best test accuracy I got for ResNet18 was around 88%. Is it different from manually changing the learning rate?

If the strategy is identical, there must be something wrong in the code.

Thanks for the reply. Here's my code for training cifar-10. I think it is pretty much the same aside from using lr_scheduler.MultiStepLR for learning rate tuning instead of manual tuning.

Training code:

```python
# Set the model into train mode
model.train()

train_loss = 0
correct = 0
total = 0
datacount = len(dataloader)

for batch_idx, (train_batch, labels_batch) in enumerate(dataloader):

    # move the data onto the device
    train_batch, labels_batch = train_batch.to(device), labels_batch.to(device)

    # # convert to torch Variables
    # train_batch, labels_batch = Variable(train_batch), Variable(labels_batch)

    # clear the previous grad
    optimizer.zero_grad()

    # compute model outputs and loss
    outputs = model(train_batch)
    loss = loss_fn(outputs, labels_batch)
    loss.backward()

    # after computing gradients based on current batch loss,
    # apply them to parameters
    optimizer.step()

    train_loss += loss.item()
    _, predicted = outputs.max(1)
    total += labels_batch.size(0)
    correct += predicted.eq(labels_batch).sum().item()
    # get learning rate
    current_lr = get_lr(optimizer=optimizer)

    # write to tensorboard
    writer.add_scalar('train/loss', train_loss/(batch_idx+1), (datacount * (epoch+1)) + (batch_idx+1))
    writer.add_scalar('train/accuracy', 100.*correct/total, (datacount * (epoch+1)) + (batch_idx+1))
    writer.add_scalar('Learning rate', current_lr)

    progress_bar(batch_idx, len(dataloader), 'Train Loss: %.3f | Train Acc: %.3f%% (%d/%d)'
                 % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
```

Testing code:

```python
model.eval()

test_loss = 0
correct = 0
total = 0
datacount = len(dataloader)

# check global variable `best_accuracy`
global best_accuracy

with torch.no_grad():
    for batch_idx, (test_batch, labels_batch) in enumerate(dataloader):

        # move the data onto device
        test_batch, labels_batch = test_batch.to(device), labels_batch.to(device)

        # compute the model output
        outputs = model(test_batch)
        loss = loss_fn(outputs, labels_batch)

        test_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels_batch.size(0)
        correct += predicted.eq(labels_batch).sum().item()

        # log the test_loss
        writer.add_scalar('test/loss', test_loss/(batch_idx+1), (datacount * (epoch+1)) + (batch_idx+1))
        writer.add_scalar('test/accuracy', 100.*correct/total, (datacount * (epoch+1)) + (batch_idx+1))

        progress_bar(batch_idx, len(dataloader), 'Test Loss: %.3f | Test Acc: %.3f%% (%d/%d)'
                     % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
```

Main script:

```python
# Imports below are assumed; the original post did not show them.
# SummaryWriter, summary, ResNet50, and train_and_evaluate come from code not shown here.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.optim.lr_scheduler import MultiStepLR

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# The dataset that we are going to train the network on is the CIFAR-10 dataset

trainset = torchvision.datasets.CIFAR10(root='/home/htut/Desktop/Knowledge_Distillation_Pytorch/datasets', train=True,
                                        download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                        shuffle=True, num_workers=4)

testset = torchvision.datasets.CIFAR10(root="/home/htut/Desktop/Knowledge_Distillation_Pytorch/datasets", train=False,
                                        download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=100,
                                        shuffle=False, num_workers=4)

classes = ('plane', 'car', 'bird', 'cat', 'deer',
            'dog', 'frog', 'horse', 'ship', 'truck')

# setup device for training
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# setup Tensorboard file path
writer = SummaryWriter('experiments/teachers/resnet/resnet50')

# Setup best accuracy for comparing and model checkpoints
best_accuracy = 0.0

# Configure the Network

# You can swap out any kind of architecture from /models in here
model_fn = ResNet50()
model_fn = model_fn.to(device)

# print summary of model
summary(model_fn, (3, 32, 32))
# Setup the loss function
criterion = nn.CrossEntropyLoss()

# Setup the optimizer method for all the parameters
optimizer_fn = optim.SGD(model_fn.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# setup learning rate scheduler
scheduler = MultiStepLR(optimizer_fn, milestones=[150, 225, 270], gamma=0.1)

train_and_evaluate(model=model_fn, train_dataloader=trainloader, test_dataloader=testloader,
                    optimizer=optimizer_fn, scheduler=scheduler, loss_fn=criterion, total_epochs=300)

writer.close()
```

ZHUANGHP commented 4 years ago

Have you used the scheduler.step()? Your lr will not be updated if the scheduler is not moving forward w.r.t. your epoch.
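
To illustrate why the missing call matters: in PyTorch, a scheduler such as `MultiStepLR` only changes the learning rate when `scheduler.step()` is called, typically once per epoch. The sketch below uses a small hypothetical stand-in class (not a PyTorch class) with the same milestone rule, so it runs without any dependencies.

```python
from bisect import bisect_right


class StepCounterSchedule:
    """Hypothetical stand-in for a milestone scheduler; NOT a PyTorch class."""

    def __init__(self, base_lr, milestones, gamma=0.1):
        self.base_lr = base_lr
        self.milestones = sorted(milestones)
        self.gamma = gamma
        self.epoch = 0  # only advances when step() is called

    @property
    def lr(self):
        # Multiply by gamma once for every milestone already reached.
        passed = bisect_right(self.milestones, self.epoch)
        return self.base_lr * self.gamma ** passed

    def step(self):
        self.epoch += 1


def lr_at_end(total_epochs, call_step):
    """Learning rate after `total_epochs`, with or without stepping the schedule."""
    sched = StepCounterSchedule(0.1, [150, 225, 270])
    for _ in range(total_epochs):
        # ... one epoch of training would run here at sched.lr ...
        if call_step:
            sched.step()
    return sched.lr


print(lr_at_end(300, call_step=True))   # decayed three times
print(lr_at_end(300, call_step=False))  # never leaves the initial value
```

Without the per-epoch `step()` call the learning rate stays at its initial 0.1 for the whole run, which would be enough to explain a plateau well below 95%.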

HtutLynn commented 4 years ago

Have you used the scheduler.step()? Your lr will not be updated if the scheduler is not moving forward w.r.t. your epoch.

Actually, I just found the bug. I was using a ResNet architecture that I had modified myself instead of the ResNet architecture from this repo. My bad. After correcting this, the results are as good as those mentioned in this thread. I was able to reach 95.46% accuracy with resnet18 using your learning rate strategy with MultiStepLR. MultiStepLR is pretty convenient if you don't want to tune the learning rate manually. Thanks.

vobecant commented 4 years ago

Actually, I just found the bug. I was using a ResNet architecture that I had modified myself instead of the ResNet architecture from this repo. My bad. After correcting this, the results are as good as those mentioned in this thread. I was able to reach 95.46% accuracy with resnet18 using your learning rate strategy with MultiStepLR. MultiStepLR is pretty convenient if you don't want to tune the learning rate manually. Thanks.

Hi @HtutLynn, would you mind sharing the code to train the network? I am not able to reproduce the results.

Thanks!

qysnn commented 3 years ago

So has anyone figured out the exact reason this implementation gets much better results than those reported in the paper?

Kaffaljidhmah2 commented 3 years ago

So has anyone figured out the exact reason this implementation gets much better results than those reported in the paper?

According to the previous discussion it seems that this repo uses non-standard network structures that have more channels...

curemio721 commented 3 years ago

So has anyone figured out the exact reason this implementation gets much better results than those reported in the paper?

According to the previous discussion it seems that this repo uses non-standard network structures that have more channels...

I also tried the training and testing code with my own model of only 5 CNN layers. Strangely enough, after stepping the lr through 0.1, 0.01, 0.001, etc., this 5-layer model can also reach 92%+ accuracy! Therefore, I think the result more likely comes from something wrong in this code than from the merit of the non-standard structures.

Syzygianinfern0 commented 3 years ago

Therefore, I think the result more likely comes from something wrong in this code than from the merit of the non-standard structures.

@curemio721 I evaluated a ResNet18 model trained using code from this repo (which gave me 95.5% acc on CIFAR10 test) using the testing code from https://github.com/akamaster/pytorch_resnet_cifar10. This repo by akamaster reports itself to be accurate to the paper's implementation and I was also able to reproduce the results from its README.

To my surprise, the results on the model trained from this repo were still 95.5% when testing with code from akamaster's repo.

ktn222 commented 7 months ago

So has anyone figured out the exact reason this implementation gets much better results than those reported in the paper?

The training results also depend on the randomly initialized parameters. If the initialization starts from the "right" parameters, gradient descent can converge to a better minimum.