A FATAL BUG IN torchvision.transforms WHEN USING ADVERSARIAL TRAINING. Transformations applied to tensors will decrease the performance compared with Transformations applied to PIL images then ToTensor.

GuanlinLee commented 2 years ago

🐛 Describe the bug

When data augmentation is applied to PIL images and ToTensor is used afterwards, the training code looks like this:

import torchvision
import torchvision.transforms as transforms
import torch
import torch.utils.data
from torch import nn
import datasets

transform=transforms.Compose([transforms.RandomCrop(32, padding=4),
                            transforms.RandomHorizontalFlip(),
                              transforms.ToTensor(),
                                   ])
transform_test=transforms.Compose([torchvision.transforms.Resize((32,32)),
                                   transforms.ToTensor(),
                                   ])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True,
                                               num_workers=0, pin_memory=True)

testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
test_loader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=0,
                             pin_memory=True)

n = resnet.resnet18(args.dataset).cuda()
optimizer = torch.optim.SGD(n.parameters(),momentum=args.momentum,
                            lr=learning_rate,weight_decay=wd)

def PGD(model, x, y, optimizer, args):
    model.eval()
    epsilon = args.eps
    num_steps = args.ns
    step_size = args.ss
    x_adv = x.detach() + torch.FloatTensor(*x.shape).uniform_(-epsilon, epsilon).cuda()
    for _ in range(num_steps):
        x_adv.requires_grad_()
        with torch.enable_grad():
            logits_adv = model(x_adv)
            loss = F.cross_entropy(logits_adv, y)
        grad = torch.autograd.grad(loss, [x_adv])[0]
        x_adv = x_adv.detach() + step_size * torch.sign(grad.detach())
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    model.train()
    x_adv = Variable(torch.clamp(x_adv, 0.0, 1.0), requires_grad=False)
    # zero gradient
    optimizer.zero_grad()
    # calculate robust loss
    logits = model(x_adv)
    loss = F.cross_entropy(logits, y)
    return logits, loss

for epoch in range(epochs):
    loadertrain = tqdm(train_loader, desc='{} E{:03d}'.format('train', epoch), ncols=0)
    epoch_loss = 0.0
    epoch_loss_clean = 0.0
    total=0.0
    clean_acc = 0.0
    adv_acc = 0.0
    for (input, target, index) in loadertrain:
        n.eval()
        x_train, y_train = input.cuda(), target.cuda()
        #print(x_train.size())
        #print(x_train.min(), x_train.max())
        #torchvision.utils.save_image(input, 'images_dir.png', nrow=64)
        #exit(0)
        y_pre = n(x_train)
        loss_clean = F.cross_entropy(y_pre, y_train)
        epoch_loss_clean += loss_clean.data.item()
        logits_adv, loss = PGD(n, x_train, y_train, optimizer, args)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.data.item()
        _, predicted = torch.max(y_pre.data, 1)
        _, predictedadv = torch.max(logits_adv.data, 1)
        total += y_train.size(0)
        clean_acc += predicted.eq(y_train.data).cuda().sum()
        adv_acc += predictedadv.eq(y_train.data).cuda().sum()
        fmt = '{:.4f}'.format
        loadertrain.set_postfix(loss=fmt(loss.data.item()),
                                acc_cl=fmt(clean_acc.item() / total * 100),
                                acc_adv=fmt(adv_acc.item() / total * 100))

When ToTensor is first used to convert the PIL images to tensors and data augmentation is then applied to the tensors, the training code looks like this:

import torchvision
import torchvision.transforms as transforms
import torch
import torch.utils.data
from torch import nn
import datasets

transform_test=transforms.Compose([torchvision.transforms.Resize((32,32)),
                                   transforms.ToTensor(),
                                   ])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_test)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True,
                                               num_workers=0, pin_memory=True)

testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
test_loader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=0,
                             pin_memory=True)
def data_aug(image):
    image = transforms.RandomCrop(32, padding=4).forward(image)
    image = transforms.RandomHorizontalFlip().forward(image)
    return image

n = resnet.resnet18(args.dataset).cuda()
optimizer = torch.optim.SGD(n.parameters(),momentum=args.momentum,
                            lr=learning_rate,weight_decay=wd)

def PGD(model, x, y, optimizer, args):
    model.eval()
    epsilon = args.eps
    num_steps = args.ns
    step_size = args.ss
    x_adv = x.detach() + torch.FloatTensor(*x.shape).uniform_(-epsilon, epsilon).cuda()
    for _ in range(num_steps):
        x_adv.requires_grad_()
        with torch.enable_grad():
            logits_adv = model(x_adv)
            loss = F.cross_entropy(logits_adv, y)
        grad = torch.autograd.grad(loss, [x_adv])[0]
        x_adv = x_adv.detach() + step_size * torch.sign(grad.detach())
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    model.train()
    x_adv = Variable(torch.clamp(x_adv, 0.0, 1.0), requires_grad=False)
    # zero gradient
    optimizer.zero_grad()
    # calculate robust loss
    logits = model(x_adv)
    loss = F.cross_entropy(logits, y)
    return logits, loss

for epoch in range(epochs):
    loadertrain = tqdm(train_loader, desc='{} E{:03d}'.format('train', epoch), ncols=0)
    epoch_loss = 0.0
    epoch_loss_clean = 0.0
    total=0.0
    clean_acc = 0.0
    adv_acc = 0.0
    for (input, target, index) in loadertrain:
        n.eval()
        x_train, y_train = data_aug(input).cuda(), target.cuda()
        #print(x_train.size())
        #print(x_train.min(), x_train.max())
        #torchvision.utils.save_image(input, 'images_dir.png', nrow=64)
        #exit(0)
        y_pre = n(x_train)
        loss_clean = F.cross_entropy(y_pre, y_train)
        epoch_loss_clean += loss_clean.data.item()
        logits_adv, loss = PGD(n, x_train, y_train, optimizer, args)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.data.item()
        _, predicted = torch.max(y_pre.data, 1)
        _, predictedadv = torch.max(logits_adv.data, 1)
        total += y_train.size(0)
        clean_acc += predicted.eq(y_train.data).cuda().sum()
        adv_acc += predictedadv.eq(y_train.data).cuda().sum()
        fmt = '{:.4f}'.format
        loadertrain.set_postfix(loss=fmt(loss.data.item()),
                                acc_cl=fmt(clean_acc.item() / total * 100),
                                acc_adv=fmt(adv_acc.item() / total * 100))

These two versions of the training code are expected to give similar results on the training set. However, I find that the second one causes a significant decrease in performance. Here are the training results:

For First Version:

train E000: 100% 391/391 [03:33<00:00,  1.83it/s, acc_adv=18.3220, acc_cl=23.9080, loss=2.0751]
train E001: 100% 391/391 [03:25<00:00,  1.90it/s, acc_adv=24.2700, acc_cl=33.4160, loss=2.0871]
train E002: 100% 391/391 [03:25<00:00,  1.90it/s, acc_adv=27.2680, acc_cl=38.3020, loss=1.9479]
train E003: 100% 391/391 [03:25<00:00,  1.90it/s, acc_adv=29.6860, acc_cl=41.8460, loss=1.7627]
train E004: 100% 391/391 [03:25<00:00,  1.90it/s, acc_adv=31.9920, acc_cl=45.5160, loss=1.8148]
train E005: 100% 391/391 [03:25<00:00,  1.91it/s, acc_adv=33.4660, acc_cl=48.8660, loss=1.7631]
train E006: 100% 391/391 [03:24<00:00,  1.91it/s, acc_adv=35.3880, acc_cl=52.4200, loss=1.9450]
train E007: 100% 391/391 [03:24<00:00,  1.91it/s, acc_adv=37.2760, acc_cl=55.7980, loss=1.7123]
train E008: 100% 391/391 [03:29<00:00,  1.86it/s, acc_adv=38.4040, acc_cl=58.1720, loss=1.6121]
train E009: 100% 391/391 [03:33<00:00,  1.83it/s, acc_adv=39.8820, acc_cl=60.1360, loss=1.5887]
train E010: 100% 391/391 [03:33<00:00,  1.83it/s, acc_adv=40.4740, acc_cl=61.7580, loss=1.5841]
train E011: 100% 391/391 [03:25<00:00,  1.90it/s, acc_adv=41.1380, acc_cl=62.9580, loss=1.6211]
train E012: 100% 391/391 [03:26<00:00,  1.89it/s, acc_adv=41.7220, acc_cl=64.0100, loss=1.7085]
train E013: 100% 391/391 [03:25<00:00,  1.90it/s, acc_adv=42.4960, acc_cl=64.8200, loss=1.4290]
train E014: 100% 391/391 [03:28<00:00,  1.87it/s, acc_adv=42.9800, acc_cl=66.0760, loss=1.6474]
train E015: 100% 391/391 [03:27<00:00,  1.88it/s, acc_adv=43.4940, acc_cl=66.2420, loss=1.4414]
train E016: 100% 391/391 [03:24<00:00,  1.91it/s, acc_adv=44.1820, acc_cl=67.0800, loss=1.4017]
train E017: 100% 391/391 [03:24<00:00,  1.91it/s, acc_adv=44.5940, acc_cl=67.4260, loss=1.3250]
train E018: 100% 391/391 [03:24<00:00,  1.91it/s, acc_adv=44.7940, acc_cl=68.2820, loss=1.3789]
train E019: 100% 391/391 [03:24<00:00,  1.91it/s, acc_adv=44.9340, acc_cl=69.2480, loss=1.7059]
train E020: 100% 391/391 [03:31<00:00,  1.85it/s, acc_adv=45.3140, acc_cl=68.8420, loss=1.5010]
train E021: 100% 391/391 [03:27<00:00,  1.89it/s, acc_adv=45.4080, acc_cl=69.4040, loss=1.5041]
train E022: 100% 391/391 [03:32<00:00,  1.84it/s, acc_adv=45.9200, acc_cl=69.9480, loss=1.6118]
train E023: 100% 391/391 [03:34<00:00,  1.83it/s, acc_adv=46.2520, acc_cl=70.2560, loss=1.3809]
train E024: 100% 391/391 [03:33<00:00,  1.84it/s, acc_adv=46.3560, acc_cl=70.7100, loss=1.5231]
train E025: 100% 391/391 [03:32<00:00,  1.84it/s, acc_adv=46.7160, acc_cl=70.7060, loss=1.3514]

For Second Version:

train E000: 100% 391/391 [03:39<00:00,  1.78it/s, acc_adv=19.2060, acc_cl=24.8320, loss=2.0385] 
train E001: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=25.3200, acc_cl=33.6160, loss=1.9477] 
train E002: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=28.0940, acc_cl=37.8420, loss=1.8984] 
train E003: 100% 391/391 [03:24<00:00,  1.91it/s, acc_adv=31.1120, acc_cl=42.1860, loss=1.8444] 
train E004: 100% 391/391 [03:25<00:00,  1.91it/s, acc_adv=33.0940, acc_cl=45.5480, loss=1.7989] 
train E005: 100% 391/391 [03:31<00:00,  1.85it/s, acc_adv=35.7460, acc_cl=47.8940, loss=1.6479] 
train E006: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=37.1300, acc_cl=51.5220, loss=1.7013] 
train E007: 100% 391/391 [03:23<00:00,  1.92it/s, acc_adv=38.9160, acc_cl=54.5040, loss=1.6578] 
train E008: 100% 391/391 [03:23<00:00,  1.92it/s, acc_adv=40.3340, acc_cl=57.2940, loss=1.5783] 
train E009: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=41.4840, acc_cl=58.7920, loss=1.5779] 
train E010: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=42.2060, acc_cl=60.8740, loss=1.5486] 
train E011: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=43.4280, acc_cl=61.2060, loss=1.5321] 
train E012: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=43.9800, acc_cl=62.5380, loss=1.4580] 
train E013: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=45.0140, acc_cl=62.8660, loss=1.4623] 
train E014: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=45.5440, acc_cl=63.6520, loss=1.2016] 
train E015: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=45.8320, acc_cl=64.0760, loss=1.4724] 
train E016: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=46.7220, acc_cl=65.0180, loss=1.4473] 
train E017: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=47.0440, acc_cl=65.7060, loss=1.3716] 
train E018: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=47.5380, acc_cl=65.5000, loss=1.6023] 
train E019: 100% 391/391 [03:22<00:00,  1.94it/s, acc_adv=48.3940, acc_cl=65.3140, loss=1.3627] 
train E020: 100% 391/391 [03:22<00:00,  1.93it/s, acc_adv=49.7800, acc_cl=65.8700, loss=1.3082] 
train E021: 100% 391/391 [03:21<00:00,  1.94it/s, acc_adv=50.1800, acc_cl=65.4220, loss=1.2542] 
train E022: 100% 391/391 [03:37<00:00,  1.80it/s, acc_adv=50.7860, acc_cl=64.0980, loss=1.0322] 
train E023: 100% 391/391 [03:36<00:00,  1.81it/s, acc_adv=51.6840, acc_cl=64.7940, loss=1.0864] 
train E024: 100% 391/391 [03:41<00:00,  1.76it/s, acc_adv=51.9820, acc_cl=65.0020, loss=1.2308] 
train E025: 100% 391/391 [03:31<00:00,  1.84it/s, acc_adv=52.9980, acc_cl=64.7800, loss=1.2453] 
train E026: 100% 391/391 [03:31<00:00,  1.85it/s, acc_adv=53.0580, acc_cl=63.5320, loss=1.3692] 
train E027: 100% 391/391 [03:28<00:00,  1.88it/s, acc_adv=52.2980, acc_cl=65.0500, loss=0.9834] 
train E028:  34% 134/391 [01:09<02:13,  1.93it/s, acc_adv=54.4834, acc_cl=62.6399, loss=1.0394

I saved some images before and after data augmentation:

images

The first two rows are images before data augmentation. Rows 3-4 are images from the second version of the code.

images_dir

These images are from the first version of the code.

So, why do the two ways of applying data augmentation give different results? There may be a bug, but I cannot find it in the source code.

Versions

cudatoolkit 10.1.243 h6bb024c_0
cudnn 7.6.5 cuda10.1_0
numpy 1.19.5 pypi_0 pypi
opencv-python 4.5.5.64 pypi_0 pypi
pillow 8.4.0 py37h5aabda8_0
python 3.7.1 h0371630_7
torch 1.10.2 pypi_0 pypi
torchvision 0.11.3 pypi_0 pypi

YosuaMichael commented 2 years ago

Hi @GuanlinLee, from the provided scripts I notice that the transform operations are not exactly the same. Specifically, the second script has an additional torchvision.transforms.Resize((32,32)) at the beginning of the training pipeline compared to the first script. Could you try again with the second script updated to something like:

...
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms.ToTensor())
...

GuanlinLee commented 2 years ago

Hi, I think the cause of this phenomenon is that in data_aug() the same transformation is applied to all inputs in the batch, whereas with the dataloader the transformation is applied to each instance individually. So, I think transformations on batched tensors should behave consistently, i.e., the transformation should be randomized for each instance.

YosuaMichael commented 2 years ago

Hi @GuanlinLee, I think they are not exactly the same transformation. The first script does the following:

transform=transforms.Compose([transforms.RandomCrop(32, padding=4),
                              transforms.RandomHorizontalFlip(),
                              transforms.ToTensor(),
                              ])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

The transforms applied to the training image: RandomCrop -> RandomHorizontalFlip -> ToTensor

While on the second script:

transform_test=transforms.Compose([torchvision.transforms.Resize((32,32)),
                                   transforms.ToTensor(),
                                   ])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_test)

...

def data_aug(image):
    image = transforms.RandomCrop(32, padding=4).forward(image)
    image = transforms.RandomHorizontalFlip().forward(image)
    return image

...

x_train, y_train = data_aug(input).cuda(), target.cuda()

The transforms applied to the training image: Resize -> ToTensor -> RandomCrop -> RandomHorizontalFlip.

To summarize, the transforms applied to the training image are:

[1st script] RandomCrop -> RandomHorizontalFlip -> ToTensor
[2nd script] Resize -> ToTensor -> RandomCrop -> RandomHorizontalFlip

We have an extra Resize in the second script, and my guess is that this is why they differ.

GuanlinLee commented 2 years ago

Hi @YosuaMichael, I know what you mean. However, I have already tested it. Even if I only use ToTensor followed by RandomCrop and RandomHorizontalFlip, the results are the same as with the 2nd script.

YosuaMichael commented 2 years ago

I see, so even if you remove Resize from the second script, the resulting accuracy stays the same as before? In that case I am not sure what the root cause is, and it may need further investigation. I will try to reproduce your result first and let you know if I find something.

YosuaMichael commented 2 years ago

Hi @GuanlinLee, I tried to reproduce the problem using your code. However, it seems the script you provided cannot be run.

I encountered several errors, such as undefined variables and import errors.

Could you help me by providing a minimal sample script that runs without errors? For instance, I think the PGD function can be removed for the minimal example.

GuanlinLee commented 2 years ago

Hi @YosuaMichael, I have uploaded my code to GitHub. You can find it at https://github.com/GuanlinLee/PGD_Demo. If you run into any problems when running it, please let me know. Thanks!

datumbox commented 2 years ago

@GuanlinLee Your repo is several hundred lines long, which is far too long for us to investigate. This is why Yosua asked you to provide a minimal example that reproduces the problem. Ideally this should be only a few lines of code that clearly reproduce the issue. Without this, we won't be able to help, I'm afraid.

GuanlinLee commented 2 years ago

@datumbox The bug happens during the training process, so I need to provide the full training and evaluation code for you to check and reproduce it. Also, I find that the bug only happens when using adversarial training, as stated in the issue.

datumbox commented 2 years ago

@GuanlinLee I understand. The problem is that if the code that reproduces the issue is 800 lines long, it's going to be very hard for us to review and debug. I appreciate that the problem is complex because it involves training models, but to make your issue more actionable it will help if you debug further and provide a minimal example.

GuanlinLee commented 2 years ago

Hi @datumbox, I have just modified my repo. Now the code is only a little over 100 lines long. The resnet model is the official one, adapted for CIFAR-10, so the kernel size of the first conv layer is 3 instead of 7. I hope the current version helps you debug.

YosuaMichael commented 2 years ago

Thanks for the changes, I will try it out and see if I can reproduce the differences that you specify.

YosuaMichael commented 2 years ago

Hi @GuanlinLee, I think the difference in training accuracy is caused by randomness. When I run with --aug=0 and --aug=1 they do produce different results, but the difference is not big (roughly similar to running the same aug twice).

Then I made sure that both transform methods produce the same output using the following script:

import random

import torch
import torchvision.transforms as transforms
from torchvision import datasets

def set_seed(seed=0):
    torch.manual_seed(seed)
    random.seed(seed)

# Method 1: augment the PIL image, then convert it to a tensor
transform_1 = transforms.Compose([transforms.RandomCrop(32, padding=4),
                                  transforms.RandomHorizontalFlip(),
                                  transforms.ToTensor()])

# Method 2: convert to a tensor first, then augment the tensor with data_aug
transform_2 = transforms.Compose([transforms.ToTensor()])

# The transforms must be defined before the datasets that use them
trainset_1 = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_1)
trainset_2 = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_2)

def data_aug(image):
    image = transforms.RandomCrop(32, padding=4)(image)
    image = transforms.RandomHorizontalFlip()(image)
    return image

def compare_data(n):
    # Get data using method 1
    set_seed()
    x_1, y_1 = trainset_1[n]

    # Get data using method 2, with the same seed
    set_seed()
    x_2, y_2 = trainset_2[n]
    x_2 = data_aug(x_2)

    return torch.allclose(x_1, x_2)

is_all_true = True
for i in range(len(trainset_1)):
    if not compare_data(i):
        print(f"[n={i}] return False!")
        is_all_true = False
        break

if is_all_true:
    print("All data is the same!")

And indeed this script prints "All data is the same!", which means that method 1 and method 2 produce very close results for a single image, as we expect.

GuanlinLee commented 2 years ago

@YosuaMichael Thanks for your verification. Have you tried running for more training epochs? If possible, please show me the training accuracy on both adversarial examples and clean data. The difference between aug_1 and aug_2 grows much larger as the number of training epochs increases. However, I do not know the reason. I ran my experiments with the same random seed.

YosuaMichael commented 2 years ago

Hi @GuanlinLee, I have experimented by running each variant multiple times, trying to make it as deterministic as possible (setting the seed in multiple places), and indeed the second method consistently has around 2-4% higher accuracy after 100 epochs.
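The exact seeding used in these runs is not shown in the thread. A minimal sketch of what seeding in multiple places could look like, assuming a hypothetical helper set_seed and that only Python's random module and PyTorch are sources of randomness:

```python
import random

import torch


def set_seed(seed=0):
    # Hypothetical helper: seed the RNGs a PyTorch training script commonly uses.
    random.seed(seed)
    torch.manual_seed(seed)
    # Has no effect on machines without CUDA.
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic algorithms and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# The same seed should reproduce the same random draws.
set_seed(0)
first = torch.rand(3)
set_seed(0)
second = torch.rand(3)
print("draws repeat after reseeding:", torch.equal(first, second))
```

Even with such a helper, some GPU operations can remain non-deterministic, so small run-to-run differences may persist.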

After more investigation and thinking, I think I know why it is different. In the second method you apply the augmentation (RandomCrop + HorizontalFlip) at the batch level, so all images in the same batch get exactly the same random augmentation. In the case of RandomCrop, they are all cropped at the same location.

The first method, on the other hand, applies the augmentation at the image level, so every image in the batch gets different augmentation randomness. In the case of RandomCrop, each image in the batch is cropped at a different location.

I am not really sure why the second method yields higher training accuracy. My current hypothesis is that the data is somewhat easier (less augmentation randomness) and hence easier to converge on. However, I suspect that if you measure the test accuracy in the long run, this might not be the case.

Overall I think this is not a bug in torchvision, but rather unexpected behaviour of the implementation. I will close the issue now, since I think we know the cause of the problem and it is not a bug in torchvision. However, feel free to reopen it if you think there is a different explanation.

pytorch / vision, issue #6190
