intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance on Intel platform
Apache License 2.0

Training loss does not improve when running the cifar10 sample #537

Open contryboy opened 6 months ago

contryboy commented 6 months ago

Describe the issue

I installed the latest version of the oneAPI Base Toolkit and the Python packages and tried the following:

  1. I ran the example code [1]; it runs without errors. However, the loss does not improve after several iterations and stalls around 2.3 (see the note after this list).
  2. I also tried the example described in [2] and observed a similar issue: the code runs fast, but the accuracy plateaus around 0.18. When I commented on that post, the author replied that he had also observed similar issues with the newer versions.
  3. I tried another CNN-based model which trains fine on an NVIDIA P100, but observed the same issue (runs fast, but the training loss does not improve) on Intel Arc.
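
(Editor's note, not part of the original report: a cross-entropy loss of roughly 2.3 on CIFAR-10 is exactly what chance-level predictions over 10 classes produce, since -ln(1/10) ≈ 2.303, so a loss stuck there suggests the model is not learning at all. A quick check:)

import math

# Cross-entropy loss of a classifier that assigns uniform probability to all 10 classes
num_classes = 10
chance_loss = -math.log(1.0 / num_classes)
print(f"Chance-level loss for {num_classes} classes: {chance_loss:.3f}")  # ~2.303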

Could you help take a look at the issue and see if you can reproduce at least the first case?

Hardware: Intel Arc A770 16 GB, Intel i5, 16 GB RAM.

Software: Ubuntu 22.04, intel_extension_for_pytorch-2.1.10+xpu, torch-2.1.0a0, torchvision-0.16.0a0

Thanks in advance!

[1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32
[2] https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/

contryboy commented 6 months ago

I did the following further tests with the cifar10 notebook for comparison:

  1. I modified the notebook to run on the CPU of the same machine: the training loss ended up around 1.8 (a sketch of the device toggle used for these comparisons follows this list).
  2. I modified the notebook to run on xpu on the same machine without the optimization step (ipex.optimize): the training loss ended up around 2.3 (same problem as the optimized version).
  3. I ran the notebook on xpu on the same machine for 2 epochs: the training loss stays around 2.3 and does not improve further.
  4. I modified the notebook to run on a P100 GPU on Kaggle: the training loss ended up around 1.8 (consistent with the CPU run).
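
(Editor's note: a minimal sketch of the kind of device/optimization toggle behind these comparisons, assuming the standard IPEX CIFAR-10 tutorial loop; the use_xpu and apply_ipex flags are illustrative, not from the original notebook.)

import torch
import torchvision
import intel_extension_for_pytorch as ipex

# Illustrative switches (hypothetical names, not from the original notebook)
use_xpu = True        # False -> run the same loop on the CPU for comparison
apply_ipex = True     # False -> skip the ipex.optimize() step

device = torch.device("xpu" if use_xpu else "cpu")

model = torchvision.models.resnet50().to(device)
model.train()
criterion = torch.nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

if apply_ipex and use_xpu:
    # IPEX optimization is only applied on the XPU runs
    model, optimizer = ipex.optimize(model, optimizer=optimizer)

# ...the rest of the training loop stays unchanged apart from moving
# each batch to `device` before the forward pass.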
vishnumadhu365 commented 5 months ago

@contryboy I tested the ipex training sample [1] with intel-extension-for-pytorch 2.1.20+xpu and found the loss decreasing to ~1.4 over 5 epochs. Will share more updates if I get to run the cj-mills notebook [2].

System:
oneAPI Base Toolkit - 2024.1.0
intel-extension-for-pytorch - 2.1.20+xpu
Python - 3.10
GPU Driver - https://dgpu-docs.intel.com/releases/LTS_803.29_20240131.html

import torch
import torchvision
import time

############# code changes ###############
import intel_extension_for_pytorch as ipex
############# code changes ###############

LR = 0.001
DOWNLOAD = True
DATA = "datasets/cifar10/"
device = torch.device('xpu')

transform = torchvision.transforms.Compose(
    [
        torchvision.transforms.Resize((224, 224)),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]
)
train_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=True,
    transform=transform,
    download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
model.train()
######################## code changes #######################
model = model.to(device)
criterion = criterion.to(device)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
######################## code changes #######################

num_epoch = 5
running_loss = 0.0
loss_print_batch = 100

start_time = time.time()
for epoch in range(num_epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        ########## code changes ##########
        data = data.to(device)
        target = target.to(device)
        ########## code changes ##########
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        #print(batch_idx)
        # print statistics
        running_loss += loss.item()
        if batch_idx % loss_print_batch == 0:    
            print(f'[{epoch + 1}, {batch_idx + 1:5d}] loss: {running_loss / loss_print_batch:.3f}')
            running_loss = 0.0

print(f"Time to train : {round(time.time()-start_time,2)} seconds")

torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pth",
)

print("Execution finished")
contryboy commented 5 months ago

Hi @vishnumadhu365 , thanks for your effort. Unfortunately, I am not able to try your code to reproduce it again; I have switched to another graphics card...

vishnumadhu365 commented 4 months ago

@contryboy no worries, feel free to reach out if you still face issues.

TheMrCodes commented 4 months ago

Hi there, I ran into a weird but similar issue using an Arc A770 and ipex version 2.1.30+xpu.

[image: training accuracy comparison chart]

The two runs with the highest accuracy were done on my CPU (Intel i5-13500T), the middle ones (dark blue and green) used the same setup but ran on the Arc GPU, and the lowest was also on Arc but with an eval step.

I don't know why, but one of the functions torch.no_grad() or model.eval() is tanking my performance (probably model.eval(), as stated in Issue #40). Currently working on a minimal reproducible code example.
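
(Editor's note: not from the original comment, but one possible shape for such a minimal repro: run the same fixed batch through the model in train mode, in eval mode, and under torch.no_grad(), on both CPU and XPU, and compare the outputs. An XPU-only anomaly in the eval-mode outputs would point at model.eval(), i.e. the BatchNorm/Dropout switch, rather than autograd.)

import torch
import torchvision
import intel_extension_for_pytorch as ipex  # noqa: F401  # import registers the "xpu" device

def eval_mode_check(device_name):
    # Compare outputs of one fixed batch in train mode, eval mode, and eval mode under no_grad
    torch.manual_seed(0)
    device = torch.device(device_name)
    model = torchvision.models.resnet50().to(device)
    x = torch.randn(4, 3, 224, 224, device=device)

    model.train()
    out_train = model(x).detach()

    model.eval()
    out_eval = model(x).detach()
    with torch.no_grad():
        out_nograd = model(x)

    print(device_name,
          "| train vs eval max abs diff:", (out_train - out_eval).abs().max().item(),
          "| eval vs no_grad max abs diff:", (out_eval - out_nograd).abs().max().item())

eval_mode_check("cpu")
eval_mode_check("xpu")  # eval vs no_grad should match; an XPU-only discrepancy is the interesting signal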

huiyan2021 commented 3 days ago

Hi @TheMrCodes , could you try ipex 2.1.40+xpu?