contryboy opened this issue 6 months ago
I further did the following testing with the cifar10 notebook for comparison:
@contryboy I tested the ipex sample training script [1] with intel-extension-for-pytorch 2.1.20+xpu and found the loss decreasing to ~1.4 over 5 epochs. I will share more updates if I get to run the cj-mills notebook [2].
System:
- oneAPI Base Toolkit: 2024.1.0
- intel-extension-for-pytorch: 2.1.20+xpu
- Python: 3.10
- GPU driver: https://dgpu-docs.intel.com/releases/LTS_803.29_20240131.html
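Before running the script below, it may be worth a quick sanity check that the A770 is visible to ipex at all (a minimal sketch; it assumes an xpu build, where the `torch.xpu` namespace becomes available after importing intel_extension_for_pytorch):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

# Confirm the GPU is detected before attempting to train on it.
print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("Device:", torch.xpu.get_device_name(0))
```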
```python
import torch
import torchvision
import time

############# code changes ###############
import intel_extension_for_pytorch as ipex
############# code changes ###############

LR = 0.001
DOWNLOAD = True
DATA = "datasets/cifar10/"

device = torch.device("xpu")

transform = torchvision.transforms.Compose(
    [
        torchvision.transforms.Resize((224, 224)),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ]
)
train_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=True,
    transform=transform,
    download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
model.train()

######################## code changes #######################
model = model.to(device)
criterion = criterion.to(device)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
######################## code changes #######################

num_epoch = 5
running_loss = 0.0
loss_print_batch = 100
start_time = time.time()

for epoch in range(num_epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        ########## code changes ##########
        data = data.to(device)
        target = target.to(device)
        ########## code changes ##########
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # print the running loss every `loss_print_batch` batches
        running_loss += loss.item()
        if batch_idx % loss_print_batch == 0:
            print(f"[{epoch + 1}, {batch_idx + 1:5d}] loss: {running_loss / loss_print_batch:.3f}")
            running_loss = 0.0

print(f"Time to train : {round(time.time() - start_time, 2)} seconds")

torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pth",
)
print("Execution finished")
```
Hi @vishnumadhu365, thanks for your effort. Unfortunately I am not able to run your code to reproduce it again; I have switched to another graphics card...
@contryboy no worries, feel free to reach out if you still face issues.
Hi there, I ran into a weird but similar issue using an Arc A770 and ipex version 2.1.30+xpu. The two runs with the highest accuracy were done on my CPU (Intel i5-13500T), the middle ones (dark blue and green) used the same setup but ran on the Arc GPU, and the lowest also ran on Arc but with an eval step.
I don't know why, but one of the functions `torch.no_grad()` or `model.eval()` is tanking my performance (probably `model.eval()`, as stated in Issue#40).
I am currently working on a minimal reproducible code example.
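In the meantime, here is a rough sketch of the kind of eval step that seems to trigger the slowdown, in case it helps narrow things down (`evaluate` and `val_loader` are illustrative placeholders, not code from the actual notebook):

```python
import torch

def evaluate(model, val_loader, device):
    # Switching to eval mode (and wrapping inference in no_grad) is the
    # step suspected of degrading subsequent training throughput on Arc.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            preds = model(data).argmax(dim=1)
            correct += (preds == target).sum().item()
            total += target.size(0)
    model.train()  # restore training mode before the next epoch
    return correct / total
```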
Hi @TheMrCodes, could you try ipex 2.1.40+xpu?
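After upgrading, a quick runtime check confirms the new wheel is actually the one being imported (plain version introspection, nothing specific to 2.1.40):

```python
import torch
import intel_extension_for_pytorch as ipex

# Both should report matching xpu builds after the upgrade,
# e.g. torch 2.1.0a0 alongside ipex 2.1.40+xpu.
print("torch:", torch.__version__)
print("ipex :", ipex.__version__)
```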
Describe the issue
I installed the latest version of the oneAPI Base Toolkit and the Python packages and tried the following:
Could you help take a look at the issue and see if you can reproduce at least the first case?
Hardware: Intel Arc A770 16G, Intel i5, 16 GB RAM.
Software: Ubuntu 22.04, intel_extension_for_pytorch-2.1.10+xpu, torch-2.1.0a0, torchvision-0.16.0a0
Thanks in advance!
[1] https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html#float32
[2] https://christianjmills.com/posts/intel-pytorch-extension-tutorial/native-ubuntu/