pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Quickstart notebook fails to train properly with ROCm #100795

Open icefairy64 opened 1 year ago

icefairy64 commented 1 year ago

🐛 Describe the bug

When running the Quickstart notebook with ROCm on a Radeon RX 6900 XT under Ubuntu Server 22.04, I get 0% accuracy, whereas switching to the CPU gives the expected ~45%.

Here is a non-notebook reproducer I used:

from typing import Any
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.nn import CrossEntropyLoss
from torch.optim import SGD

if not torch.cuda.is_available():
    print("No CUDA device available")
    exit(-1)

device = "cuda"
print(torch.cuda.get_device_properties(device))

# Download training data from open datasets.
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)

# Download test data from open datasets.
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor(),
)

batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    print(f"Shape of X [N, C, H, W]: {X.shape}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train(dataloader: DataLoader[Any], model: NeuralNetwork, loss_fn: CrossEntropyLoss, optimizer: SGD):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)

print("Done!")

Here is the output for CUDA run:

_CudaDeviceProperties(name='AMD Radeon RX 6900 XT', major=10, minor=3, total_memory=16368MB, multi_processor_count=40)
Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
Epoch 1
-------------------------------
loss: 0.022123  [   64/60000]
loss: 0.000000  [ 6464/60000]
loss: 1.000000  [12864/60000]
loss: 1.000000  [19264/60000]
loss: 0.000000  [25664/60000]
loss: 0.000000  [32064/60000]
loss: 0.000000  [38464/60000]
loss: 0.000000  [44864/60000]
loss: 0.000000  [51264/60000]
loss: 0.249406  [57664/60000]
Test Error: 
 Accuracy: 0.0%, Avg loss: -0.438122 

Done!

And here is the output when I switch to CPU (excluding CUDA device logging):

Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
Epoch 1
-------------------------------
loss: 2.296403  [   64/60000]
loss: 2.294028  [ 6464/60000]
loss: 2.271100  [12864/60000]
loss: 2.277084  [19264/60000]
loss: 2.251947  [25664/60000]
loss: 2.234447  [32064/60000]
loss: 2.227726  [38464/60000]
loss: 2.205537  [44864/60000]
loss: 2.204656  [51264/60000]
loss: 2.163422  [57664/60000]
Test Error: 
 Accuracy: 45.8%, Avg loss: 2.168922 

Done!
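
The collapsed loss values above suggest the GPU forward pass itself might be returning wrong numbers. Here is a minimal sketch (the 1e-3 tolerance is arbitrary) that compares a single nn.Linear forward on the ROCm ("cuda") device against the same computation on the CPU:

import torch
from torch import nn

# Compare one Linear forward pass on CPU vs. the ROCm ("cuda") device.
torch.manual_seed(0)
layer = nn.Linear(28 * 28, 512)
x = torch.randn(64, 28 * 28)

ref = layer(x)                              # CPU reference
out = layer.to("cuda")(x.to("cuda")).cpu()  # same weights, run on the GPU

max_err = (ref - out).abs().max().item()
print(f"max abs difference CPU vs GPU: {max_err:.6f}")
if max_err > 1e-3:                          # arbitrary tolerance
    print("GPU forward pass diverges from CPU")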

Versions

Collecting environment information...
PyTorch version: 2.1.0.dev20230502+rocm5.4.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.4.22803-474e8620

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-71-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon RX 6900 XT
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.4.22803
MIOpen runtime version: 2.19.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          2
On-line CPU(s) list:             0,1
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 9 5900X 12-Core Processor
CPU family:                      25
Model:                           33
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
Stepping:                        0
BogoMIPS:                        7399.99
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization:                  AMD-V
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB (2 instances)
L1i cache:                       128 KiB (2 instances)
L2 cache:                        1 MiB (2 instances)
L3 cache:                        32 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] open-clip-torch==2.19.0
[pip3] pytorch-lightning==2.0.2
[pip3] torch==2.1.0.dev20230502+rocm5.4.2
[pip3] torchaudio==2.1.0.dev20230504+rocm5.4.2
[pip3] torchdiffeq==0.2.3
[pip3] torchmetrics==1.0.0rc0
[pip3] torchsde==0.2.5
[pip3] torchvision==0.16.0.dev20230504+rocm5.4.2
[conda] Could not collect

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport

YellowRoseCx commented 1 year ago

I've been having issues with PyTorch 2 + ROCm as well, yet I have no issues with PyTorch 1.13.1 + ROCm.

Just out of curiosity, does using PyTorch 1.13.1+rocm have the same issue, or does it work as intended?

pip install torch==1.13.1 --index-url https://download.pytorch.org/whl/rocm5.2 --upgrade
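
After switching wheels, something like this can confirm at runtime which build is actually loaded (just a quick sketch; torch.version.hip is None on non-ROCm builds):

import torch

# Confirm which wheel / runtime is actually in use.
print(torch.__version__)           # e.g. 1.13.1+rocm5.2
print(torch.version.hip)           # HIP version string on ROCm builds, None otherwise
print(torch.cuda.is_available())   # ROCm devices are exposed through the CUDA API
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))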

icefairy64 commented 1 year ago

@YellowRoseCx - seems to work as expected:

_CudaDeviceProperties(name='AMD Radeon RX 6900 XT', major=10, minor=3, total_memory=16368MB, multi_processor_count=40)
Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64]) torch.int64
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)
Epoch 1
-------------------------------
loss: 2.309916  [   64/60000]
loss: 2.295138  [ 6464/60000]
loss: 2.269454  [12864/60000]
loss: 2.264211  [19264/60000]
loss: 2.260758  [25664/60000]
loss: 2.231906  [32064/60000]
loss: 2.242283  [38464/60000]
loss: 2.212953  [44864/60000]
loss: 2.200381  [51264/60000]
loss: 2.183855  [57664/60000]
Test Error: 
 Accuracy: 37.4%, Avg loss: 2.171335 

Done!
Collecting environment information...
PyTorch version: 1.13.1+rocm5.2
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.2.21151-afdc89f8

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-71-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon RX 6900 XT
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.2.21151
MIOpen runtime version: 2.17.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          2
On-line CPU(s) list:             0,1
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 9 5900X 12-Core Processor
CPU family:                      25
Model:                           33
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
Stepping:                        0
BogoMIPS:                        7399.99
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
Virtualization:                  AMD-V
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB (2 instances)
L1i cache:                       128 KiB (2 instances)
L2 cache:                        1 MiB (2 instances)
L3 cache:                        32 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==1.13.1+rocm5.2
[pip3] torchaudio==0.13.1+rocm5.2
[pip3] torchvision==0.14.1+rocm5.2
[conda] Could not collect