aayest commented 6 months ago

🐛 Describe the bug

The enclosed Python code (at the end) defines and executes a simplified training loop for a Faster R-CNN model with a ResNet-50 FPN backbone, using the PyTorch and torchvision libraries. It's designed for object detection tasks, where the goal is to both classify and determine the bounding box coordinates of objects within images. Here's a detailed breakdown of its components:

Imports: Essential libraries and modules are imported, including PyTorch, torchvision (for model and transforms), pandas (for handling CSV data), PIL (for image processing), numpy, and albumentations (for image augmentation).

SimpleDataset Class: A custom dataset class inheriting from torch.utils.data.Dataset. It's initialized with a dataframe (containing image IDs and bounding box coordinates) and an image directory path.

__getitem__: Retrieves an image by its index, loads it, scales the pixel values to [0, 1], and returns the image tensor along with its corresponding target (bounding boxes and labels).

__len__: Returns the total number of images in the dataset.

Transformations: The get_transform function returns an albumentations composition of transformations including resizing to 1024x1024 pixels, normalization, and conversion to a PyTorch tensor. It is defined in the listing but not called by the dataset.

Data Loading: The load_data function reads a CSV file into a pandas DataFrame and initializes the SimpleDataset class with it. This dataset is then wrapped in a DataLoader for batch processing during training.
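Because the dataset passes the CSV coordinates to the model unchanged, it can help to rule out malformed annotations before suspecting the backend. The following is only a minimal sketch and not part of the original script: `is_valid_box` and `filter_valid_boxes` are hypothetical helper names, and whether missing or degenerate boxes play any role in the MPS failure reported here is an untested assumption.

```python
from typing import List, Sequence


def is_valid_box(box: Sequence[float]) -> bool:
    """Return True if [xmin, ymin, xmax, ymax] has positive width and height."""
    if len(box) != 4:
        return False
    xmin, ymin, xmax, ymax = box
    # Comparisons with NaN are False, so NaN coordinates are rejected too.
    return xmax > xmin and ymax > ymin


def filter_valid_boxes(boxes: Sequence[Sequence[float]]) -> List[List[float]]:
    """Keep only well-formed boxes; an empty result means the image has no usable target."""
    return [list(b) for b in boxes if is_valid_box(b)]


if __name__ == "__main__":
    # Hypothetical rows in the same column order as the CSV: xmin, ymin, xmax, ymax.
    rows = [[10, 20, 110, 220], [50, 50, 50, 80], [5, 5, 1, 9]]
    print(filter_valid_boxes(rows))
```

Inside `__getitem__`, a check like this could run on `boxes` before the target dictionary is built, so that images with no usable box are at least logged.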

Model Preparation:

Device selection based on CUDA availability (GPU acceleration), with a fallback to MPS (Metal Performance Shaders) for Mac GPUs. Instantiation of a pre-trained Faster R-CNN model with a ResNet-50 FPN backbone. Replacement of the model's box predictor so that it outputs the correct number of classes (num_classes includes the background as a class). Transfer of the model to the selected device.

Optimizer: Setup of the SGD (Stochastic Gradient Descent) optimizer with specified learning rate, momentum, and weight decay, only optimizing parameters that require gradients.
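The one-line device expression used in the script chooses between exactly two backends and never reaches a third. A slightly more defensive variant, sketched here under the assumption that CPU should be the last resort, tries CUDA, then MPS, then CPU. `choose_device_name` is a hypothetical helper, not something from the script or from PyTorch.

```python
def choose_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Pick the first usable backend in the order CUDA, MPS, CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # lets the sketch run on machines without PyTorch

    if torch is not None:
        name = choose_device_name(
            torch.cuda.is_available(),
            torch.backends.mps.is_available(),
        )
    else:
        name = choose_device_name(False, False)
    print(f"Using device: {name}")
```

The returned string can be passed directly to `torch.device(...)`. Keeping the decision in a plain function makes it easy to force "cpu" while debugging a backend-specific failure.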

Training Loop:

The model is set to training mode. For each epoch, images and their targets are loaded in batches from the DataLoader. Each batch of images is processed by the model to compute the loss (combining classification and bounding box regression losses). Backpropagation is performed to update the model's weights. After each epoch, the loss of the last batch is printed.

Conclusion: After training for the specified number of epochs, a completion message is printed.

The code works when the setting is device = torch.device("cuda" if torch.cuda.is_available() else "cpu") and produces the following output:

    /Users/aa/PycharmProjects/GPUDebug/.venv/bin/python /Users/aa/PycharmProjects/GPUDebug/SmallTest.py
    Epoch #1 loss: 1.7287453413009644
    Epoch #2 loss: 0.9680976271629333
    Epoch #3 loss: 1.3803094625473022
    Training completed.

    Process finished with exit code 0

When I attempt to use device = torch.device("cuda" if torch.cuda.is_available() else "mps"), I get this error:

    /Users/aa/PycharmProjects/GPUDebug/.venv/bin/python /Users/aa/PycharmProjects/GPUDebug/SmallTest.py
    Traceback (most recent call last):
      File "/Users/aa/PycharmProjects/GPUDebug/SmallTest.py", line 82, in <module>
        loss_dict = model(images, targets)
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
        return forward_call(*args, **kwargs)
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torchvision/models/detection/generalized_rcnn.py", line 104, in forward
        proposals, proposal_losses = self.rpn(images, features, targets)
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
        return forward_call(*args, **kwargs)
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torchvision/models/detection/rpn.py", line 380, in forward
        loss_objectness, loss_rpn_box_reg = self.compute_loss(
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torchvision/models/detection/rpn.py", line 324, in compute_loss
        box_loss = F.smooth_l1_loss(
      File "/Users/aa/PycharmProjects/GPUDebug/.venv/lib/python3.9/site-packages/torch/nn/functional.py", line 3243, in smooth_l1_loss
        return torch._C._nn.smooth_l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction), beta)
    RuntimeError: [srcBuf length] > 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm":341, please report a bug to PyTorch. Placeholder tensor is empty!

    Process finished with exit code 1

Also note that the following utility code validates the environment and the ability to use MPS:

    import torch
    import numpy as np
    import pandas as pd
    import sklearn
    import matplotlib.pyplot as plt

    print(f"PyTorch version: {torch.__version__}")

    # Check PyTorch has access to MPS (Metal Performance Shader, Apple's GPU architecture)
    print(f"Is MPS (Metal Performance Shader) built? {torch.backends.mps.is_built()}")
    print(f"Is MPS available? {torch.backends.mps.is_available()}")

    # Set the device
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Using device: {device}")

    import torch

    # Set the device
    device = "mps" if torch.backends.mps.is_available() else "cpu"

    # Create data and send it to the device
    x = torch.rand(size=(3, 4)).to(device)
    print(x)

See the output:

    /Users/aa/PycharmProjects/GPUDebug/.venv/bin/python /Users/aa/PycharmProjects/GPUDebug/gpsvalidaton.py
    PyTorch version: 2.2.2
    Is MPS (Metal Performance Shader) built? True
    Is MPS available? True
    Using device: mps
    tensor([[0.6448, 0.0164, 0.6086, 0.1586],
            [0.2657, 0.9320, 0.5535, 0.8099],
            [0.6838, 0.2508, 0.8703, 0.8703]], device='mps:0')

    Process finished with exit code 0

Actual Code:

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor, FasterRCNN_ResNet50_FPN_Weights

    from torchvision.transforms import functional as F
    from torch.utils.data import DataLoader, Dataset
    import os
    import pandas as pd
    from PIL import Image
    import numpy as np
    from albumentations import Compose, Normalize, Resize
    from albumentations.pytorch import ToTensorV2


    class SimpleDataset(Dataset):
        def __init__(self, dataframe, image_dir):
            self.image_ids = dataframe['img_id'].unique()
            self.df = dataframe
            self.image_dir = image_dir

        def __getitem__(self, idx):
            img_id = self.image_ids[idx]
            # All annotation rows that belong to this image
            records = self.df[self.df['img_id'] == img_id]

            # Updated line: Format the image filename with leading zeros
            img_path = os.path.join(self.image_dir, f"{img_id:08d}.jpg")

            image = Image.open(img_path).convert("RGB")
            image = np.array(image, dtype=np.float32)
            image /= 255.0  # Normalize to [0,1]

            boxes = records[['xmin', 'ymin', 'xmax', 'ymax']].values
            labels = torch.ones((records.shape[0],), dtype=torch.int64)  # Assuming all are the same class

            target = {}
            target['boxes'] = torch.as_tensor(boxes, dtype=torch.float32)
            target['labels'] = labels

            # HWC float array -> CHW tensor
            image = F.to_tensor(image)
            return image, target

        def __len__(self):
            return len(self.image_ids)


    def get_transform():
        return Compose([
            Resize(1024, 1024),
            Normalize(),
            ToTensorV2(),
        ])


    def load_data(csv_file, image_dir):
        df = pd.read_csv(csv_file)
        return SimpleDataset(df, image_dir)


    train_dataset = load_data('data/warmup_train.csv', 'data/warmup_train_images')

    train_data_loader = DataLoader(
        train_dataset,
        batch_size=4,
        shuffle=True,
        collate_fn=lambda x: tuple(zip(*x))  # keep images and targets as tuples
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "mps")

    # Using weights parameter instead of pretrained
    model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
    num_classes = 12  # Including the background
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    model.to(device)

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)

    for epoch in range(3):  # Training for 3 epochs for simplicity
        model.train()
        for images, targets in train_data_loader:
            images = list(img.to(device) for img in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())

            optimizer.zero_grad()
            losses.backward()
            optimizer.step()

        # Loss of the last batch in this epoch
        print(f"Epoch #{epoch + 1} loss: {losses.item()}")

    print("Training completed.")
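The loop above prints `losses.item()` once per epoch, which is the loss of the final batch only, so the epoch-to-epoch numbers are noisy and hard to compare across devices. A running mean over all batches is one way to get a steadier figure. The sketch below is an illustration under that assumption; `EpochLossMeter` is a hypothetical name and the sample values are made up.

```python
class EpochLossMeter:
    """Accumulate per-batch losses and report their mean for the epoch."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, batch_loss: float) -> None:
        self.total += float(batch_loss)
        self.count += 1

    @property
    def mean(self) -> float:
        # Avoid dividing by zero when no batch has been recorded yet.
        return self.total / self.count if self.count else float("nan")


if __name__ == "__main__":
    meter = EpochLossMeter()
    for batch_loss in (2.0, 1.0, 0.6):  # stand-ins for losses.item()
        meter.update(batch_loss)
    print(f"Epoch mean loss: {meter.mean:.4f}")
```

In the training script, a fresh meter would be created at the start of each epoch, updated with `losses.item()` after every batch, and printed in place of the last-batch value.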

Versions

    (.venv) (base) aa@Alejandros-MacBook-Pro GPUDebug % curl -O https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
    % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed
    100 22068 100 22068 0 0 136k 0 --:--:-- --:--:-- --:--:-- 136k
    (.venv) (base) aa@Alejandros-MacBook-Pro GPUDebug % python collect_env.py
    Collecting environment information...
    PyTorch version: 2.2.2
    Is debug build: False
    CUDA used to build PyTorch: None
    ROCM used to build PyTorch: N/A

    OS: macOS 14.4.1 (arm64)
    GCC version: Could not collect
    Clang version: 15.0.0 (clang-1500.3.9.4)
    CMake version: Could not collect
    Libc version: N/A

    Python version: 3.9.13 (v3.9.13:6de2ca5339, May 17 2022, 11:37:23) [Clang 13.0.0 (clang-1300.0.29.30)] (64-bit runtime)
    Python platform: macOS-14.4.1-arm64-arm-64bit
    Is CUDA available: False
    CUDA runtime version: No CUDA
    CUDA_MODULE_LOADING set to: N/A
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True

    CPU: Apple M2 Max

    Versions of relevant libraries:
    [pip3] numpy==1.26.4
    [pip3] torch==2.2.2
    [pip3] torchvision==0.17.2
    [conda] numpy 1.23.5 pypi_0 pypi
    [conda] numpydoc 1.5.0 py311hca03da5_0
    [conda] torch 2.1.2 pypi_0 pypi
    [conda] torchaudio 2.1.2 pypi_0 pypi
    [conda] torchvision 0.16.2 pypi_0 pypi

cc @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr

jbschlosser commented 6 months ago

Hey @aayest, it looks like the error you're seeing only happens on MPS, is that correct?

aayest commented 6 months ago

Howdy Joel,

Yes to your question. AA


ethrx commented 6 months ago

Hi. I have this issue too:

File "/Users/ethrx/Documents/Red/.env/lib/python3.10/site-packages/torch/nn/functional.py", line 1473, in relu
    result = torch.relu(input)
RuntimeError: [srcBuf length] > 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm":341, please report a bug to PyTorch. Placeholder tensor is empty!

Using PyTorch 2.2.2.

yeison commented 6 months ago

I have the same error on "mps", with the same PyTorch version, 2.2.2. Is this related to returning None instead of NaN?

RuntimeError: [srcBuf length] > 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm":341, please report a bug to PyTorch. Placeholder tensor is empty!

JuanseHevia commented 6 months ago

Just the same error here! Any ideas on how to solve this?

aayest commented 6 months ago

I don't think there is a fix yet. I had to use "cpu" instead and wait 3 days instead of a few hours for a model to train.

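Since the assert surfaces as a `RuntimeError` in the traceback above, one way to avoid discovering the failure hours into a run is to try a single training step on each candidate device first and fall back only if it raises. This is a rough sketch of that idea rather than a fix: `first_working_device` and `fake_step` are hypothetical names, and the fake step merely simulates the reported error so the example runs without a GPU.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def first_working_device(
    step_fn: Callable[[str], float],
    candidates: Iterable[str],
) -> Tuple[Optional[str], Dict[str, str]]:
    """Run step_fn once per device name; return the first that succeeds plus collected errors."""
    errors: Dict[str, str] = {}
    for name in candidates:
        try:
            step_fn(name)
            return name, errors
        except RuntimeError as exc:
            errors[name] = str(exc)
    return None, errors


if __name__ == "__main__":
    def fake_step(name: str) -> float:
        # Stand-in for one forward/backward pass of the real model on `name`.
        if name == "mps":
            raise RuntimeError("Placeholder tensor is empty!")
        return 0.5

    device, errors = first_working_device(fake_step, ["mps", "cpu"])
    print(device, errors)
```

In a real script, `step_fn` would move the model and one batch to the named device and run the forward and backward pass. A probe like this only catches errors that are raised as exceptions.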

yeison commented 6 months ago

I compiled the latest code from the master branch (2.4.0) and now the issue is resolved. Not sure if there was a fix.
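For anyone who wants to gate the MPS path on the installed version, a small parser for the version string is enough. The sketch below is only an illustration: `version_tuple` is a hypothetical helper, the `2.4.0a0+git1234` string is a made-up example of a source-build version, and the `(2, 4, 0)` threshold reflects nothing more than the single report above, not a confirmed fix.

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '2.2.2' or '2.4.0a0+git1234'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


if __name__ == "__main__":
    for v in ("2.2.2", "2.3.0", "2.4.0a0+git1234"):
        # True means the version is at least the assumed threshold.
        print(v, version_tuple(v) >= (2, 4, 0))
```

In practice the argument would be `torch.__version__`, and a result below the threshold could switch the device to "cpu".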

aayest commented 5 months ago

The latest version I can install with pip is 2.2.0 (not even 2.3, which is supposed to be released in April), and the issue is still present in that version. I am glad that it works with 2.4.0, although that version will not be official until July 2024. It would be nice to know: (1) What is the official release date (day) for 2.3? (2) Is this bug fixed in that version?


aayest commented 5 months ago

I guess this is better than before: (Using 2.3.0)

    /Users/aa/PycharmProjects/GPUDebug/.venv/bin/python /Users/aa/PycharmProjects/GPUDebug/SmallTest.py
    PyTorch version: 2.3.0
    Is MPS (Metal Performance Shader) built? True
    Is MPS available? True
    Using device: mps
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x45fd0ea50> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x45fd0ea50> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x45fd13030> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x45fc39c00> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x4b8918f40> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x4b89194c0> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Epoch #1 loss: 145.30551147460938
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x4b89c4d40> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x4b8ab0de0> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x14d7d8610> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x41b57b9d0> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x45fdff5d0> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: (null) Internal Error (0000000e:Internal Error)
    <AGXG14XFamilyCommandBuffer: 0x4b2804080> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max commandQueue = <AGXG14XFamilyCommandQueue: 0x14db9d800> label = device = <AGXG14CDevice: 0x14db6c200> name = Apple M2 Max retainedReferences = 1
    Epoch #2 loss: 48.02827835083008
    Epoch #3 loss: nan
    Training completed.

Process finished with exit code 0

At least it is not failing completely, though I am not sure whether these are really just warnings. I am going to train the large model and compare the training time and the actual usage of the GPU.

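The 2.3.0 run above exits with code 0 even though the third epoch reports a loss of nan, so the exit status alone does not show whether training was healthy. A cheap guard is to stop at the first non-finite loss. The following is a minimal sketch under that assumption; `ensure_finite` is a hypothetical helper and the step numbers in the demo are placeholders.

```python
import math


def ensure_finite(loss_value: float, epoch: int, step: int) -> float:
    """Raise if the loss is NaN or infinite so training stops at the first bad batch."""
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"Non-finite loss {loss_value} at epoch {epoch}, step {step}"
        )
    return loss_value


if __name__ == "__main__":
    print(ensure_finite(48.02827835083008, epoch=2, step=0))
    try:
        ensure_finite(float("nan"), epoch=3, step=0)
    except FloatingPointError as exc:
        print(exc)
```

In the training loop, the check would be applied to `losses.item()` before `losses.backward()`, which also makes it easier to compare at which batch the MPS and CPU runs start to diverge.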

pytorch / pytorch

Placeholder tensor is empty! #123171
